Spark Distributed SQL Engine in Practice


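I. Overview
Besides the interactive spark-sql shell, Spark SQL can also act as a distributed query engine through its JDBC/ODBC or command-line interface. In this mode, end users and applications run SQL queries against Spark SQL directly, without writing any Scala code.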
II. Using the Thrift JDBC server
Spark version: 1.4.0
YARN version: CDH5.4.0
1. Preparation: copy or symlink hive-site.xml into $SPARK_HOME/conf.
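The preparation step can be sketched as a small shell helper. The function name link_hive_site is made up for illustration, and the source path in the usage note (/etc/hive/conf/hive-site.xml) is only an assumption about where the Hive client configuration lives; substitute your own location.

```shell
# Sketch: expose an existing hive-site.xml to Spark by symlinking it into
# the Spark conf directory. Both paths are supplied by the caller.
link_hive_site() {
  local hive_site="$1"    # path to the existing hive-site.xml
  local spark_conf="$2"   # normally "$SPARK_HOME/conf"

  if [ ! -f "$hive_site" ]; then
    echo "hive-site.xml not found: $hive_site" >&2
    return 1
  fi
  mkdir -p "$spark_conf"
  # -s: create a symbolic link, -f: replace a stale link from an earlier run
  ln -sf "$hive_site" "$spark_conf/hive-site.xml"
}
```

Example call (paths are assumptions): link_hive_site /etc/hive/conf/hive-site.xml "$SPARK_HOME/conf"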
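2. Start the Hive Thrift server with the script shipped in the Spark installation directory. With no arguments it starts in local mode and runs as a single JVM process on the local machine.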
sbin/start-thriftserver.sh

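3. Start in yarn-client mode. In this setup the server listens on port 10001 (the stock default for hive.server2.thrift.port is 10000, so the 10001 used here presumably comes from the copied hive-site.xml).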
sbin/start-thriftserver.sh --master yarn
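If the listening port should not depend on whatever hive-site.xml contains, it can be passed explicitly. The sketch below only prints the command rather than running it; thriftserver_cmd is a hypothetical helper name, while --hiveconf and hive.server2.thrift.port are the option and property the start script accepts for this purpose.

```shell
# Sketch: print the start command with an explicit master and listening port.
thriftserver_cmd() {
  local master="$1"   # e.g. yarn
  local port="$2"     # e.g. 10001
  printf '%s\n' "sbin/start-thriftserver.sh --master $master --hiveconf hive.server2.thrift.port=$port"
}

thriftserver_cmd yarn 10001
```

To actually start the server, run the printed line from $SPARK_HOME.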
Next, the YARN web UI shows that 25 containers have been started.

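Why does starting a single JDBC service take up so many resources? Because conf/spark-env.sh sets SPARK_EXECUTOR_INSTANCES to 24, and one more container is needed for the YARN ApplicationMaster of the yarn-client application (the driver itself runs inside the local Thrift server process):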
export SPARK_EXECUTOR_INSTANCES=24
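Looking at the processes on the YARN NodeManager nodes, the Thrift server keeps resident processes named org.apache.spark.executor.CoarseGrainedExecutorBackend there, ready to run tasks for later SQL jobs. The benefit is that Spark SQL queries do not pay the cost of starting containers. The price is that while the Thrift server is idle, these containers stay allocated and are not released to other Spark or MapReduce jobs.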
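The container count seen in the YARN UI follows from simple arithmetic, sketched below. The helper name expected_containers is made up for illustration; it assumes a fixed number of executors and yarn-client mode, where the one extra container hosts the YARN ApplicationMaster.

```shell
# Sketch: estimate how many YARN containers a yarn-client Thrift server holds.
# One container per executor, plus one extra container for the application.
expected_containers() {
  local executors="$1"   # value of SPARK_EXECUTOR_INSTANCES
  echo $(( executors + 1 ))
}

expected_containers 24   # prints 25
```

Lowering SPARK_EXECUTOR_INSTANCES in conf/spark-env.sh before starting the server shrinks this idle footprint accordingly.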


4. Connect to the Spark SQL interactive engine with beeline
bin/beeline -u jdbc:hive2://localhost:10001 -n root -p root
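Note: on a non-secure Hadoop cluster, use the current system user as the user name; the password may be empty or any value. On a Kerberos-secured Hadoop cluster, a valid principal ticket is required to log in through beeline.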
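For scripted use, a single statement can be sent without entering the beeline prompt. The following is a minimal sketch: jdbc_url, run_query and the BEELINE override variable are hypothetical names introduced here, while -u, -n and -e are options listed in beeline's own help text.

```shell
# Sketch: run one SQL statement through beeline non-interactively.
jdbc_url() {
  # $1 = host, $2 = port of the Thrift server
  echo "jdbc:hive2://$1:$2"
}

run_query() {
  local host="$1"
  local port="$2"
  local user="$3"
  local query="$4"
  # BEELINE can be set to point at a different beeline executable
  "${BEELINE:-bin/beeline}" -u "$(jdbc_url "$host" "$port")" -n "$user" -e "$query"
}
```

Example call on a non-secure cluster, run from $SPARK_HOME: run_query localhost 10001 "$(whoami)" "show tables;"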
III. Command-line help
1. Thrift server
Run sbin/start-thriftserver.sh --help to print the full usage. The script accepts the bin/spark-submit command-line options (for example --master), plus --hiveconf property=value to set Hive properties for the Thrift server.
2. beeline
   -u <database url>               the JDBC URL to connect to
   -n <username>                   the username to connect as
   -p <password>                   the password to connect as
   -d <driver class>               the driver class to use
   -e <query>                      query that should be executed
   -f <file>                       script file that should be executed
   --hiveconf property=value       Use value for given property
   --hivevar name=value            hive variable name and value
                                   This is Hive specific settings in which variables
                                   can be set at session level and referenced in Hive
                                   commands or queries.
   --color=[true/false]            control whether color is used for display
   --showHeader=[true/false]       show column names in query results
   --headerInterval=ROWS;          the interval between which heades are displayed
   --fastConnect=[true/false]      skip building table/column list for tab-completion
   --autoCommit=[true/false]       enable/disable automatic transaction commit
   --verbose=[true/false]          show verbose error messages and debug info
   --showWarnings=[true/false]     display connection warnings
   --showNestedErrs=[true/false]   display nested errors
   --numberFormat=[pattern]        format numbers using DecimalFormat pattern
   --force=[true/false]            continue running script even after errors
   --maxWidth=MAXWIDTH             the maximum width of the terminal
   --maxColumnWidth=MAXCOLWIDTH    the maximum width to use when displaying columns
   --silent=[true/false]           be more silent
   --autosave=[true/false]         automatically save preferences
   --outputformat=[table/vertical/csv/tsv]   format mode for result display
   --isolation=LEVEL               set the transaction isolation level
   --nullemptystring=[true/false]  set to true to get historic behavior of printing null as empty string
   --help                          display this message


Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.
