【甘道夫】Spark 1.3.0 Running Spark on YARN: Highlights from the Official Documentation


Introduction: My work will soon have me embracing Spark. I studied it some time ago, and I now plan to read through parts of the official documentation for the latest release, Spark 1.3, in detail: first to review, second to catch up on the latest developments, and third to prepare material for training my team at work.
Reposting is welcome; please credit the source: http://blog.csdn.net/u010967382/article/details/45062407

Original URL: http://spark.apache.org/docs/latest/running-on-yarn.html. That document focuses on how to run Spark programs on YARN.
Let's first review the YARN architecture:

I will not describe the YARN architecture in detail here; please look it up yourself. The key thing to understand is the grey boxes in the diagram: the Containers, which are the units of resource allocation in YARN. An application, such as the MR application in the diagram, consists of one MR Application Master responsible for overall scheduling and multiple Map/Reduce Tasks that carry out the actual work.
Spark has supported running on YARN since version 0.6.0, and that support has been continuously improved in subsequent releases.
Running Spark on YARN requires a Spark distribution built with YARN support. You can download one from the official site (http://spark.apache.org/downloads.html) or build it yourself (http://spark.apache.org/docs/latest/building-spark.html).
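If you build it yourself, the command below is a minimal sketch of a Maven build with the YARN profile enabled; the -Phadoop-2.4 profile and the hadoop.version value are assumptions and should be matched to your Hadoop cluster:
$ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package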
As for configuring Spark on YARN, the basic configuration is the same as for the other deployment modes, plus a number of YARN-specific parameters listed in the table below (kept in the original English); consult it as needed:
Property Name | Default | Meaning
spark.yarn.am.memory | 512m | Amount of memory to use for the YARN Application Master in client mode, in the same format as JVM memory strings (e.g. 512m, 2g). In cluster mode, use spark.driver.memory instead.
spark.driver.cores | 1 | Number of cores used by the driver in YARN cluster mode. Since the driver is run in the same JVM as the YARN Application Master in cluster mode, this also controls the cores used by the YARN AM. In client mode, use spark.yarn.am.cores to control the number of cores used by the YARN AM instead.
spark.yarn.am.cores | 1 | Number of cores to use for the YARN Application Master in client mode. In cluster mode, use spark.driver.cores instead.
spark.yarn.am.waitTime | 100000 | In yarn-cluster mode, time in milliseconds for the application master to wait for the SparkContext to be initialized. In yarn-client mode, time for the application master to wait for the driver to connect to it.
spark.yarn.submit.file.replication | The default HDFS replication (usually 3) | HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark jar, the app jar, and any distributed cache files/archives.
spark.yarn.preserve.staging.files | false | Set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
spark.yarn.scheduler.heartbeat.interval-ms | 5000 | The interval in ms in which the Spark application master heartbeats into the YARN ResourceManager.
spark.yarn.max.executor.failures | numExecutors * 2, with minimum of 3 | The maximum number of executor failures before failing the application.
spark.yarn.historyServer.address | (none) | The address of the Spark history server (e.g. host.com:18080). The address should not contain a scheme (http://). Defaults to not being set since the history server is an optional service. This address is given to the YARN ResourceManager when the Spark application finishes to link the application from the ResourceManager UI to the Spark history server UI.
spark.yarn.dist.archives | (none) | Comma-separated list of archives to be extracted into the working directory of each executor.
spark.yarn.dist.files | (none) | Comma-separated list of files to be placed in the working directory of each executor.
spark.executor.instances | 2 | The number of executors. Note that this property is incompatible with spark.dynamicAllocation.enabled.
spark.yarn.executor.memoryOverhead | executorMemory * 0.07, with minimum of 384 | The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
spark.yarn.driver.memoryOverhead | driverMemory * 0.07, with minimum of 384 | The amount of off-heap memory (in megabytes) to be allocated per driver in cluster mode. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size (typically 6-10%).
spark.yarn.am.memoryOverhead | AM memory * 0.07, with minimum of 384 | Same as spark.yarn.driver.memoryOverhead, but for the Application Master in client mode.
spark.yarn.queue | default | The name of the YARN queue to which the application is submitted.
spark.yarn.jar | (none) | The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to "hdfs:///some/path".
spark.yarn.access.namenodes | (none) | A list of secure HDFS namenodes your Spark application is going to access. For example, spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032. The Spark application must have access to the namenodes listed and Kerberos must be properly configured to be able to access them (either in the same realm or in a trusted realm). Spark acquires security tokens for each of the namenodes so that the Spark application can access those remote HDFS clusters.
spark.yarn.appMasterEnv.[EnvironmentVariableName] | (none) | Add the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN. The user can specify multiple of these to set multiple environment variables. In yarn-cluster mode this controls the environment of the Spark driver and in yarn-client mode it only controls the environment of the executor launcher.
spark.yarn.containerLauncherMaxThreads | 25 | The maximum number of threads to use in the application master for launching executor containers.
spark.yarn.am.extraJavaOptions | (none) | A string of extra JVM options to pass to the YARN Application Master in client mode. In cluster mode, use spark.driver.extraJavaOptions instead.
spark.yarn.maxAppAttempts | yarn.resourcemanager.am.max-attempts in YARN | The maximum number of attempts that will be made to submit the application. It should be no larger than the global number of max attempts in the YARN configuration.
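As a quick illustration of how these parameters are applied, they can be passed on the spark-submit command line with --conf, or placed in conf/spark-defaults.conf. A minimal sketch, in which the queue name and overhead value are arbitrary placeholders rather than recommendations:
$ ./bin/spark-submit --class path.to.your.Class \
    --master yarn-cluster \
    --conf spark.yarn.queue=thequeue \
    --conf spark.yarn.executor.memoryOverhead=512 \
    <app jar> [app options]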

To run a Spark program on YARN, first make sure the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable points to the Hadoop configuration directory (export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop).
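For example, a minimal way to set both variables in the shell that will run spark-submit (the path below assumes Hadoop's configuration lives under $HADOOP_HOME/etc/hadoop, as in the example above):
$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop   # directory containing core-site.xml, yarn-site.xml, etc.
$ export YARN_CONF_DIR=$HADOOP_CONF_DIR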
Next, you need to decide on a yarn-xxx mode. Depending on where the driver process runs, there are two modes for launching Spark on YARN:
  • yarn-client mode: the driver runs in the client process, and the Application Master (part of the YARN architecture; the scheduler for each application running in YARN) is used only to request resources from YARN. The client can see the program's printed output in its console.

  • yarn-cluster mode: the Spark driver runs inside the Application Master process. The client console does not show the program's printed output.

Note that unlike the Spark standalone and Mesos modes, where the master's address is given via the master parameter, in YARN mode the ResourceManager's address is read from the Hadoop configuration files, so the --master parameter of the spark-submit script only needs to specify yarn-client or yarn-cluster.
The following command format launches a Spark on YARN application in yarn-cluster mode:
./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
For example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
The script above starts an application in YARN cluster mode; SparkPi runs as a child thread of the Application Master. The client (outside the cluster) periodically polls the Application Master for the latest application status and displays it in the console, and exits when the application finishes running.
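To launch the same application in yarn-client mode, the command is identical except that yarn-cluster is replaced with yarn-client; an interactive spark-shell can be run against YARN in the same way (this mirrors standard spark-submit/spark-shell usage for this Spark version):
$ ./bin/spark-shell --master yarn-client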
In yarn-cluster mode, the driver and the client run on different machines, so SparkContext.addJar in the driver no longer works for files local to the client machine. To make files on the client machine available to SparkContext.addJar in the driver, include them with the --jars option of the spark-submit script, for example:
$ ./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2
