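Introduction
For work I will soon be adopting Spark. I have studied it before, and I now plan to read part of the official documentation for the latest release, Spark 1.3, in detail: first as a review, second to catch up on recent developments, and third to prepare training material for my team at work.
Original URL: http://spark.apache.org/docs/latest/cluster-overview.html This document focuses on the key components of the Spark cluster architecture, which is essential for understanding how Spark runs.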
Spark cluster components
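  • Each application gets its own set of executor processes, which stay up for the whole lifetime of the application and run the tasks assigned to them in multiple threads. The benefit is that applications are isolated from one another. However, it also means that data cannot be shared across applications (instances of SparkContext) unless it is written to an external storage system (Tachyon, perhaps?).
  • Because the driver program schedules the tasks that run on the cluster, it should run close to the worker nodes, preferably on the same local area network; otherwise, communication between the driver and the workers will affect program execution.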

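Spark supports three types of cluster manager:
  • Standalone: a relatively simple cluster manager that ships with Spark;
  • Apache Mesos: a general-purpose cluster manager that can also run Hadoop MapReduce programs and other service applications;
  • Hadoop YARN: the resource manager in Hadoop 2.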

Applications can be submitted to any of these cluster manager types with the spark-submit script.
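To make the submission step concrete, here is a minimal Python sketch that assembles a spark-submit command line for each of the three cluster manager types. The `--class`, `--master`, and `--deploy-mode` options are real spark-submit options; the host names, ports, class name, jar name, and the helper `build_spark_submit` are placeholders invented for illustration. The YARN master value in particular depends on the Spark version (the 1.x line documents `yarn-client` and `yarn-cluster`, later releases use `yarn`), so treat it as an assumption.

```python
import shlex

# Master URL forms for the three cluster manager types. Host names and
# ports here are placeholders, not values from the original post.
MASTER_URLS = {
    "standalone": "spark://master-host:7077",
    "mesos": "mesos://master-host:5050",
    # YARN takes no host in the master URL; the cluster location comes from
    # the Hadoop configuration. The exact value is version-dependent
    # ("yarn-client" / "yarn-cluster" in the 1.x line, "yarn" later).
    "yarn": "yarn",
}


def build_spark_submit(manager, main_class, app_jar,
                       deploy_mode="client", app_args=()):
    """Return the spark-submit argument list for one cluster manager type."""
    if manager not in MASTER_URLS:
        raise ValueError(f"unknown cluster manager: {manager}")
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    return [
        "spark-submit",
        "--class", main_class,
        "--master", MASTER_URLS[manager],
        "--deploy-mode", deploy_mode,
        app_jar,
        *app_args,
    ]


if __name__ == "__main__":
    # Print one command line per cluster manager type.
    for manager in MASTER_URLS:
        cmd = build_spark_submit(manager, "com.example.WordCount", "wordcount.jar")
        print(shlex.join(cmd))
```

Only the value passed to `--master` changes between cluster managers; the application jar and its arguments stay the same.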

Monitoring Spark applications: each driver program has a web UI, typically on port 4040, that shows information about running tasks, executors, and storage usage. To open it, go to http://<driver-node>:4040 in a browser.
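As a small illustration of the address above, the following Python sketch lists candidate web UI URLs for a driver host. It relies on one fact from the Spark monitoring documentation that the text above does not state: when several SparkContexts run on the same host, they bind to successive ports starting at 4040 (4041, 4042, and so on). The function name `driver_ui_urls` and the host name `driver-node` are hypothetical.

```python
def driver_ui_urls(driver_host, first_port=4040, attempts=3):
    """List candidate web UI URLs for drivers running on one host.

    Assumes the successive-port behaviour (4040, 4041, ...) described in
    the Spark monitoring documentation; only builds strings, does no I/O.
    """
    return [f"http://{driver_host}:{first_port + i}" for i in range(attempts)]


if __name__ == "__main__":
    for url in driver_ui_urls("driver-node"):
        print(url)
```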

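Job scheduling: Spark gives control over resource allocation at two levels:
  • across applications (at the level of the cluster manager);
  • within an application (when multiple computations run concurrently on the same SparkContext).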


Term Meaning
Application User program built on Spark. Consists of a driver program and executors on the cluster.
In short: Application = one driver + multiple executors.
Application jar A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime.
Driver program The process running the main() function of the application and creating the SparkContext
Cluster manager An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
That is, an external service through which cluster resources are obtained, such as the standalone manager, Mesos, or YARN.
Deploy mode Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
  • In cluster mode, the Spark framework launches the driver process inside the cluster.
  • In client mode, the submitter launches the driver process outside the cluster.
Worker node Any node that can run application code in the cluster
Executor A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
That is, processes launched on worker nodes to run an application; they execute tasks and keep data in memory or on disk.
Task A unit of work that will be sent to one executor
Job A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
That is, the set of parallel tasks triggered by a Spark action such as save or collect.
Stage Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
That is, each job is split into groups of tasks; each group is called a stage, and the stages depend on one another, much like the map and reduce stages in the MapReduce model.
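The relationship between the job, stage, and task terms in the glossary can be sketched as a small data model. This is a minimal illustration in plain Python, not Spark's actual scheduler classes: the names `Task`, `Stage`, `Job`, and `stages_in_order` are invented here, and the sketch assumes the usual Spark convention that a stage runs one task per data partition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    stage_id: int
    partition: int  # assumption: one task handles one partition


@dataclass
class Stage:
    stage_id: int
    num_partitions: int
    # Stages that must finish before this one can start.
    parents: List["Stage"] = field(default_factory=list)

    def tasks(self) -> List[Task]:
        """Create one task per partition of this stage."""
        return [Task(self.stage_id, p) for p in range(self.num_partitions)]


@dataclass
class Job:
    action: str         # the action that triggered the job, e.g. "collect"
    final_stage: Stage  # the stage that produces the action's result

    def stages_in_order(self) -> List[Stage]:
        """Return all stages so that every parent precedes its dependents."""
        ordered: List[Stage] = []
        seen = set()

        def visit(stage: Stage) -> None:
            if stage.stage_id in seen:
                return
            seen.add(stage.stage_id)
            for parent in stage.parents:
                visit(parent)
            ordered.append(stage)

        visit(self.final_stage)
        return ordered


# A word-count-like job: a map-side stage feeding a reduce-side stage.
map_stage = Stage(stage_id=0, num_partitions=4)
reduce_stage = Stage(stage_id=1, num_partitions=2, parents=[map_stage])
job = Job(action="collect", final_stage=reduce_stage)

for stage in job.stages_in_order():
    print(f"stage {stage.stage_id}: {len(stage.tasks())} tasks")
```

The ordering step mirrors the "stages depend on each other" wording: a stage is listed only after every stage it depends on.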

