Spark Study Notes 1 -- Compiling the Source Code


The best way to learn a framework is to debug its source code.

Compiling Spark 0.8.1 with Hadoop 2.2.0

Local environment:

1. Eclipse Kepler

2. Maven 3.1

3. Scala 2.9.3

4. Ubuntu 12.04
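
Before building, it helps to confirm the toolchain actually matches the versions above. On Ubuntu a quick check looks something like this:

java -version      # the JDK used by both Maven and sbt
mvn -version       # should report Maven 3.1.x
scala -version     # should report Scala 2.9.3
lsb_release -d     # should report Ubuntu 12.04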

Steps:

1. First, download the Spark 0.8.1 source code from the web (one way to fetch it is sketched after step 2).

2.  unzip v0.8.1-incubating.zip
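
For completeness, here is a minimal fetch-and-extract sketch. The URL is an assumption on my part (GitHub's archive of the v0.8.1-incubating tag in the old apache/incubator-spark repository, which matches the zip name above); the original download link was not preserved:

# Assumed URL: GitHub archive of the v0.8.1-incubating tag (apache/incubator-spark).
wget https://github.com/apache/incubator-spark/archive/v0.8.1-incubating.zip
unzip v0.8.1-incubating.zip
cd incubator-spark-0.8.1-incubating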

3.  export MAVEN_OPTS="-Xmx1g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"  // Set -Xmx to suit your machine. I used 1 GB since my machine is rather old; 2 GB is recommended. If the JVM dies, fall back to 1 GB and just live with a slower build.

victor@victor-ubuntu:~/software/incubator-spark-0.8.1-incubating$ export MAVEN_OPTS="-Xmx1g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
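
To confirm the setting is live in the current shell, and optionally persist it for future sessions (assuming bash and ~/.bashrc):

echo $MAVEN_OPTS
# Optional: make the setting stick across shells.
echo 'export MAVEN_OPTS="-Xmx1g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"' >> ~/.bashrc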

4.  Maven really is convenient: mvn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pnew-yarn -DskipTests package

victor@victor-ubuntu:~/software/incubator-spark-0.8.1-incubating$ mvn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0  -Pnew-yarn -DskipTests package

5. .........here I waited ages. It was already the small hours and I nearly fell asleep...

The build finally succeeded:

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .......................... SUCCESS [5.742s]
[INFO] Spark Project Core ................................ SUCCESS [6:55.638s]
[INFO] Spark Project Bagel ............................... SUCCESS [57.687s]
[INFO] Spark Project Streaming ........................... SUCCESS [1:59.625s]
[INFO] Spark Project ML Library .......................... SUCCESS [1:12.154s]
[INFO] Spark Project Examples ............................ SUCCESS [4:01.735s]
[INFO] Spark Project Tools ............................... SUCCESS [18.163s]
[INFO] Spark Project REPL ................................ SUCCESS [59.977s]
[INFO] Spark Project YARN Support ........................ SUCCESS [1:24.402s]
[INFO] Spark Project Assembly ............................ SUCCESS [47.046s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18:42.710s
[INFO] Finished at: Fri Mar 28 00:47:06 CST 2014
[INFO] Final Memory: 64M/560M
[INFO] ------------------------------------------------------------------------
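
Once the reactor summary reports success, the freshly built jars can be located with a quick find, run from the source root (the paths match the directory listings further below):

find assembly examples -name '*assembly*.jar'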

Then run the assembly with sbt (the simple build tool):

victor@victor-ubuntu:~/software/incubator-spark-0.8.1-incubating$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true ./sbt/sbt assembly
Getting org.scala-sbt sbt 0.12.4 ...

Over an hour later...

[info] Checking every *.class/*.jar file's SHA-1.
[info] SHA-1: 040d65230771f2da5c90328a4e4ea844a489f39e
[info] Packaging /home/victor/software/incubator-spark-0.8.1-incubating/examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar ...
[info] Done packaging.



[info] Done packaging.
[success] Total time: 4488 s, completed Mar 28, 2014 2:18:46 AM

After packaging finishes, two jars are produced that we will lean on from here onward: assembly/target/scala-2.9.3/ contains spark-assembly-0.8.1-incubating-hadoop2.2.0.jar, and examples/target/scala-2.9.3/ contains spark-examples-assembly-0.8.1-incubating.jar.

victor@victor-ubuntu:~/software/incubator-spark-0.8.1-incubating/assembly/target/scala-2.9.3$ ll
total 90504
drwxrwxr-x 3 victor victor     4096  3月 28 21:43 ./
drwxrwxr-x 9 victor victor     4096  3月 28 01:27 ../
drwxrwxr-x 3 victor victor     4096  3月 28 01:27 cache/
-rw-rw-r-- 1 victor victor 92659663  3月 28 02:06 spark-assembly-0.8.1-incubating-hadoop2.2.0.jar



victor@victor-ubuntu:~/software/incubator-spark-0.8.1-incubating/examples/target/scala-2.9.3$ ll
total 179004
drwxrwxr-x 5 victor victor      4096  3月 28 01:59 ./
drwxrwxr-x 8 victor victor      4096  3月 28 01:26 ../
drwxrwxr-x 3 victor victor      4096  3月 28 01:23 cache/
drwxrwxr-x 4 victor victor      4096  3月 28 00:40 classes/
-rw-rw-r-- 1 victor victor  59982904  3月 28 00:43 spark-examples_2.9.3-assembly-0.8.1-incubating.jar
-rw-rw-r-- 1 victor victor 123286056  3月 28 02:18 spark-examples-assembly-0.8.1-incubating.jar
drwxrwxr-x 3 victor victor      4096  3月 28 00:41 test-classes/
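
To double-check that the assembly jar really bundles the Hadoop 2.2.0 client classes, you can peek inside it (a quick sanity check, assuming unzip is installed):

unzip -l assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.2.0.jar | grep hadoop | head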

Put the following into a single directory, spark_client, to serve as the client:

conf/ (copy the whole directory)
assembly/target/scala-2.9.3/ (only the jar needs to be copied)
examples/target/scala-2.9.3/ (only the jar needs to be copied)
the spark-class file

Make sure the client ends up with: the conf directory, the spark-class file, the assembly directory (with its target directory inside), and the examples directory (with its target directory inside); a copy sketch follows below. You then need a small script to launch Spark programs; just start from the bundled examples. For details, see my next post, the running guide: Spark Study Notes 2 -- Computing Pi.
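
A minimal sketch of the copy steps, assuming you run it from the root of the extracted source tree and want exactly the layout described above:

# Recreate the expected directory structure inside spark_client.
mkdir -p spark_client/assembly/target/scala-2.9.3
mkdir -p spark_client/examples/target/scala-2.9.3
cp -r conf spark_client/
cp spark-class spark_client/
cp assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.2.0.jar spark_client/assembly/target/scala-2.9.3/
cp examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar spark_client/examples/target/scala-2.9.3/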


A Note About Hadoop Versions

Spark uses the Hadoop-client library to talk to HDFS and other Hadoop-supported storage systems. Because the HDFS protocol has changed in different versions of Hadoop, you must build Spark against the same version that your cluster uses. By default, Spark links to Hadoop 1.0.4. You can change this by setting the SPARK_HADOOP_VERSION variable when compiling:

SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly

In addition, if you wish to run Spark on YARN, set SPARK_YARN to true:

SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

Note that on Windows, you need to set the environment variables on separate lines, e.g., set SPARK_HADOOP_VERSION=1.2.1.

For this version of Spark (0.8.1), Hadoop 2.2.x (or newer) users will have to build Spark and publish it locally. See Launching Spark on YARN. This is needed because Hadoop 2.2 has non-backwards-compatible API changes.
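
As I read it, "publish it locally" boils down to running sbt's publish-local task after the assembly; the exact invocation below is my assumption (task name per sbt 0.12.x):

SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly publish-local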


References:

http://dongxicheng.org/framework-on-yarn/build-spark-on-hadoop-2-yarn/

https://spark.apache.org/docs/0.8.1/index.html#a-note-about-hadoop-versions


<Original post; when reposting, please credit the source: http://blog.csdn.net/oopsoom/article/details/22345777>
