Running C++ Programs on Hadoop 2.3: Assorted Pitfalls (Choosing Hadoop Pipes, an Error Collection, Building Hadoop 2.3, and More)


Preface

Hadoop can feel like a trap: it marches around under the banner of the ultimate big-data solution and ensnares the unwary. I remember reading an article arguing that for datasets under about 1 TB you should not bother with Hadoop at all; at that scale it offers no real advantage and can even become a burden. Hadoop therefore tends to show up in two places: companies of a certain size whose data flows run to terabytes (and there are not that many of them), and university labs, where it serves research. I, unluckily, fell into the second group, using it purely for research, and my requirement was not the usual one of developing applications on Hadoop: I had an existing C++ program that needed to be tested on the Hadoop platform. Hadoop is a Java-based data-processing platform and naturally supports Java best; to run C++ programs, there are three options:

  • Use JNI/JNA/JNR. These three Java foreign-function-interface technologies all solve the problem of calling C++ functions from a Java program, so you develop the Hadoop application in Java and have it invoke your C++ code. Early on (around 2011) Alibaba used this approach to deploy a C-language word-segmentation tool on Hadoop; see reference 1. Anyone who has used JNI knows how unpleasant it is, which is why the later alternatives JNA and JNR were created; references 2 and 3 give introductions and worked examples.
  • Use Hadoop Streaming. Streaming lets languages other than Java (C/C++/Python/C#, even shell scripts) run on Hadoop. A program only has to read records from standard input and write results to standard output in an agreed format, so an existing single-machine program can be adapted for distributed processing on Hadoop with minor changes.
  • Use Hadoop Pipes. Pipes focuses exclusively on running C++ on Hadoop and restricts you to writing MapReduce programs in C++. Its approach is to run the application-specific C++ code in a separate process and have the Java side talk to it over a socket. In that sense it closely resembles Hadoop Streaming; the difference is the transport: standard input/output in one case, a socket in the other.
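To make the Streaming contract concrete: read records from standard input, write tab-separated key/value pairs to standard output. That is simple enough to sketch as a plain shell pipeline; the function name and sample input below are purely illustrative:

```shell
# A minimal Streaming-style "mapper": split stdin into words and emit
# one "word<TAB>1" line per word, the format Hadoop Streaming expects
# to feed into the shuffle/sort stage.
map_words() {
  tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'
}

# Illustrative local run, no cluster involved:
printf 'the quick fox the\n' | map_words
```

Run locally this prints four word/count lines; under Streaming the same script would simply be supplied as the map program of a streaming job, with a reducer summing the counts per key.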

I investigated and compared all three. Option 1 was ruled out first: I did not want to write Java wrappers just to call C++ code, which is cumbersome and makes debugging very awkward. That left Hadoop Streaming versus Hadoop Pipes. Each has pros and cons (reference 4 discusses them, though not entirely accurately, so treat it as a rough guide). To choose properly, I decided to try both and see which fit my needs before making a final decision.

I started with Hadoop Pipes, the C++-only option, which kicked off the whole saga below...


Installing and Configuring Hadoop 2.3

I had configured a Hadoop environment before (see reference 5), but that was version 1.1.2, an old release with plenty of bugs. To avoid inherited bugs I picked the then-latest 2.3, which meant configuring from scratch (mainly following reference 6). Since 2.3 uses the new MapReduce framework, YARN, the configuration differs somewhat from before.

Once configured, I ran the bundled classic wordcount example to check that everything worked (input data already uploaded):

hadoop jar ./hadoop-2.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount wc_input.txt out
which produced the following error:
2014-04-03 21:19:40,847 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.ConnectException: Call From Slave1/192.168.1.152 to 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:185)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:199)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:354)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:401)
Caused by: java.net.ConnectException: Call From Slave1/192.168.1.152 to 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
    at org.apache.hadoop.ipc.Client.call(Client.java:1410)
    at org.apache.hadoop.ipc.Client.call(Client.java:1359)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy23.registerNodeManager(Unknown Source)
    at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy24.registerNodeManager(Unknown Source)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:247)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:179)
    ... 6 more
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:601)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:696)
    at org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:367)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1458)
    at org.apache.hadoop.ipc.Client.call(Client.java:1377)
    ... 18 more

This is a connection-refused problem: something tried to connect to 0.0.0.0:8031, so some property clearly had no address configured and fell back to a bad default. After some digging, the culprit was yarn-site.xml: yarn.resourcemanager.resource-tracker.address (the port-8031 endpoint the NodeManager registers with, as the registerNodeManager frames in the trace show) was not set, so it defaulted to 0.0.0.0. The working configuration is below (some of the entries may not be strictly necessary, but I set most of them to be safe):
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>192.168.1.137:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>192.168.1.137:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>192.168.1.137:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>192.168.1.137:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>192.168.1.137:8088</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
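As a quick sanity check after editing yarn-site.xml on each node, you can grep for the tracker address (config path assumed; override HADOOP_CONF_DIR if your install differs). If nothing is found, the NodeManager will fall back to the 0.0.0.0:8031 default:

```shell
# Assumed default config location; adjust HADOOP_CONF_DIR to your install.
CONF_DIR=${HADOOP_CONF_DIR:-/opt/hadoop/etc/hadoop}
# Show the configured resource-tracker address (the port-8031 endpoint
# the NodeManager registers with); no match means it is unset.
grep -A1 'resource-tracker.address' "$CONF_DIR/yarn-site.xml" 2>/dev/null \
  || echo 'resource-tracker.address not set in yarn-site.xml'
```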
After that, the example ran successfully. I assumed the environment was all set, until I tried a Hadoop Pipes program...


A Collection of Hadoop Pipes Runtime Errors

The Hadoop 2.3 release does not ship a Hadoop Pipes wordcount example, so I grabbed wordcount-simple.cc from the old 1.1.2 release, fixed the header include paths (the new layout is completely different), and, following reference 7, built it with the makefile below (the -lssl may not be strictly required):

HADOOP_INSTALL=/opt/hadoop

CC = g++
CCFLAGS = -I$(HADOOP_INSTALL)/include

wordcount: wordcount-simple.cc
        $(CC) $(CCFLAGS) $< -Wall -L$(HADOOP_INSTALL)/lib/native -lhadooppipes -lhadooputils -lpthread -lcrypto -lssl -g -O2 -o $@
Once it compiled, I uploaded the binary to HDFS and test-ran it with:

hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true  \
                       -D mapred.job.name=wordcount -input /data/wc_in -output /data/wc_out2  -program /bin/wordcount
It sat at "map 0% reduce 0%" for ages, and after who knows how long it spat out the following series of errors:

DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.
14/04/03 23:59:48 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.137:8032
14/04/03 23:59:49 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.137:8032
14/04/03 23:59:50 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/04/03 23:59:50 INFO mapred.FileInputFormat: Total input paths to process : 2
14/04/03 23:59:51 INFO mapreduce.JobSubmitter: number of splits:2
14/04/03 23:59:51 INFO Configuration.deprecation: hadoop.pipes.java.recordreader is deprecated. Instead, use mapreduce.pipes.isjavarecordreader
14/04/03 23:59:51 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/04/03 23:59:51 INFO Configuration.deprecation: hadoop.pipes.java.recordwriter is deprecated. Instead, use mapreduce.pipes.isjavarecordwriter
14/04/03 23:59:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1396578697573_0004
14/04/03 23:59:52 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/04/03 23:59:53 INFO impl.YarnClientImpl: Submitted application application_1396578697573_0004
14/04/03 23:59:53 INFO mapreduce.Job: The url to track the job: http://Master:8088/proxy/application_1396578697573_0004/
14/04/03 23:59:53 INFO mapreduce.Job: Running job: job_1396578697573_0004
14/04/04 00:00:26 INFO mapreduce.Job: Job job_1396578697573_0004 running in uber mode : false
14/04/04 00:00:26 INFO mapreduce.Job: map 0% reduce 0%
14/04/04 00:10:53 INFO mapreduce.Job: map 100% reduce 0%
14/04/04 00:10:53 INFO mapreduce.Job: Task Id : attempt_1396578697573_0004_m_000001_0, Status : FAILED
AttemptID:attempt_1396578697573_0004_m_000001_0 Timed out after 600 secs
14/04/04 00:10:54 INFO mapreduce.Job: Task Id : attempt_1396578697573_0004_m_000000_0, Status : FAILED
AttemptID:attempt_1396578697573_0004_m_000000_0 Timed out after 600 secs
14/04/04 00:10:55 INFO mapreduce.Job: map 0% reduce 0%
14/04/04 00:21:23 INFO mapreduce.Job: map 100% reduce 0%
14/04/04 00:21:24 INFO mapreduce.Job: Task Id : attempt_1396578697573_0004_m_000000_1, Status : FAILED
AttemptID:attempt_1396578697573_0004_m_000000_1 Timed out after 600 secs
14/04/04 00:21:24 INFO mapreduce.Job: Task Id : attempt_1396578697573_0004_m_000001_1, Status : FAILED
AttemptID:attempt_1396578697573_0004_m_000001_1 Timed out after 600 secs
14/04/04 00:21:25 INFO mapreduce.Job: map 0% reduce 0%
14/04/04 00:31:53 INFO mapreduce.Job: Task Id : attempt_1396578697573_0004_m_000000_2, Status : FAILED
AttemptID:attempt_1396578697573_0004_m_000000_2 Timed out after 600 secs
14/04/04 00:31:53 INFO mapreduce.Job: Task Id : attempt_1396578697573_0004_m_000001_2, Status : FAILED
AttemptID:attempt_1396578697573_0004_m_000001_2 Timed out after 600 secs
14/04/04 00:42:24 INFO mapreduce.Job: map 100% reduce 0%
14/04/04 00:42:25 INFO mapreduce.Job: map 100% reduce 100%
14/04/04 00:42:26 INFO mapreduce.Job: Job job_1396578697573_0004 failed with state FAILED due to: Task failed task_1396578697573_0004_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/04/04 00:42:27 INFO mapreduce.Job: Counters: 9
    Job Counters
        Failed map tasks=8
        Launched map tasks=8
        Other local map tasks=6
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=5017539
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=5017539
        Total vcore-seconds taken by all map tasks=5017539
        Total megabyte-seconds taken by all map tasks=5137959936
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:264)
    at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:503)
    at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:518)
I searched the web for solutions (I resorted to the web because the namenode, datanode, and YARN logs contained no error messages or anything else useful, which left me baffled about the cause), but none of the proposed fixes were sound and none solved the problem. After a lot of poking around on my own, I finally found the error message in a well-hidden log. On each datanode, the logs directory under the Hadoop install path contains a userlogs folder. I had assumed it was useless; it turned out to answer every question I had. It holds one subfolder per application, each application folder holds container subfolders, and each container folder holds the three standard output logs: stderr, stdout, and syslog. The layout looks like this (container logs on a datanode):

|-- <Hadoop install path>
    |-- logs
        |-- userlogs
            |-- application_XXXXXXXX
                |-- container_XXXXXXXX
                    |-- stderr
                    |-- stdout
                    |-- syslog


stderr is the important one: the concrete cause of the failure is recorded there. For the errors above, for example, it contained the following hint (... marks elided, irrelevant text):

.../application_1396607014314_0001/container_1396607014314_0001_01_000002/wordcount: 
error while loading shared libraries: libcrypto.so.10: cannot open shared object file: No such file or directory
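Rather than drilling down through the application_*/container_* folders by hand every time, a recursive search can dump every non-empty container stderr in one pass (install root assumed; adjust HADOOP_HOME to your setup):

```shell
# Assumed install root; adjust HADOOP_HOME to your setup.
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}
# Print every non-empty container stderr under userlogs, prefixed with
# its path, so the real failure cause is easy to spot at a glance.
find "$HADOOP_HOME/logs/userlogs" -name stderr -size +0c \
  -exec sh -c 'echo "== $1 =="; cat "$1"' _ {} \; 2>/dev/null || true
```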
Running "locate libcrypto.so.10" on the datanode confirmed the file really was missing, but it existed on the namenode, so I copied it from the namenode into /usr/lib on the datanode and reran the job. Same symptom: it hung and then failed with the same errors. Back in the datanode's container logs, there was a new problem:
/usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.11' not found (required by .../wordcount)
Again the namenode had the file and the datanode did not, so once more I copied the namenode's libstdc++.so.6 over and reran. The error persisted, this time because:
error while loading shared libraries: /usr/lib/libstdc++.so.6: ELF file OS ABI invalid
At that point I stopped hunting for fixes, because I thought I could see the root of the problem. Every failure was a library the datanode lacked and the namenode had, and copying libraries over from the namenode could itself produce mismatched, unusable files. The real issue had to be the mismatch between the operating systems of the namenode (master) and datanode (slave) nodes. My master ran Ubuntu 13.04 while the slave ran the ancient Ubuntu 9.10; both are Ubuntu, but the kernels differ and the various libraries have been upgraded and changed between releases. When Hadoop runs a job, master and slaves communicate using the same request mechanism, so if the master uses, say, the libcrypto.so.10 version of a library for some operation, a slave answering or issuing the same kind of request will look for the same library on its own system. It is not there, and under Hadoop's timeout-and-retry behavior the task keeps retrying, never succeeds, hangs at the same spot, and finally fails to finish within the allotted time, hence "Job failed". Presumably, if your master ran Ubuntu and your slaves ran Fedora, the same kind of error could easily occur. Why, then, did the plain Java wordcount run fine earlier? I honestly do not know the internals well enough to say. My guess is that Hadoop Pipes communicates over sockets and therefore needs those libraries, while the ordinary Java program has no such step and escaped unharmed; I would not rule out a more complex, communication-heavy Java program hitting the same error.
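For this class of mismatch, two standard tools answer the key questions quickly: do all of a binary's shared-library dependencies resolve on this node, and does a given file's ELF class/ABI actually match the host? In the sketch below /bin/sh stands in for the Pipes binary so it runs anywhere; on the cluster you would point the same commands at ./wordcount and the exact library named in the error:

```shell
# /bin/sh stands in for the actual Pipes binary (./wordcount) so this
# sketch runs on any Linux box.
file /bin/sh || true   # reports the ELF class and target ABI of a binary/.so
ldd /bin/sh || true    # unresolved dependencies show up as "not found"
# On a datanode you would instead run, e.g.:
#   ldd ./wordcount
#   grep -ao 'GLIBCXX_[0-9.]*' /usr/lib/libstdc++.so.6 | sort -u
```

The GLIBCXX_3.4.11 error above means the installed libstdc++ predates the version the binary was compiled against; the grep lists which GLIBCXX versions a given libstdc++ actually provides.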

From all this I drew a conclusion:

The master node and all slave nodes of a Hadoop cluster should run identical system environments (same OS, same release number); for Hadoop Pipes, they effectively must be identical.

Since all my nodes were virtual machines, the simplest way to get identical environments was to clone the VM (a linked clone is enough). I then reconfigured the Hadoop 2.3 environment and reran the job, thinking the problem was solved. A new one appeared:

14/04/06 01:11:49 INFO mapreduce.Job: Running job: job_1396756477966_0002
14/04/06 01:12:04 INFO mapreduce.Job: Job job_1396756477966_0002 running in uber mode : false
14/04/06 01:12:04 INFO mapreduce.Job: map 0% reduce 0%
14/04/04 09:59:02 INFO mapreduce.Job: Task Id : attempt_1396618478715_0002_m_000000_0, Status : FAILED
Error: java.io.IOException
    at org.apache.hadoop.mapred.pipes.OutputHandler.waitForAuthentication(OutputHandler.java:186)
    at org.apache.hadoop.mapred.pipes.Application.waitForAuthentication(Application.java:195)
    at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:150)
    at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:69)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
[the identical Error: java.io.IOException stack trace repeats for attempts m_000001_0, m_000000_1, m_000001_1, m_000000_2, and m_000001_2]
14/04/04 09:59:40 INFO mapreduce.Job: map 100% reduce 100%
14/04/04 09:59:41 INFO mapreduce.Job: Job job_1396618478715_0002 failed with state FAILED due to: Task failed task_1396618478715_0002_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/04/04 09:59:41 INFO mapreduce.Job: Counters: 13
    Job Counters
        Failed map tasks=7
        Killed map tasks=1
        Launched map tasks=8
        Other local map tasks=6
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=88669
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=88669
        Total vcore-seconds taken by all map tasks=88669
        Total megabyte-seconds taken by all map tasks=90797056
    Map-Reduce Framework
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:264)
    at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:503)
    at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:518)
Note that this time it did not hang at "map 0% reduce 0%" but failed quickly, so the earlier problem really was fixed; this was clearly not a system mismatch but an I/O problem. Checking the container logs again revealed the cause:

Server failed to authenticate. Exiting

Searching for a fix, the cause turned out, astonishingly, to lie in Hadoop 2.3 itself: you have to build the two static libraries the makefile links against, libhadooppipes.a and libhadooputils.a, yourself so that they match your own system. (The official docs say the prebuilt native libraries are 32-bit and only 64-bit platforms need a rebuild; my host was 64-bit but the VMs were 32-bit, and for reasons unclear I still had to rebuild.) Not exactly a user-friendly requirement. And so began the descent into the hell of rebuilding the Hadoop 2.3 native libraries...


Building the Hadoop 2.3 Native Libraries

The first step, naturally, is to download the Hadoop 2.3 source. Unpack it and, following the bundled BUILDING.txt, build the native libraries with:

mvn package -Pdist,native -DskipTests -Dtar
All sorts of errors came up along the way, essentially all of them covered in reference 9. The fix in every case was to install the build-time dependencies first: protobuf, cmake, zlib-devel, openssl-devel, and so on. Once those were in place, the build finished with "BUILD SUCCESS". Hard-won. The official docs do in fact list these prerequisites up front (see reference 10); I had simply been too lazy to read them first.

After a successful build, the libhadooppipes.a and libhadooputils.a we need appear under

HADOOP_PATH/hadoop-tools/hadoop-pipes/target/native/
Copy them into the lib/native directory on the master and every slave machine (back up the bundled libraries first, in case this approach fails), rebuild the wordcount program with make, run it under Hadoop Pipes again, and at last:

14/04/06 01:22:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/06 01:22:28 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.137:8032
14/04/06 01:22:28 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.137:8032
14/04/06 01:22:29 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
14/04/06 01:22:29 INFO mapred.FileInputFormat: Total input paths to process : 2
14/04/06 01:22:30 INFO mapreduce.JobSubmitter: number of splits:2
14/04/06 01:22:30 INFO Configuration.deprecation: hadoop.pipes.java.recordreader is deprecated. Instead, use mapreduce.pipes.isjavarecordreader
14/04/06 01:22:30 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/04/06 01:22:30 INFO Configuration.deprecation: hadoop.pipes.java.recordwriter is deprecated. Instead, use mapreduce.pipes.isjavarecordwriter
14/04/06 01:22:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1396756477966_0004
14/04/06 01:22:31 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/04/06 01:22:31 INFO impl.YarnClientImpl: Submitted application application_1396756477966_0004
14/04/06 01:22:31 INFO mapreduce.Job: The url to track the job: http://Master:8088/proxy/application_1396756477966_0004/
14/04/06 01:22:31 INFO mapreduce.Job: Running job: job_1396756477966_0004
14/04/06 01:22:40 INFO mapreduce.Job: Job job_1396756477966_0004 running in uber mode : false
14/04/06 01:22:40 INFO mapreduce.Job:  map 0% reduce 0%
14/04/06 01:22:56 INFO mapreduce.Job:  map 67% reduce 0%
14/04/06 01:22:57 INFO mapreduce.Job:  map 100% reduce 0%
14/04/06 01:23:09 INFO mapreduce.Job:  map 100% reduce 100%
14/04/06 01:23:11 INFO mapreduce.Job: Job job_1396756477966_0004 completed successfully
14/04/06 01:23:12 INFO mapreduce.Job: Counters: 51
    File System Counters
        FILE: Number of bytes read=118
        FILE: Number of bytes written=260641
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=266
        HDFS: Number of bytes written=86
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=30171
        Total time spent by all reduces in occupied slots (ms)=10914
        Total time spent by all map tasks (ms)=30171
        Total time spent by all reduce tasks (ms)=10914
        Total vcore-seconds taken by all map tasks=30171
        Total vcore-seconds taken by all reduce tasks=10914
        Total megabyte-seconds taken by all map tasks=30895104
        Total megabyte-seconds taken by all reduce tasks=11175936
    Map-Reduce Framework
        Map input records=8
        Map output records=9
        Map output bytes=94
        Map output materialized bytes=124
        Input split bytes=190
        Combine input records=0
        Combine output records=0
        Reduce input groups=8
        Reduce shuffle bytes=124
        Reduce input records=9
        Reduce output records=8
        Spilled Records=18
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=764
        CPU time spent (ms)=2490
        Physical memory (bytes) snapshot=384413696
        Virtual memory (bytes) snapshot=3685318656
        Total committed heap usage (bytes)=258613248
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    WORDCOUNT
        INPUT_WORDS=9
        OUTPUT_WORDS=8
    File Input Format Counters 
        Bytes Read=76
    File Output Format Counters 
        Bytes Written=86
14/04/06 01:23:12 INFO util.ExitUtil: Exiting with status 0

A real mix of emotions. Setting up this environment was painful, but it sharpened my ability to find and solve problems, and it taught me that when something goes wrong, the logs are always the best way in.

I am recording this Hadoop Pipes setup odyssey here partly for my own reference, and partly to light the way for anyone who runs into the same problems...


References

1. How to run JNI programs on a Hadoop cluster
2. A JNI alternative: accessing Java's foreign function interface with JNA
3. Another JNI alternative: accessing Java's foreign function interface with JNR (jnr-ffi)
4. Understanding Hadoop Streaming and Pipes
5. A step-by-step guide to installing and configuring a multi-node Hadoop cluster
6. An illustrated guide to setting up a Hadoop 2.3.0 distributed cluster
7. Hadoop Tutorial 2.2 -- Running C++ Programs on Hadoop
8. An in-depth look at YARN, Hadoop's new MapReduce framework
9. Building the Hadoop 2.3 native library
10. Native Libraries Guide
