Alex's Hadoop Beginner Tutorial: Lesson 3 — Hadoop Hello World (CDH on CentOS)
Environment: Hadoop installed via CDH5
This lesson follows the simple example from the official CDH documentation, but that tutorial is not rigorous and has quite a few pitfalls waiting for you, so I reworked it into this article.
Running the wordcount (word counting) example
STEP 1 Create the input directory on HDFS
HDFS is like a virtual file space built on top of our actual file space, and this space spans many machines. Inside it you can create directories, create files, and so on, just as you would locally. Now log in to the server where Hadoop is deployed (CentOS), pick a directory to experiment in, and switch into the hdfs world:
$ sudo su hdfs
$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/cloudera
$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x   - hdfs hadoop          0 2014-07-02 14:52 /user/cloudera
$ hadoop fs -mkdir /user/cloudera/wordcount /user/cloudera/wordcount/input
$ exit
STEP 2 Create the test text files
$ echo "Hello World Bye World" > file0
$ echo "Hello Hadoop Goodbye Hadoop" > file1
$ sudo su hdfs
$ hadoop fs -put file* /user/cloudera/wordcount/input
$ exit
STEP 3 Compile WordCount.java
$ vim WordCount.java
Paste the following code into it:
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
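To see what the framework does between Map and Reduce, here is a minimal plain-Java sketch of my own (class name `LocalWordCount` is made up; it has no Hadoop dependency and is not part of the tutorial's code) that runs the same tokenize, group-by-key, and sum pipeline on the two test lines from STEP 2:

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    public static void main(String[] args) {
        // The contents of the two test files created in STEP 2
        String[] lines = {"Hello World Bye World", "Hello Hadoop Goodbye Hadoop"};

        // The shuffle stage is simulated with a sorted map: the framework
        // groups every (word, 1) pair emitted by map() under the same key
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // map() emits (word, 1); reduce() sums the ones per key
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        // Prints each word and its count, sorted by key,
        // the same shape as the part-00000 file in STEP 6
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Running this locally should produce the same five lines you will see on HDFS in STEP 6, which is a handy sanity check before going through the whole compile-and-submit cycle.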
Then compile it:
$ mkdir wordcount_classes
$ javac -cp <classpath> -d wordcount_classes WordCount.java
The value to fill in for <classpath> differs by version:
CDH 4
Parcel installation - /opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/client-0.20/*
Package installation - /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*
CDH3 - /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u6-core.jar
We are on CDH5, which works the same as CDH4 here.
So the full command is:
$ javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/* -d wordcount_classes WordCount.java
If it succeeds there is no output at all, but an org directory appears inside wordcount_classes.
STEP 4 Create the JAR
$ jar -cvf wordcount.jar -C wordcount_classes/ .
STEP 5 Run the program
$ sudo su - hdfs
Because the hdfs user's home directory is /var/lib/hadoop-hdfs, we need to cd back to the directory that contains the jar file we just built:
$ cd /data/hadoop
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
14/07/03 09:44:16 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/07/03 09:44:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/07/03 09:44:16 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
...
After a lot of output scrolls by, let's check the result.
STEP 6 Check the result
$ hadoop fs -cat /user/cloudera/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
STEP 7 Delete the result
If you want to run the tutorial again, you have to delete the previous result first:
$ hadoop fs -rm -r /user/cloudera/wordcount/output
Hadoop also provides several options for passing information to an application (wordcount is such an application):
-files lets the application specify a comma-separated list of files to be placed in its working directory
-libjars lets the application add jars to the classpath of the maps and reduces
-archives lets you pass archives as arguments. An archive is unzipped or unjarred (the opposite of building a jar) into a directory, and a link to that directory is created whose name is that of the zip or jar file (ending in .zip or .jar)
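As an illustration of the general form (the file names stopwords.txt and extra-lib.jar here are hypothetical, not files from this tutorial), these options go after the class name and before the application's own arguments:

```shell
# Hypothetical invocation: ship a side file and an extra jar with the job
$ hadoop jar wordcount.jar org.myorg.WordCount \
    -files stopwords.txt \
    -libjars extra-lib.jar \
    /user/cloudera/wordcount/input /user/cloudera/wordcount/output
```

Note that these generic options are only picked up if the driver parses them via GenericOptionsParser (for example by implementing Tool and running through ToolRunner); the plain main() in our WordCount above does not, so treat this as a sketch of the syntax rather than something to run against this exact job.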