Implementing the K-means algorithm on Hadoop with Mahout


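Mahout mainly implements three families of algorithms: collaborative filtering, clustering, and classification. Here we use Mahout to run the classic K-means clustering algorithm.

First, download Hadoop and Mahout. Many Mahout implementations run on Hadoop, so Hadoop has to be installed first.

How is it installed? Briefly:

1. Install SSH first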

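ufw disable   turn off the firewall

sudo apt-get install openssh-server   install the SSH server

sudo service ssh start   start the SSH service

ssh-keygen -t rsa   generate an SSH key pair (this creates ~/.ssh if it does not exist yet)

cd ~/.ssh/   enter the .ssh directory

cp id_rsa.pub authorized_keys   copy the public key so that logins to localhost need no password

ssh localhost   test the connection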

 

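2. Unpack Hadoop

tar -zxvf hadoop-1.1.2.tar.gz   extract the tar.gz archive

3. Add environment variables

export JAVA_HOME=/usr/local/jdk7   set JAVA_HOME

export PATH=.:$JAVA_HOME/bin:$PATH   put the JDK binaries on the PATH

4. For a single-node setup, at least four configuration files have to be edited (in Hadoop 1.x these are typically hadoop-env.sh, core-site.xml, hdfs-site.xml and mapred-site.xml under conf/)

5. Other commands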

 

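hadoop namenode -format   format the Hadoop NameNode; DataNodes do not need to be formatted

start-all.sh   start all Hadoop services

stop-all.sh   stop all Hadoop services

start-dfs.sh   start HDFS only

stop-dfs.sh   stop HDFS only

start-mapred.sh   start the two MapReduce services (JobTracker and TaskTracker)

hadoop-daemon.sh start [daemon name]   start a single daemon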

 

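jps   list the running Java processes

ps -e | grep ssh   check whether the SSH service is running

ifconfig -a | grep inet   show the network addresses

6. Installing Mahout is similar

Unpack it, set the environment variables, and then run the mahout command. If it prints the list of available algorithms, the installation succeeded.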

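The Reuters-21578 text corpus can be downloaded from:

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz

You can also prepare your own data set. I use my own data set in this walkthrough.

I collected information on 1000 songs, as follows:


I stored this information in a MongoDB database because it will be needed again later; storing it is optional. Java code then reads the songs back out and generates one txt file per song. The text is preprocessed along the way: each tag is given its own weight, and the lyrics are run through word segmentation.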

Map<String, Object> outmap = new HashMap<String, Object>();
        outmap.put("flag", false);
        List<Song> list = songRepository.findAll();
        // Check for null before reading the size (the original read list.size() first).
        if (list != null) {
            int size = list.size();
            String[] strs = new String[size];
            // Loop over every song
            for (int i = 0; i < size; i++) {
                Song song = list.get(i);
                // Weighted tags: a tag repeated n times gets n times the term frequency
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < 8; j++) {
                    sb.append(song.getArtist()).append(" ");
                }
                for (int j = 0; j < 2; j++) {
                    sb.append(song.getAlbum()).append(" ");
                }
                for (int j = 0; j < 5; j++) {
                    sb.append(song.getType()).append(" ");
                }
                for (int j = 0; j < 3; j++) {
                    sb.append(song.getDistrict()).append(" ");
                }
                for (int j = 0; j < 6; j++) {
                    sb.append(song.getYears()).append(" ");
                }
                for (int j = 0; j < 3; j++) {
                    sb.append(song.getRhythm()).append(" ");
                }
                for (int j = 0; j < 4; j++) {
                    sb.append(song.getMood()).append(" ");
                }
                // Unweighted lyrics
                String strLrc = song.getLrc();
                // Segment the lyrics into space-separated words
                strLrc = SplitWord.splitWordBySpace(strLrc);
                sb.append(strLrc);
                strs[i] = sb.toString();
            }
            // Write one txt file per song
            WriteLines.writeStrBecomeTxts("C:\\Users\\xin\\Desktop\\大论文\\Scala", "utf-8", strs);
            outmap.put("flag", true);
            return outmap;
        } else {
            return outmap;
        }
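The seven repetition loops above all do the same thing, so the weighting idea can be isolated in a small helper. The following is only a sketch: the class name `WeightedTagsSketch`, the method `buildDocument`, and the parallel-array signature are hypothetical and not part of the original project. It shows that a tag with weight n is simply written n times before the segmented lyrics are appended.

```java
final class WeightedTagsSketch {

    /**
     * Hypothetical helper: writes tags[i] exactly weights[i] times, each followed
     * by a space, and then appends the already segmented lyrics.
     */
    static String buildDocument(String[] tags, int[] weights, String segmentedLyrics) {
        if (tags.length != weights.length) {
            throw new IllegalArgumentException("tags and weights must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tags.length; i++) {
            for (int j = 0; j < weights[i]; j++) {
                sb.append(tags[i]).append(' ');
            }
        }
        sb.append(segmentedLyrics);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder values; the real code takes them from the Song object.
        String doc = buildDocument(
                new String[] {"artist", "album"},
                new int[] {8, 2},
                "some segmented lyrics");
        System.out.println(doc);
    }
}
```

Parallel arrays are used instead of a map so that two tags with the same text (for example an album named after the artist) each keep their own weight.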

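The generated files look like this:


Pack these files into a single file in SequenceFile format, which Hadoop can read:

mahout seqdirectory -i file:/usr/song-input -o file:/usr/song-output -c UTF-8 -chunk 64 -xm sequential

The file: prefix tells Mahout to look on the local file system rather than on HDFS. -xm sequential means the job runs locally instead of as a MapReduce job.

-chunk 64 writes the output in chunks of 64 MB each, which matches the default HDFS block size in Hadoop 1.x.

Next, convert the SequenceFile into vectors. Put the file generated in the previous step onto HDFS and run: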

hadoop fs -mkdir input
hadoop fs -put /usr/song-output/chunk-0 input
mahout seq2sparse -i input -o output -ow --weight tfidf --maxDFPercent 95 --namedVector -a org.apache.lucene.analysis.core.WhitespaceAnalyzer

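-i   input directory

-o   output directory

--weight   weighting scheme (tf or tfidf)

--maxDFPercent   drop very common terms, here those that occur in more than 95% of the documents

-a   the analyzer (tokenizer) to use. The lyrics were already segmented with IK earlier, so splitting on whitespace is enough here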
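To make the effect of `--weight tfidf` and `--maxDFPercent` concrete, here is a minimal sketch of the idea on whitespace-separated documents. It uses the textbook weight tf × ln(N / df), which is an assumption made for illustration: Mahout's own implementation may use a different tf-idf variant, and the class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TfIdfSketch {

    /** Splits on whitespace, which is all that pre-segmented text needs. */
    static String[] tokens(String doc) {
        String trimmed = doc.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Document frequency: in how many documents each term occurs at least once. */
    static Map<String, Integer> documentFrequency(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            Set<String> distinct = new HashSet<>();
            for (String t : tokens(doc)) {
                distinct.add(t);
            }
            for (String t : distinct) {
                df.merge(t, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Textbook tf-idf weights for one document. Terms whose document frequency
     * exceeds maxDfPercent of the corpus are dropped, mirroring the intent of
     * --maxDFPercent (not necessarily Mahout's exact rule).
     */
    static Map<String, Double> weights(String doc, Map<String, Integer> df,
                                       int numDocs, int maxDfPercent) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens(doc)) {
            tf.merge(t, 1, Integer::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 1);
            if (100.0 * docFreq / numDocs > maxDfPercent) {
                continue; // too common to be informative
            }
            result.put(e.getKey(), e.getValue() * Math.log((double) numDocs / docFreq));
        }
        return result;
    }
}
```

A term that appears in every document gets filtered out (or would receive weight 0 anyway), while a tag repeated eight times in one song gets eight times the weight of a single occurrence.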

The parameters are shown in the figure below:

Generated directories:

root@xin:~# hadoop fs -ls output
Warning: $HADOOP_HOME is deprecated.

Found 7 items
drwxr-xr-x   - root supergroup          0 2015-03-30 14:11 /user/root/output/df-count
-rw-r--r--   1 root supergroup      48768 2015-03-30 14:10 /user/root/output/dictionary.file-0
-rw-r--r--   1 root supergroup      51433 2015-03-30 14:11 /user/root/output/frequency.file-0
drwxr-xr-x   - root supergroup          0 2015-03-30 14:11 /user/root/output/tf-vectors
drwxr-xr-x   - root supergroup          0 2015-03-30 14:12 /user/root/output/tfidf-vectors
drwxr-xr-x   - root supergroup          0 2015-03-30 14:09 /user/root/output/tokenized-documents
drwxr-xr-x   - root supergroup          0 2015-03-30 14:10 /user/root/output/wordcount

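· dictionary.file-0: maps term text -> id (int). Turning terms into ids is common practice.

· frequency.file: maps term id -> collection frequency (cf).

· wordcount (directory): maps term text -> collection frequency (cf); this is presumably the information before any filtering is applied.

· df-count (directory): maps term id -> document frequency (df).

· tf-vectors and tfidf-vectors (both directories): the document vectors, one per document, in the form {id: feature value}, where the feature value is tf or tf-idf respectively. They are stored in the built-in type VectorWritable, so they have to be viewed with the command "mahout vectordump -i <path>".

· tokenized-documents: the documents after tokenization.

Now we can run the K-means algorithm!

mahout kmeans -i output/tfidf-vectors -c output/kmeans-clusters -o output/kmeans -k 10 -x 200 -ow --clustering

The parameters are as follows: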

 

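    -i: the input, i.e. the tf-idf vectors produced above.

    -o: the output directory; the result of every iteration is written here.

    -k: the number of clusters.

    -c: the directory of initial cluster centers. If -k is not set, the points in this directory are used as the initial centers. If -k is set, k points are picked at random as the initial centers.

    -dm: the distance measure; cosine distance is recommended for text (it is not passed in the command above).

    -x: the maximum number of iterations.

    --clustering: after the iterations finish, assign every point to a cluster; this is what produces clusteredPoints.

    --convergenceDelta: the convergence threshold for the iterations. The default is 0.5, which is rather large for cosine distance.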
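The roles of -dm, -x and --convergenceDelta are easier to see in a plain in-memory version of one K-means iteration. The sketch below is an illustration on dense arrays, not Mahout's MapReduce implementation; the class `KMeansStepSketch`, its method names, and the handling of zero vectors and empty clusters are assumptions made for the example.

```java
final class KMeansStepSketch {

    /** Cosine distance: 1 - cos(angle). Zero vectors get distance 1 (an assumption). */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Index of the center that is closest to the point. */
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (cosineDistance(point, centers[c]) < cosineDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** One iteration: assign every point to its nearest center, then average each cluster. */
    static double[][] iterate(double[][] points, double[][] centers) {
        int dim = centers[0].length;
        double[][] sums = new double[centers.length][dim];
        int[] counts = new int[centers.length];
        for (double[] p : points) {
            int c = nearestCenter(p, centers);
            counts[c]++;
            for (int d = 0; d < dim; d++) {
                sums[c][d] += p[d];
            }
        }
        double[][] next = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            if (counts[c] == 0) {
                next[c] = centers[c].clone(); // an empty cluster keeps its old center
            } else {
                next[c] = sums[c];
                for (int d = 0; d < dim; d++) {
                    next[c][d] /= counts[c];
                }
            }
        }
        return next;
    }

    /** Converged when no center moved by more than delta (the role of --convergenceDelta). */
    static boolean converged(double[][] oldCenters, double[][] newCenters, double delta) {
        for (int c = 0; c < oldCenters.length; c++) {
            if (cosineDistance(oldCenters[c], newCenters[c]) > delta) {
                return false;
            }
        }
        return true;
    }
}
```

A driver would call `iterate` until `converged` returns true or the maximum number of iterations (-x) is reached. Because cosine distance never exceeds 1 for non-negative tf-idf vectors, a threshold of 0.5 stops the loop very early, which is why a smaller delta is usually preferable with this measure.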

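In the output, the clusters-N directories (the last one carries the suffix -final) hold the cluster centers after each iteration.

clusteredPoints stores the mapping cluster id -> document id.

It is easiest to copy the result folder kmeans out of HDFS and inspect it locally.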

hadoop fs -get output/kmeans/* /usr/song-kmeans/
Warning: $HADOOP_HOME is deprecated.

hadoop fs -get output/dictionary.file-0 /usr/song-kmeans
Warning: $HADOOP_HOME is deprecated.

mahout clusterdump -i file:///usr/song-kmeans/clusters-5-final  -d file:///usr/song-kmeans/dictionary.file-0 -dt sequencefile -o /usr/song-result/result  -n 20

mahout seqdumper -i file:///usr/song-kmeans/clusteredPoints  -o /usr/song-result/all

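The clusteredPoints file is itself a SequenceFile.

Contents of the result file:


There are clearly too many useless words: the word segmentation is poor, and these words need to be filtered out.

The leading 26 is the cluster ID, and n=7 is the number of documents in the cluster. The c vector is the cluster center, in the form term:weight (the point's coordinates). The r vector is the cluster radius, in the form term:radius.

The Top Terms listed below it are the feature words selected from the cluster.


Contents of the all file:

Key is the cluster ID, as explained for clusterdump above.

Value is the clustering result for one document. wt is the probability that the document belongs to the cluster, which is always 1.0 for K-means. /1.txt is the document identifier: it is the file name assigned by seqdirectory, kept as the vector name by the -nv (--namedVector) option of seq2sparse. What follows are the term ids and weights of that point.

One cluster holds rather a lot of the data and the documents are not spread evenly across the clusters, so the clustering quality is not very good. The document quality still needs to be improved!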
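How uneven the clusters are can be quantified once the cluster ID of every document is known. The sketch below assumes those IDs have already been extracted from the seqdumper output into an `int[]`; the parsing step is left out because the exact dump format is not shown here, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class ClusterSizeSketch {

    /** Counts how many documents were assigned to each cluster ID. */
    static Map<Integer, Integer> sizes(int[] clusterIds) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int id : clusterIds) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    /** Fraction of all documents that ended up in the largest cluster. */
    static double largestShare(int[] clusterIds) {
        if (clusterIds.length == 0) {
            return 0.0;
        }
        int max = 0;
        for (int count : sizes(clusterIds).values()) {
            max = Math.max(max, count);
        }
        return (double) max / clusterIds.length;
    }
}
```

With 10 clusters, a perfectly even split would put about 10% of the documents in each cluster, so a largest share far above that is a quick sign of the imbalance described above.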

The whole process:

root@xin:~# vi /etc/profile
root@xin:~# start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-namenode-xin.out
xin: starting datanode, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-datanode-xin.out
xin: starting secondarynamenode, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-secondarynamenode-xin.out
starting jobtracker, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-jobtracker-xin.out
xin: starting tasktracker, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-tasktracker-xin.out
root@xin:~# jps
3149 NameNode
3541 SecondaryNameNode
3782 TaskTracker
3937 Jps
3632 JobTracker
3382 DataNode


=============================




root@xin:~# mahout seqdirectory -i file:/usr/song-input/ -o file:/usr/song-output/ -c UTF-8 -chunk 64 -xm sequential 
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 13:57:11 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[file:/usr/song-input/], --keyPrefix=[], --method=[sequential], --output=[file:/usr/song-output/], --startPhase=[0], --tempDir=[temp]}
15/03/30 13:57:11 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/30 13:57:11 INFO driver.MahoutDriver: Program took 411 ms (Minutes: 0.00685)



====================================================


root@xin:~# hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Found 3 items
drwxr-xr-x   - root supergroup          0 2015-03-29 19:49 /user/root/input
drwxr-xr-x   - root supergroup          0 2015-03-29 22:31 /user/root/look
drwxr-xr-x   - root supergroup          0 2015-03-29 20:05 /user/root/output
root@xin:~# hadoop fs -rmr input
Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://xin:9000/user/root/input
root@xin:~# hadoop fs -rmr look
Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://xin:9000/user/root/look
root@xin:~# hadoop fs -rmr output
Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://xin:9000/user/root/output
root@xin:~# hadoop fs -mkdir input
Warning: $HADOOP_HOME is deprecated.

root@xin:~# hadoop fs -put /usr/song-output/chunk-0 input
Warning: $HADOOP_HOME is deprecated.

==========================

/58.txt	蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴情歌 蔡琴情歌 经典 经典 经典 经典 经典 台湾 台湾 台湾 70s 70s 70s 70s 70s 70s 慢板 慢板 慢板 祝福 祝福 祝福 祝福 读你 千遍 也 不 厌倦 读你 感觉 像 三月 浪漫 季节 醉人 诗篇 唔 读你 千遍 也 不 厌倦 读你 感觉 象 春天 喜悦 经典 美丽 句点 唔 眉目之间 锁 着 爱怜 唇齿 之间 留着 誓言 一切 移动 左右 视线 是 诗篇 读你 千遍 也 不 厌倦 读你 千遍 也 不 厌倦 读你 感觉 像 三月 浪漫 季节 醉人 诗篇 唔 读你 千遍 也 不 厌倦 读你 感觉 象 春天 喜悦 经典 美丽 句点 唔 眉目之间 锁 着 爱怜 唇齿 之间 留着 誓言 一切 移动 左右 视线 是 诗篇 读你 千遍 也 不 厌倦 眉目之间 锁 着 爱怜 唇齿 之间 留着 誓言 一切 移动 左右 视线 是 诗篇 读你 千遍 也 不 厌倦 读你 千遍 也 不 厌倦 读你 千遍 也 不 厌倦 读你 

root@xin:~# hadoop fs -text input/chunk-0

================================


root@xin:~# mahout seq2sparse -i input -o output -ow --weight tfidf --maxDFPercent  95 --namedVector -a org.apache.lucene.analysis.core.WhitespaceAnalyzer
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in input
15/03/30 14:09:51 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:09:52 INFO mapred.JobClient: Running job: job_201503301351_0001
15/03/30 14:09:53 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:00 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:00 INFO mapred.JobClient: Job complete: job_201503301351_0001
15/03/30 14:10:00 INFO mapred.JobClient: Counters: 19
15/03/30 14:10:00 INFO mapred.JobClient:   Job Counters 
15/03/30 14:10:00 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4936
15/03/30 14:10:00 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:10:00 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:10:00 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:10:00 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:10:00 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
15/03/30 14:10:00 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:10:00 INFO mapred.JobClient:     Bytes Written=131137
15/03/30 14:10:00 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:10:00 INFO mapred.JobClient:     HDFS_BYTES_READ=131227
15/03/30 14:10:00 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=53968
15/03/30 14:10:00 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=131137
15/03/30 14:10:00 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:10:00 INFO mapred.JobClient:     Bytes Read=131123
15/03/30 14:10:00 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:10:00 INFO mapred.JobClient:     Map input records=149
15/03/30 14:10:00 INFO mapred.JobClient:     Physical memory (bytes) snapshot=89587712
15/03/30 14:10:00 INFO mapred.JobClient:     Spilled Records=0
15/03/30 14:10:00 INFO mapred.JobClient:     CPU time spent (ms)=590
15/03/30 14:10:00 INFO mapred.JobClient:     Total committed heap usage (bytes)=120061952
15/03/30 14:10:00 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=675536896
15/03/30 14:10:00 INFO mapred.JobClient:     Map output records=149
15/03/30 14:10:00 INFO mapred.JobClient:     SPLIT_RAW_BYTES=104
15/03/30 14:10:00 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
15/03/30 14:10:00 INFO vectorizer.DictionaryVectorizer: Creating dictionary from output/tokenized-documents and saving at output/wordcount
15/03/30 14:10:00 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:10:00 INFO mapred.JobClient: Running job: job_201503301351_0002
15/03/30 14:10:01 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:06 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:13 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:10:15 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:10:15 INFO mapred.JobClient: Job complete: job_201503301351_0002
15/03/30 14:10:15 INFO mapred.JobClient: Counters: 29
15/03/30 14:10:15 INFO mapred.JobClient:   Job Counters 
15/03/30 14:10:15 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:10:15 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4037
15/03/30 14:10:15 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:10:15 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:10:15 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:10:15 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:10:15 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8693
15/03/30 14:10:15 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:10:15 INFO mapred.JobClient:     Bytes Written=59037
15/03/30 14:10:15 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:10:15 INFO mapred.JobClient:     FILE_BYTES_READ=69108
15/03/30 14:10:15 INFO mapred.JobClient:     HDFS_BYTES_READ=131267
15/03/30 14:10:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=247350
15/03/30 14:10:15 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=59037
15/03/30 14:10:15 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:10:15 INFO mapred.JobClient:     Bytes Read=131137
15/03/30 14:10:15 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:10:15 INFO mapred.JobClient:     Map output materialized bytes=69108
15/03/30 14:10:15 INFO mapred.JobClient:     Map input records=149
15/03/30 14:10:15 INFO mapred.JobClient:     Reduce shuffle bytes=69108
15/03/30 14:10:15 INFO mapred.JobClient:     Spilled Records=8116
15/03/30 14:10:15 INFO mapred.JobClient:     Map output bytes=117804
15/03/30 14:10:15 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:10:15 INFO mapred.JobClient:     CPU time spent (ms)=2850
15/03/30 14:10:15 INFO mapred.JobClient:     Combine input records=8090
15/03/30 14:10:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
15/03/30 14:10:15 INFO mapred.JobClient:     Reduce input records=4058
15/03/30 14:10:15 INFO mapred.JobClient:     Reduce input groups=4058
15/03/30 14:10:15 INFO mapred.JobClient:     Combine output records=4058
15/03/30 14:10:15 INFO mapred.JobClient:     Physical memory (bytes) snapshot=310415360
15/03/30 14:10:15 INFO mapred.JobClient:     Reduce output records=2542
15/03/30 14:10:15 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1367658496
15/03/30 14:10:15 INFO mapred.JobClient:     Map output records=8090
15/03/30 14:10:15 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:10:15 INFO mapred.JobClient: Running job: job_201503301351_0003
15/03/30 14:10:16 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:21 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:29 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:10:31 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:10:31 INFO mapred.JobClient: Job complete: job_201503301351_0003
15/03/30 14:10:31 INFO mapred.JobClient: Counters: 29
15/03/30 14:10:31 INFO mapred.JobClient:   Job Counters 
15/03/30 14:10:31 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:10:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3865
15/03/30 14:10:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:10:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:10:31 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:10:31 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:10:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8558
15/03/30 14:10:31 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:10:31 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:10:31 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:10:31 INFO mapred.JobClient:     FILE_BYTES_READ=178553
15/03/30 14:10:31 INFO mapred.JobClient:     HDFS_BYTES_READ=131267
15/03/30 14:10:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=371870
15/03/30 14:10:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:10:31 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:10:31 INFO mapred.JobClient:     Bytes Read=131137
15/03/30 14:10:31 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:10:31 INFO mapred.JobClient:     Map output materialized bytes=129393
15/03/30 14:10:31 INFO mapred.JobClient:     Map input records=149
15/03/30 14:10:31 INFO mapred.JobClient:     Reduce shuffle bytes=129393
15/03/30 14:10:31 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:10:31 INFO mapred.JobClient:     Map output bytes=128796
15/03/30 14:10:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:10:31 INFO mapred.JobClient:     CPU time spent (ms)=2200
15/03/30 14:10:31 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:10:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
15/03/30 14:10:31 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:10:31 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:10:31 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:10:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=290947072
15/03/30 14:10:31 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:10:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1365016576
15/03/30 14:10:31 INFO mapred.JobClient:     Map output records=149
15/03/30 14:10:31 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:10:31 INFO mapred.JobClient: Running job: job_201503301351_0004
15/03/30 14:10:32 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:37 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:44 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:10:45 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:10:46 INFO mapred.JobClient: Job complete: job_201503301351_0004
15/03/30 14:10:46 INFO mapred.JobClient: Counters: 29
15/03/30 14:10:46 INFO mapred.JobClient:   Job Counters 
15/03/30 14:10:46 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:10:46 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3819
15/03/30 14:10:46 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:10:46 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:10:46 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:10:46 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:10:46 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8497
15/03/30 14:10:46 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:10:46 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:10:46 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:10:46 INFO mapred.JobClient:     FILE_BYTES_READ=69087
15/03/30 14:10:46 INFO mapred.JobClient:     HDFS_BYTES_READ=70499
15/03/30 14:10:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=248304
15/03/30 14:10:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:10:46 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:10:46 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:10:46 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:10:46 INFO mapred.JobClient:     Map output materialized bytes=69087
15/03/30 14:10:46 INFO mapred.JobClient:     Map input records=149
15/03/30 14:10:46 INFO mapred.JobClient:     Reduce shuffle bytes=69087
15/03/30 14:10:46 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:10:46 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:10:46 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:10:46 INFO mapred.JobClient:     CPU time spent (ms)=1850
15/03/30 14:10:46 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:10:46 INFO mapred.JobClient:     SPLIT_RAW_BYTES=128
15/03/30 14:10:46 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:10:46 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:10:46 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:10:46 INFO mapred.JobClient:     Physical memory (bytes) snapshot=296898560
15/03/30 14:10:46 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:10:46 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1361080320
15/03/30 14:10:46 INFO mapred.JobClient:     Map output records=149
15/03/30 14:10:46 INFO common.HadoopUtil: Deleting output/partial-vectors-0
15/03/30 14:10:46 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
15/03/30 14:10:46 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:10:46 INFO mapred.JobClient: Running job: job_201503301351_0005
15/03/30 14:10:47 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:52 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:59 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:11:00 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:11:01 INFO mapred.JobClient: Job complete: job_201503301351_0005
15/03/30 14:11:01 INFO mapred.JobClient: Counters: 29
15/03/30 14:11:01 INFO mapred.JobClient:   Job Counters 
15/03/30 14:11:01 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:11:01 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3979
15/03/30 14:11:01 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:11:01 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:11:01 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:11:01 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:11:01 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8495
15/03/30 14:11:01 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:11:01 INFO mapred.JobClient:     Bytes Written=51453
15/03/30 14:11:01 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:11:01 INFO mapred.JobClient:     FILE_BYTES_READ=35608
15/03/30 14:11:01 INFO mapred.JobClient:     HDFS_BYTES_READ=70500
15/03/30 14:11:01 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=180070
15/03/30 14:11:01 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=51453
15/03/30 14:11:01 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:11:01 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:11:01 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:11:01 INFO mapred.JobClient:     Map output materialized bytes=35608
15/03/30 14:11:01 INFO mapred.JobClient:     Map input records=149
15/03/30 14:11:01 INFO mapred.JobClient:     Reduce shuffle bytes=35608
15/03/30 14:11:01 INFO mapred.JobClient:     Spilled Records=5086
15/03/30 14:11:01 INFO mapred.JobClient:     Map output bytes=80676
15/03/30 14:11:01 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:11:01 INFO mapred.JobClient:     CPU time spent (ms)=2120
15/03/30 14:11:01 INFO mapred.JobClient:     Combine input records=6723
15/03/30 14:11:01 INFO mapred.JobClient:     SPLIT_RAW_BYTES=129
15/03/30 14:11:01 INFO mapred.JobClient:     Reduce input records=2543
15/03/30 14:11:01 INFO mapred.JobClient:     Reduce input groups=2543
15/03/30 14:11:01 INFO mapred.JobClient:     Combine output records=2543
15/03/30 14:11:01 INFO mapred.JobClient:     Physical memory (bytes) snapshot=289153024
15/03/30 14:11:01 INFO mapred.JobClient:     Reduce output records=2543
15/03/30 14:11:01 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1364127744
15/03/30 14:11:01 INFO mapred.JobClient:     Map output records=6723
15/03/30 14:11:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning
15/03/30 14:11:01 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:11:02 INFO mapred.JobClient: Running job: job_201503301351_0006
15/03/30 14:11:03 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:11:08 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:11:15 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:11:16 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:11:17 INFO mapred.JobClient: Job complete: job_201503301351_0006
15/03/30 14:11:17 INFO mapred.JobClient: Counters: 29
15/03/30 14:11:17 INFO mapred.JobClient:   Job Counters 
15/03/30 14:11:17 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:11:17 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3775
15/03/30 14:11:17 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:11:17 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:11:17 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:11:17 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:11:17 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8512
15/03/30 14:11:17 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:11:17 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:11:17 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:11:17 INFO mapred.JobClient:     FILE_BYTES_READ=70763
15/03/30 14:11:17 INFO mapred.JobClient:     HDFS_BYTES_READ=70500
15/03/30 14:11:17 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=149132
15/03/30 14:11:17 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:11:17 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:11:17 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:11:17 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:11:17 INFO mapred.JobClient:     Map output materialized bytes=18910
15/03/30 14:11:17 INFO mapred.JobClient:     Map input records=149
15/03/30 14:11:17 INFO mapred.JobClient:     Reduce shuffle bytes=18910
15/03/30 14:11:17 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:11:17 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:11:17 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:11:17 INFO mapred.JobClient:     CPU time spent (ms)=1710
15/03/30 14:11:17 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:11:17 INFO mapred.JobClient:     SPLIT_RAW_BYTES=129
15/03/30 14:11:17 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:11:17 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:11:17 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:11:17 INFO mapred.JobClient:     Physical memory (bytes) snapshot=288608256
15/03/30 14:11:17 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:11:17 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1364774912
15/03/30 14:11:17 INFO mapred.JobClient:     Map output records=149
15/03/30 14:11:17 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:11:17 INFO mapred.JobClient: Running job: job_201503301351_0007
15/03/30 14:11:18 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:11:22 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:11:30 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:11:31 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:11:31 INFO mapred.JobClient: Job complete: job_201503301351_0007
15/03/30 14:11:31 INFO mapred.JobClient: Counters: 29
15/03/30 14:11:31 INFO mapred.JobClient:   Job Counters 
15/03/30 14:11:31 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:11:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3756
15/03/30 14:11:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:11:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:11:31 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:11:31 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:11:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8400
15/03/30 14:11:31 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:11:31 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:11:31 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:11:31 INFO mapred.JobClient:     FILE_BYTES_READ=69087
15/03/30 14:11:31 INFO mapred.JobClient:     HDFS_BYTES_READ=70510
15/03/30 14:11:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=247208
15/03/30 14:11:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:11:31 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:11:31 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:11:31 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:11:31 INFO mapred.JobClient:     Map output materialized bytes=69087
15/03/30 14:11:31 INFO mapred.JobClient:     Map input records=149
15/03/30 14:11:31 INFO mapred.JobClient:     Reduce shuffle bytes=69087
15/03/30 14:11:31 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:11:31 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:11:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:11:31 INFO mapred.JobClient:     CPU time spent (ms)=1530
15/03/30 14:11:31 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:11:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=139
15/03/30 14:11:31 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:11:31 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:11:31 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:11:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=288825344
15/03/30 14:11:31 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:11:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1367453696
15/03/30 14:11:31 INFO mapred.JobClient:     Map output records=149
15/03/30 14:11:31 INFO common.HadoopUtil: Deleting output/tf-vectors-partial
15/03/30 14:11:31 INFO common.HadoopUtil: Deleting output/tf-vectors-toprune
15/03/30 14:11:31 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:11:31 INFO mapred.JobClient: Running job: job_201503301351_0008
15/03/30 14:11:32 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:11:37 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:11:44 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:11:45 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:11:46 INFO mapred.JobClient: Job complete: job_201503301351_0008
15/03/30 14:11:46 INFO mapred.JobClient: Counters: 29
15/03/30 14:11:46 INFO mapred.JobClient:   Job Counters 
15/03/30 14:11:46 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:11:46 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3788
15/03/30 14:11:46 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:11:46 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:11:46 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:11:46 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:11:46 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8494
15/03/30 14:11:46 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:11:46 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:11:46 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:11:46 INFO mapred.JobClient:     FILE_BYTES_READ=120932
15/03/30 14:11:46 INFO mapred.JobClient:     HDFS_BYTES_READ=70492
15/03/30 14:11:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250986
15/03/30 14:11:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:11:46 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:11:46 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:11:46 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:11:46 INFO mapred.JobClient:     Map output materialized bytes=69087
15/03/30 14:11:46 INFO mapred.JobClient:     Map input records=149
15/03/30 14:11:46 INFO mapred.JobClient:     Reduce shuffle bytes=69087
15/03/30 14:11:46 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:11:46 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:11:46 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:11:46 INFO mapred.JobClient:     CPU time spent (ms)=1570
15/03/30 14:11:46 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:11:46 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:11:46 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:11:46 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:11:46 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:11:46 INFO mapred.JobClient:     Physical memory (bytes) snapshot=289206272
15/03/30 14:11:46 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:11:46 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1363017728
15/03/30 14:11:46 INFO mapred.JobClient:     Map output records=149
15/03/30 14:11:46 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:11:47 INFO mapred.JobClient: Running job: job_201503301351_0009
15/03/30 14:11:48 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:11:52 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:11:59 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:12:01 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:12:01 INFO mapred.JobClient: Job complete: job_201503301351_0009
15/03/30 14:12:01 INFO mapred.JobClient: Counters: 29
15/03/30 14:12:01 INFO mapred.JobClient:   Job Counters 
15/03/30 14:12:01 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:12:01 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3728
15/03/30 14:12:01 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:12:01 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:12:01 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:12:01 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:12:01 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8502
15/03/30 14:12:01 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:12:01 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:12:01 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:12:01 INFO mapred.JobClient:     FILE_BYTES_READ=69087
15/03/30 14:12:01 INFO mapred.JobClient:     HDFS_BYTES_READ=70499
15/03/30 14:12:01 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=248294
15/03/30 14:12:01 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:12:01 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:12:01 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:12:01 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:12:01 INFO mapred.JobClient:     Map output materialized bytes=69087
15/03/30 14:12:01 INFO mapred.JobClient:     Map input records=149
15/03/30 14:12:01 INFO mapred.JobClient:     Reduce shuffle bytes=69087
15/03/30 14:12:01 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:12:01 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:12:01 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:12:01 INFO mapred.JobClient:     CPU time spent (ms)=2130
15/03/30 14:12:01 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:12:01 INFO mapred.JobClient:     SPLIT_RAW_BYTES=128
15/03/30 14:12:01 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:12:01 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:12:01 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:12:01 INFO mapred.JobClient:     Physical memory (bytes) snapshot=301170688
15/03/30 14:12:01 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:12:01 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1368752128
15/03/30 14:12:01 INFO mapred.JobClient:     Map output records=149
15/03/30 14:12:01 INFO common.HadoopUtil: Deleting output/partial-vectors-0
15/03/30 14:12:01 INFO driver.MahoutDriver: Program took 130017 ms (Minutes: 2.16695)


====================================

root@xin:~# hadoop fs -ls output
Warning: $HADOOP_HOME is deprecated.

Found 7 items
drwxr-xr-x   - root supergroup          0 2015-03-30 14:11 /user/root/output/df-count
-rw-r--r--   1 root supergroup      48768 2015-03-30 14:10 /user/root/output/dictionary.file-0
-rw-r--r--   1 root supergroup      51433 2015-03-30 14:11 /user/root/output/frequency.file-0
drwxr-xr-x   - root supergroup          0 2015-03-30 14:11 /user/root/output/tf-vectors
drwxr-xr-x   - root supergroup          0 2015-03-30 14:12 /user/root/output/tfidf-vectors
drwxr-xr-x   - root supergroup          0 2015-03-30 14:09 /user/root/output/tokenized-documents
drwxr-xr-x   - root supergroup          0 2015-03-30 14:10 /user/root/output/wordcount


==========================================



root@xin:~# mahout kmeans -i output/tf-vectors -c output/kmeans-clusters -o output/kmeans -k 10 -x 200 -ow --clustering
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 14:17:04 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[output/kmeans-clusters], --convergenceDelta=[0.5], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[output/tf-vectors], --maxIter=[200], --method=[mapreduce], --numClusters=[10], --output=[output/kmeans], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
15/03/30 14:17:04 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/30 14:17:04 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/03/30 14:17:04 INFO compress.CodecPool: Got brand-new compressor
15/03/30 14:17:04 INFO kmeans.RandomSeedGenerator: Wrote 10 Klusters to output/kmeans-clusters/part-randomSeed
15/03/30 14:17:04 INFO kmeans.KMeansDriver: Input: output/tf-vectors Clusters In: output/kmeans-clusters/part-randomSeed Out: output/kmeans
15/03/30 14:17:04 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 200
15/03/30 14:17:04 INFO compress.CodecPool: Got brand-new decompressor
15/03/30 14:17:05 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:05 INFO mapred.JobClient: Running job: job_201503301351_0010
15/03/30 14:17:06 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:17:11 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:17:18 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:17:19 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:17:20 INFO mapred.JobClient: Job complete: job_201503301351_0010
15/03/30 14:17:20 INFO mapred.JobClient: Counters: 29
15/03/30 14:17:20 INFO mapred.JobClient:   Job Counters 
15/03/30 14:17:20 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:17:20 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4154
15/03/30 14:17:20 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:17:20 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:17:20 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:17:20 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:17:20 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8558
15/03/30 14:17:20 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:17:20 INFO mapred.JobClient:     Bytes Written=64996
15/03/30 14:17:20 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:17:20 INFO mapred.JobClient:     FILE_BYTES_READ=70419
15/03/30 14:17:20 INFO mapred.JobClient:     HDFS_BYTES_READ=96550
15/03/30 14:17:20 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250490
15/03/30 14:17:20 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=64996
15/03/30 14:17:20 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:17:20 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:17:20 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:17:20 INFO mapred.JobClient:     Map output materialized bytes=70419
15/03/30 14:17:20 INFO mapred.JobClient:     Map input records=149
15/03/30 14:17:20 INFO mapred.JobClient:     Reduce shuffle bytes=70419
15/03/30 14:17:20 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:17:20 INFO mapred.JobClient:     Map output bytes=70373
15/03/30 14:17:20 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:17:20 INFO mapred.JobClient:     CPU time spent (ms)=2950
15/03/30 14:17:20 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:17:20 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:17:20 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:17:20 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:17:20 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:17:20 INFO mapred.JobClient:     Physical memory (bytes) snapshot=306675712
15/03/30 14:17:20 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:17:20 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1367547904
15/03/30 14:17:20 INFO mapred.JobClient:     Map output records=10
15/03/30 14:17:20 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:20 INFO mapred.JobClient: Running job: job_201503301351_0011
15/03/30 14:17:21 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:17:26 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:17:34 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:17:36 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:17:36 INFO mapred.JobClient: Job complete: job_201503301351_0011
15/03/30 14:17:36 INFO mapred.JobClient: Counters: 29
15/03/30 14:17:36 INFO mapred.JobClient:   Job Counters 
15/03/30 14:17:36 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:17:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4041
15/03/30 14:17:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:17:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:17:36 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:17:36 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:17:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8708
15/03/30 14:17:36 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:17:36 INFO mapred.JobClient:     Bytes Written=64018
15/03/30 14:17:36 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:17:36 INFO mapred.JobClient:     FILE_BYTES_READ=128966
15/03/30 14:17:36 INFO mapred.JobClient:     HDFS_BYTES_READ=200872
15/03/30 14:17:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=367584
15/03/30 14:17:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=64018
15/03/30 14:17:36 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:17:36 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:17:36 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:17:36 INFO mapred.JobClient:     Map output materialized bytes=128966
15/03/30 14:17:36 INFO mapred.JobClient:     Map input records=149
15/03/30 14:17:36 INFO mapred.JobClient:     Reduce shuffle bytes=128966
15/03/30 14:17:36 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:17:36 INFO mapred.JobClient:     Map output bytes=128919
15/03/30 14:17:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:17:36 INFO mapred.JobClient:     CPU time spent (ms)=3050
15/03/30 14:17:36 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:17:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:17:36 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:17:36 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:17:36 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:17:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=301654016
15/03/30 14:17:36 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:17:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1368375296
15/03/30 14:17:36 INFO mapred.JobClient:     Map output records=10
15/03/30 14:17:36 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:36 INFO mapred.JobClient: Running job: job_201503301351_0012
15/03/30 14:17:37 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:17:42 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:17:49 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:17:50 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:17:51 INFO mapred.JobClient: Job complete: job_201503301351_0012
15/03/30 14:17:51 INFO mapred.JobClient: Counters: 29
15/03/30 14:17:51 INFO mapred.JobClient:   Job Counters 
15/03/30 14:17:51 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:17:51 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4081
15/03/30 14:17:51 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:17:51 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:17:51 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:17:51 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:17:51 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8601
15/03/30 14:17:51 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:17:51 INFO mapred.JobClient:     Bytes Written=61455
15/03/30 14:17:51 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:17:51 INFO mapred.JobClient:     FILE_BYTES_READ=125434
15/03/30 14:17:51 INFO mapred.JobClient:     HDFS_BYTES_READ=198916
15/03/30 14:17:51 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=360520
15/03/30 14:17:51 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=61455
15/03/30 14:17:51 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:17:51 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:17:51 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:17:51 INFO mapred.JobClient:     Map output materialized bytes=125434
15/03/30 14:17:51 INFO mapred.JobClient:     Map input records=149
15/03/30 14:17:51 INFO mapred.JobClient:     Reduce shuffle bytes=125434
15/03/30 14:17:51 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:17:51 INFO mapred.JobClient:     Map output bytes=125387
15/03/30 14:17:51 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:17:51 INFO mapred.JobClient:     CPU time spent (ms)=2850
15/03/30 14:17:51 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:17:51 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:17:51 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:17:51 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:17:51 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:17:51 INFO mapred.JobClient:     Physical memory (bytes) snapshot=298000384
15/03/30 14:17:51 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:17:51 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1369067520
15/03/30 14:17:51 INFO mapred.JobClient:     Map output records=10
15/03/30 14:17:51 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:51 INFO mapred.JobClient: Running job: job_201503301351_0013
15/03/30 14:17:52 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:17:57 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:18:04 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:18:06 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:18:06 INFO mapred.JobClient: Job complete: job_201503301351_0013
15/03/30 14:18:06 INFO mapred.JobClient: Counters: 29
15/03/30 14:18:06 INFO mapred.JobClient:   Job Counters 
15/03/30 14:18:06 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:18:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4191
15/03/30 14:18:06 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:18:06 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:18:06 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:18:06 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:18:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8661
15/03/30 14:18:06 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:18:06 INFO mapred.JobClient:     Bytes Written=61248
15/03/30 14:18:06 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:18:06 INFO mapred.JobClient:     FILE_BYTES_READ=121841
15/03/30 14:18:06 INFO mapred.JobClient:     HDFS_BYTES_READ=193790
15/03/30 14:18:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=353334
15/03/30 14:18:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=61248
15/03/30 14:18:06 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:18:06 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:18:06 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:18:06 INFO mapred.JobClient:     Map output materialized bytes=121841
15/03/30 14:18:06 INFO mapred.JobClient:     Map input records=149
15/03/30 14:18:06 INFO mapred.JobClient:     Reduce shuffle bytes=121841
15/03/30 14:18:06 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:18:06 INFO mapred.JobClient:     Map output bytes=121794
15/03/30 14:18:06 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:18:06 INFO mapred.JobClient:     CPU time spent (ms)=3380
15/03/30 14:18:06 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:18:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:18:06 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:18:06 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:18:06 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:18:06 INFO mapred.JobClient:     Physical memory (bytes) snapshot=306253824
15/03/30 14:18:06 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:18:06 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1372364800
15/03/30 14:18:06 INFO mapred.JobClient:     Map output records=10
15/03/30 14:18:06 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:18:06 INFO mapred.JobClient: Running job: job_201503301351_0014
15/03/30 14:18:07 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:18:12 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:18:19 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:18:21 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:18:21 INFO mapred.JobClient: Job complete: job_201503301351_0014
15/03/30 14:18:21 INFO mapred.JobClient: Counters: 29
15/03/30 14:18:21 INFO mapred.JobClient:   Job Counters 
15/03/30 14:18:21 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:18:21 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4242
15/03/30 14:18:21 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:18:21 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:18:21 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:18:21 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:18:21 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8624
15/03/30 14:18:21 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:18:21 INFO mapred.JobClient:     Bytes Written=61248
15/03/30 14:18:21 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:18:21 INFO mapred.JobClient:     FILE_BYTES_READ=121634
15/03/30 14:18:21 INFO mapred.JobClient:     HDFS_BYTES_READ=193376
15/03/30 14:18:21 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=352920
15/03/30 14:18:21 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=61248
15/03/30 14:18:21 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:18:21 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:18:21 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:18:21 INFO mapred.JobClient:     Map output materialized bytes=121634
15/03/30 14:18:21 INFO mapred.JobClient:     Map input records=149
15/03/30 14:18:21 INFO mapred.JobClient:     Reduce shuffle bytes=121634
15/03/30 14:18:21 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:18:21 INFO mapred.JobClient:     Map output bytes=121587
15/03/30 14:18:21 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:18:21 INFO mapred.JobClient:     CPU time spent (ms)=3060
15/03/30 14:18:21 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:18:21 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:18:21 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:18:21 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:18:21 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:18:21 INFO mapred.JobClient:     Physical memory (bytes) snapshot=295936000
15/03/30 14:18:21 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:18:21 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1362153472
15/03/30 14:18:21 INFO mapred.JobClient:     Map output records=10
15/03/30 14:18:21 INFO kmeans.KMeansDriver: Clustering data
15/03/30 14:18:21 INFO kmeans.KMeansDriver: Running Clustering
15/03/30 14:18:21 INFO kmeans.KMeansDriver: Input: output/tf-vectors Clusters In: output/kmeans Out: output/kmeans
15/03/30 14:18:22 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:18:22 INFO mapred.JobClient: Running job: job_201503301351_0015
15/03/30 14:18:23 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:18:29 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:18:30 INFO mapred.JobClient: Job complete: job_201503301351_0015
15/03/30 14:18:30 INFO mapred.JobClient: Counters: 19
15/03/30 14:18:30 INFO mapred.JobClient:   Job Counters 
15/03/30 14:18:30 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5264
15/03/30 14:18:30 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:18:30 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:18:30 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:18:30 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:18:30 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
15/03/30 14:18:30 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:18:30 INFO mapred.JobClient:     Bytes Written=75851
15/03/30 14:18:30 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:18:30 INFO mapred.JobClient:     HDFS_BYTES_READ=131934
15/03/30 14:18:30 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54540
15/03/30 14:18:30 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=75851
15/03/30 14:18:30 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:18:30 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:18:30 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:18:30 INFO mapred.JobClient:     Map input records=149
15/03/30 14:18:30 INFO mapred.JobClient:     Physical memory (bytes) snapshot=113307648
15/03/30 14:18:30 INFO mapred.JobClient:     Spilled Records=0
15/03/30 14:18:30 INFO mapred.JobClient:     CPU time spent (ms)=1620
15/03/30 14:18:30 INFO mapred.JobClient:     Total committed heap usage (bytes)=120061952
15/03/30 14:18:30 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=680222720
15/03/30 14:18:30 INFO mapred.JobClient:     Map output records=149
15/03/30 14:18:30 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:18:30 INFO driver.MahoutDriver: Program took 86159 ms (Minutes: 1.4359833333333334)
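
The "Command line arguments" line in the log above shows two values that Mahout filled in because the command did not set them: --convergenceDelta 0.5 and --distanceMeasure SquaredEuclideanDistanceMeasure. As a rough sketch, a small wrapper could spell these out so that a run is easier to reproduce. The name `kmeans_cmd` is hypothetical, and the function only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the kmeans command with the defaults
# from the log written out explicitly. Option names are copied from the
# "Command line arguments" line of the run above.
kmeans_cmd() {
    input=$1      # vector directory, e.g. output/tf-vectors
    k=$2          # number of clusters
    max_iter=$3   # iteration cap
    printf 'mahout kmeans -i %s -c output/kmeans-clusters -o output/kmeans' "$input"
    printf ' -k %s -x %s --convergenceDelta 0.5' "$k" "$max_iter"
    printf ' --distanceMeasure org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure'
    printf ' -ow --clustering\n'
}

kmeans_cmd output/tf-vectors 10 200
```

Printing the command first makes it easy to review the paths before anything on HDFS is overwritten by -ow.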


======================================

root@xin:~# hadoop fs -ls output/kmeans
Warning: $HADOOP_HOME is deprecated.

Found 8 items
-rw-r--r--   1 root supergroup        194 2015-03-30 14:18 /user/root/output/kmeans/_policy
drwxr-xr-x   - root supergroup          0 2015-03-30 14:18 /user/root/output/kmeans/clusteredPoints
drwxr-xr-x   - root supergroup          0 2015-03-30 14:17 /user/root/output/kmeans/clusters-0
drwxr-xr-x   - root supergroup          0 2015-03-30 14:17 /user/root/output/kmeans/clusters-1
drwxr-xr-x   - root supergroup          0 2015-03-30 14:17 /user/root/output/kmeans/clusters-2
drwxr-xr-x   - root supergroup          0 2015-03-30 14:17 /user/root/output/kmeans/clusters-3
drwxr-xr-x   - root supergroup          0 2015-03-30 14:18 /user/root/output/kmeans/clusters-4
drwxr-xr-x   - root supergroup          0 2015-03-30 14:18 /user/root/output/kmeans/clusters-5-final
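
The directory whose name ends in -final holds the clusters from the last iteration, and its number depends on how many iterations the run needed (5 here). A minimal sketch for picking it out of the listing rather than typing the number by hand follows; `final_clusters_dir` is a hypothetical helper that reads `hadoop fs -ls` output on standard input.

```shell
#!/bin/sh
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print the
# path of the directory whose name ends in "-final" (the last iteration).
final_clusters_dir() {
    awk '$NF ~ /\/clusters-[0-9]+-final$/ { print $NF }'
}

# Usage against a live cluster (not run here):
#   hadoop fs -ls output/kmeans | final_clusters_dir
```

Lines such as "Found 8 items" or the $HADOOP_HOME warning are skipped because their last field does not match the pattern.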


======================================

root@xin:~# hadoop fs -get output/kmeans/* /usr/song-kmeans/
Warning: $HADOOP_HOME is deprecated.

root@xin:~# hadoop fs -get output/dictionary.file-0 /usr/song-kmeans
Warning: $HADOOP_HOME is deprecated.

root@xin:~# mahout clusterdump -i file:///usr/song-kmeans/clusters-5-final  -d file:///usr/song-kmeans/dictionary.file-0 -dt sequencefile -o /usr/song-result/result  -n 20
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 14:34:08 INFO common.AbstractJob: Command line arguments: {--dictionary=[file:///usr/song-kmeans/dictionary.file-0], --dictionaryType=[sequencefile], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[file:///usr/song-kmeans/clusters-5-final], --numWords=[20], --output=[/usr/song-result/result], --outputFormat=[TEXT], --startPhase=[0], --tempDir=[temp]}
15/03/30 14:34:09 INFO clustering.ClusterDumper: Wrote 10 clusters
15/03/30 14:34:09 INFO driver.MahoutDriver: Program took 716 ms (Minutes: 0.011933333333333334)





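If -o points at an existing directory (here /usr/song-result) rather than a file inside it, clusterdump fails like this:
Exception in thread "main" java.io.FileNotFoundException: /usr/song-result (Is a directory)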
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
	at com.google.common.io.Files.newWriter(Files.java:103)
	at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:187)
	at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:157)
	at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:101)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
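
A check along these lines could catch a directory passed as the output path before Mahout starts. It is only a sketch: `check_output_file` is a hypothetical name, and it tests the local filesystem, which is where the trace shows clusterdump opening its output.

```shell
#!/bin/sh
# Hypothetical pre-flight check: fail if the given local path is an existing
# directory, because clusterdump opens the -o path as a plain file.
check_output_file() {
    out=$1
    if [ -d "$out" ]; then
        echo "error: $out is a directory; pass a file path such as $out/result" >&2
        return 1
    fi
    return 0
}

# Usage (not run here): run the check, then pass the same path to -o.
#   check_output_file /usr/song-result/result
```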

===========================================



root@xin:~# mahout seqdumper -i file:///usr/song-kmeans/clusteredPoints  -o /usr/song-result/all 
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 14:44:28 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[file:///usr/song-kmeans/clusteredPoints], --output=[/usr/song-result/all], --startPhase=[0], --tempDir=[temp]}
15/03/30 14:44:29 INFO driver.MahoutDriver: Program took 634 ms (Minutes: 0.010566666666666667)
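
Once the dump is written to /usr/song-result/all, a quick tally of how many songs fell into each cluster is a useful sanity check. The sketch below assumes that seqdumper writes one record per line in the form "Key: <id>: Value: ..." and that for clusteredPoints the key is the cluster ID; the output file is not shown in this post, so treat that format as an assumption. `points_per_cluster` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count records per key in seqdumper text output.
# Assumption: each record is one line of the form "Key: <id>: Value: ...",
# where <id> is the cluster ID for clusteredPoints.
points_per_cluster() {
    awk -F': ' '/^Key: / { count[$2]++ }
                END { for (id in count) print id, count[id] }' "$1" | sort -n
}

# Usage (not run here):
#   points_per_cluster /usr/song-result/all
```

Each output line is a cluster ID followed by the number of records assigned to it, sorted by ID.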

