Mahout Algorithm Source Code Analysis: Collaborative Filtering with ALS-WR (Part 1: Hands-On)


Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.

Learning is always a process of pain mixed with joy...

Today I will give a brief introduction to Collaborative Filtering with ALS-WR in Mahout. If you ask me what this algorithm is, the most I can tell you is that it is a recommendation algorithm; beyond that, I don't know either. The main reference here is the official introduction, Collaborative Filtering with ALS-WR.

This post is the hands-on part: first get the algorithm running, without worrying about the implementation details; observe what happens, and only then analyze how it is actually implemented. The official introduction says the algorithm is driven by examples/bin/factorize-movielens-1M.sh, so let's open that file and take a look:

# Instructions:
#
# Before using this script, you have to download and extract the Movielens 1M dataset
# from http://www.grouplens.org/node/73
#
# To run:  change into the mahout directory and type:
#  examples/bin/factorize-movielens-1M.sh /path/to/ratings.dat

if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
  echo "This script runs the Alternating Least Squares Recommender on the Grouplens data set (size 1M)."
  echo "Syntax: $0 /path/to/ratings.dat\n"
  exit
fi

if [ $# -ne 1 ]
then
  echo -e "\nYou have to download the Movielens 1M dataset from http://www.grouplens.org/node/73 before"
  echo -e "you can run this example. After that extract it and supply the path to the ratings.dat file.\n"
  echo -e "Syntax: $0 /path/to/ratings.dat\n"
  exit -1
fi

MAHOUT="../../bin/mahout"

WORK_DIR=/tmp/mahout-work-${USER}
echo "creating work directory at ${WORK_DIR}"
mkdir -p ${WORK_DIR}/movielens

echo "Converting ratings..."
cat $1 |sed -e s/::/,/g| cut -d, -f1,2,3 > ${WORK_DIR}/movielens/ratings.csv

# create a 90% percent training set and a 10% probe set
$MAHOUT splitDataset --input ${WORK_DIR}/movielens/ratings.csv --output ${WORK_DIR}/dataset \
    --trainingPercentage 0.9 --probePercentage 0.1 --tempDir ${WORK_DIR}/dataset/tmp

# run distributed ALS-WR to factorize the rating matrix defined by the training set
$MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
    --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065

# compute predictions against the probe set, measure the error
$MAHOUT evaluateFactorization --input ${WORK_DIR}/dataset/probeSet/ --output ${WORK_DIR}/als/rmse/ \
    --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ --tempDir ${WORK_DIR}/als/tmp

# compute recommendations
$MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
    --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
    --numRecommendations 6 --maxRating 5

# print the error
echo -e "\nRMSE is:\n"
cat ${WORK_DIR}/als/rmse/rmse.txt
echo -e "\n"

echo -e "\nSample recommendations:\n"
shuf ${WORK_DIR}/recommendations/part-m-00000 |head
echo -e "\n\n"

echo "removing work directory"
rm -rf ${WORK_DIR}
As you can see, there are five operations in total: (1) convert the raw data into the format we need; (2) split the dataset; (3) run parallel ALS; (4) evaluate the model; (5) produce recommendations. Let's go through them one by one:

(1) Convert the data. Download the raw data from the MovieLens Data Sets page (the 1M dataset is used here). After extracting it, open ratings.dat and you will see data like this:

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
Then use the Linux command cat ratings.dat | sed -e s/::/,/g | cut -d, -f1,2,3 > ratings.csv to convert the data into the following form:

1,1193,5
1,661,3
1,914,3
1,3408,4
1,2355,5
1,1197,3
1,1287,5
1,2804,5
1,594,4
1,919,4
A brief note on the data formats: each line of ratings.dat has the structure UserID::MovieID::Rating::Timestamp, and after conversion each line becomes UserID,MovieID,Rating.

Then upload the generated ratings.csv to HDFS to prepare for the next step.
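For example, something like the following works (the HDFS directory name input is my choice here, not fixed by anything above; adjust the paths to your setup):

# sanity check: the MovieLens 1M dataset contains 1,000,209 ratings
wc -l ratings.csv
head -3 ratings.csv

# upload to HDFS (the target directory "input" is an assumption)
hadoop fs -mkdir input
hadoop fs -put ratings.csv input/ratings.csv
hadoop fs -ls input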

(2) Split the dataset into training and probe sets: go to the Mahout root directory and use the splitDataset command. Its options are shown below:

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:                                                           
  --input (-i) input                              Path to job input directory.  
  --output (-o) output                            The directory pathname for    
                                                  output.                       
  --trainingPercentage (-t) trainingPercentage    percentage of the data to use 
                                                  as training set (default:     
                                                  0.9)                          
  --probePercentage (-p) probePercentage          percentage of the data to use 
                                                  as probe set (default: 0.1)   
  --help (-h)                                     Print out help                
  --tempDir tempDir                               Intermediate output directory 
  --startPhase startPhase                         First phase to run            
  --endPhase endPhase                             Last phase to run          
The command is: ./mahout splitDataset -i input/ratings.csv -o output/als -t 0.9 -p 0.1 --tempDir temp. Once it completes, you can see that it ran three Jobs in total, each producing one output: (a) appears to be a conversion of the raw data, with 1,000,209 map input records and 1,000,209 output records; (b) produces the training set: 1,000,209 input records, 900,362 output records; (c) produces the probe set: 1,000,209 input records, 99,847 output records (note that 900,362 + 99,847 = 1,000,209, the total number of ratings).
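To double-check the split, you can list the output directory; the trainingSet and probeSet subdirectory names come from the script shown earlier:

hadoop fs -ls output/als
# should show the two directories the later stages consume:
#   output/als/trainingSet
#   output/als/probeSet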

(3) Parallel ALS: the command is ./mahout parallelALS. First, its usage and options:

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:                                                           
  --input (-i) input                     Path to job input directory.           
  --output (-o) output                   The directory pathname for output.     
  --lambda lambda                        regularization parameter               
  --implicitFeedback implicitFeedback    data consists of implicit feedback?    
  --alpha alpha                          confidence parameter (only used on     
                                         implicit feedback)                     
  --numFeatures numFeatures              dimension of the feature space         
  --numIterations numIterations          number of iterations                   
  --help (-h)                            Print out help                         
  --tempDir tempDir                      Intermediate output directory          
  --startPhase startPhase                First phase to run                     
  --endPhase endPhase                    Last phase to run        
Then run: ./mahout parallelALS -i output/als/trainingSet -o output/als/als --tempDir temp/als --numFeatures 20 --numIterations 10 --lambda 0.065
From the parameters you would expect ten iterations, but after running the command you can see that Mahout launched far more than 10 Jobs. It first runs 3 Jobs, and then prints the following message (one line per job):

13/10/03 21:27:24 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 0/10)
13/10/03 21:27:50 INFO als.ParallelALSFactorizationJob: Recomputing M (iteration 0/10)
13/10/03 21:28:20 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 1/10)
...
13/10/03 21:35:51 INFO als.ParallelALSFactorizationJob: Recomputing U (iteration 9/10)
13/10/03 21:36:17 INFO als.ParallelALSFactorizationJob: Recomputing M (iteration 9/10)
The output directory contains three folders, M, U, and userRatings, while the temp directory contains U0~U8, M0~M8, M--1, averageRatings, and itemRatings.
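For context, the page referenced above points to the ALS-WR paper by Zhou et al.: each iteration alternately recomputes the user feature matrix U and the item feature matrix M (the "Recomputing U" / "Recomputing M" log lines above) so as to minimize the regularized squared rating error, roughly:

\min_{U,M} \sum_{(u,i) \in R} \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{m}_i \right)^2 + \lambda \left( \sum_u n_u \lVert \mathbf{u}_u \rVert^2 + \sum_i n_i \lVert \mathbf{m}_i \rVert^2 \right)

Here r_{ui} is user u's rating of item i, n_u and n_i are the numbers of ratings by user u and of item i (this weighting is the "WR" part), the feature vectors \mathbf{u}_u and \mathbf{m}_i have numFeatures = 20 dimensions, and \lambda = 0.065 is the regularization parameter. Presumably the U0~U8 and M0~M8 folders in temp hold the intermediate factor matrices from the iterations.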

(4) Evaluate the model: the Mahout command is evaluateFactorization. First, its usage and options:

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:                                                           
  --input (-i) input             Path to job input directory.                   
  --userFeatures userFeatures    path to the user feature matrix                
  --itemFeatures itemFeatures    path to the item feature matrix                
  --output (-o) output           The directory pathname for output.             
  --help (-h)                    Print out help                                 
  --tempDir tempDir              Intermediate output directory                  
  --startPhase startPhase        First phase to run                             
  --endPhase endPhase            Last phase to run 
Run it as follows: ./mahout evaluateFactorization -i output/als/probeSet -o output/rmse --userFeatures output/als/als/U --itemFeatures output/als/als/M --tempDir temp/rmse. When the command finishes, you can read the root-mean-square error from output/rmse/rmse.txt on HDFS: 0.8548619405669956 (that RMSE looks rather small, doesn't it?).
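For reference, this is the standard root-mean-square error of the predicted ratings \hat{r}_{ui} = \mathbf{u}_u^{\top} \mathbf{m}_i over the probe set P:

\mathrm{RMSE} = \sqrt{ \frac{1}{\lvert P \rvert} \sum_{(u,i) \in P} \left( r_{ui} - \hat{r}_{ui} \right)^2 }

Since MovieLens ratings run from 1 to 5, an RMSE of about 0.85 means the predictions miss by a bit less than one star on average, so the value is plausible rather than suspiciously small.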

(5) Recommend: the command used for recommendation is recommendfactorized. Its usage and options are:

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:                                                           
  --input (-i) input                         Path to job input directory.       
  --userFeatures userFeatures                path to the user feature matrix    
  --itemFeatures itemFeatures                path to the item feature matrix    
  --numRecommendations numRecommendations    number of recommendations per user 
  --maxRating maxRating                      maximum rating available           
  --output (-o) output                       The directory pathname for output. 
  --help (-h)                                Print out help                     
  --tempDir tempDir                          Intermediate output directory      
  --startPhase startPhase                    First phase to run                 
  --endPhase endPhase                        Last phase to run 
Run: ./mahout recommendfactorized -i output/als/als/userRatings -o output/recommendations --userFeatures output/als/als/U --itemFeatures output/als/als/M --numRecommendations 6 --maxRating 5. When it finishes, the terminal shows 6,040 map output records, which exactly matches the number of users in the dataset, and you can inspect the corresponding recommendation output on HDFS:
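For example (using the paths above; the exact part-file name may vary on your cluster), something like:

hadoop fs -cat output/recommendations/part-m-00000 | head

should print one line per user, each with that user's top-6 recommended item IDs and their predicted ratings, capped by maxRating at 5.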




Share, grow, be happy

If you repost, please cite the blog address: http://blog.csdn.net/fansy1990


