Mahout Algorithm Source Code Analysis: Collaborative Filtering with ALS-WR (3) QR Decomposition Data Flow (1)
The line of code to analyze is:

Vector solve(Matrix Ai, Matrix Vi) {
  return new QRDecomposition(Ai).solve(Vi).viewColumn(0);
}

Although it is just a single line of code, there is a lot going on inside it... ah, I really should have studied linear algebra more carefully back in school...
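Before tracing the data flow, here is a minimal standalone sketch of what this single line computes: the solution x of the linear system Ai * x = Vi, obtained via QR decomposition. (QrSolveSketch and the 3x3 values below are made up for illustration; QRDecomposition is Mahout's org.apache.mahout.math.QRDecomposition.)

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.QRDecomposition;
import org.apache.mahout.math.Vector;

public class QrSolveSketch {
  public static void main(String[] args) {
    // a small made-up system Ai * x = Vi
    Matrix ai = new DenseMatrix(new double[][]{
        {4.0, 1.0, 0.0},
        {1.0, 3.0, 1.0},
        {0.0, 1.0, 2.0}});
    Matrix vi = new DenseMatrix(new double[][]{{1.0}, {2.0}, {3.0}});
    // QR-decompose Ai, then back-substitute to solve for x;
    // Vi has a single column, so viewColumn(0) turns the 3x1
    // result matrix into a Vector
    Vector x = new QRDecomposition(ai).solve(vi).viewColumn(0);
    System.out.println(x);
  }
}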
The small data set (listing 2.1, page 15 of Mahout in Action):

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

To obtain Vi and Ai, first write the following code, set a breakpoint after the initializeM function of ParallelALSFactorizationJob, and run it:
package mahout.fansy.als.test;

import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

public class TestParallelALSFactorizationJob {
  /**
   * Tests ParallelALSFactorizationJob with the small data set;
   * set a breakpoint after the initializeM function to capture the necessary data.
   * Used mainly to prepare the data before analyzing the QR data flow.
   * The small data set is listing 2.1, page 15 of Mahout in Action.
   * @throws Exception
   */
  public static void main(String[] args) throws Exception {
    String[] arg = new String[]{"-jt","ubuntu:9001","-fs","ubuntu:9000",
        "-i","hdfs://ubuntu:9000/test/input/user_item",
        "-o","hdfs://ubuntu:9000/test/output",
        "--lambda","0.065","--numFeatures","3","--numIterations","3",
        "--tempDir","hdfs://ubuntu:9000/test/temp"
    };
    ParallelALSFactorizationJob.main(arg);
  }
}

For the input, simply copy the data above and upload it to the corresponding HDFS location. Then use the code below (this is the imitation code for SolveExplicitFeedbackMapper, the same as in the previous post, with only the paths changed):
package mahout.fansy.als;

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.als.AlternatingLeastSquaresSolver;
import org.apache.mahout.math.map.OpenIntObjectHashMap;

import com.google.common.collect.Lists;

import mahout.fansy.utils.read.ReadArbiKV;

public class SolveExplicitFeedbackMapperFollow_1 {
  /**
   * Imitation code for the first call of SolveExplicitFeedbackMapper,
   * using the small data set.
   * @param args
   */
  private static double lambda = 0.065;
  private static int numFeatures = 3;
  private static OpenIntObjectHashMap<Vector> UorM;
  private static AlternatingLeastSquaresSolver solver;

  public static void main(String[] args) throws IOException {
    setup();
    map();
  }

  /**
   * Reads the map input file.
   * @return
   * @throws IOException
   */
  public static Map<Writable,Writable> getMapData() throws IOException {
    String fPath = "hdfs://ubuntu:9000/test/output/userRatings/part-r-00000";
    Map<Writable,Writable> mapData = ReadArbiKV.readFromFile(fPath);
    return mapData;
  }

  /**
   * Imitates the setup function.
   */
  public static void setup() {
    solver = new AlternatingLeastSquaresSolver();
    UorM = ALSUtilsFollow.readMatrixByRows(
        new Path("hdfs://ubuntu:9000/test/temp/M--1/part-m-00000"), getConf());
  }

  public static void map() throws IOException {
    Map<Writable,Writable> map = getMapData();
    for (Iterator<Entry<Writable, Writable>> iter = map.entrySet().iterator(); iter.hasNext();) {
      Entry<Writable,Writable> entry = iter.next();
      IntWritable userOrItemID = (IntWritable) entry.getKey();
      VectorWritable ratingsWritable = (VectorWritable) entry.getValue();
      // source code copied from the mapper
      Vector ratings = new SequentialAccessSparseVector(ratingsWritable.get());
      List<Vector> featureVectors = Lists.newArrayList();
      Iterator<Vector.Element> interactions = ratings.iterateNonZero();
      while (interactions.hasNext()) {
        int index = interactions.next().index();
        featureVectors.add(UorM.get(index));
      }
      Vector uiOrmj = solver.solve(featureVectors, ratings, lambda, numFeatures);
      System.out.println(userOrItemID + "," + new VectorWritable(uiOrmj));
    }
  }

  /**
   * Gets the Configuration.
   * @return
   */
  private static Configuration getConf() {
    Configuration conf = new Configuration();
    // job tracker address, matching the -jt argument above
    conf.set("mapred.job.tracker", "ubuntu:9001");
    return conf;
  }
}

Then set a breakpoint at the solver.solve(...) line; the point of stopping here is to inspect some of the initial variable values.
userRatings:
1={101:5.0,102:3.0,103:2.5}
2={101:2.0,102:2.5,103:5.0,104:2.0}
3={101:2.5,104:4.0,105:4.5,107:5.0}
4={101:5.0,103:3.0,104:4.5,106:4.0}
5={101:4.0,102:3.0,103:2.0,104:4.0,105:3.5,106:4.0}

UorM (the first column is the item's average rating; the other columns are random values in (0,1)):
101 -> {0:3.7,1:0.8671164945911651,2:0.34569609436188886}
102 -> {0:2.833333333333333,1:0.26849761474873923,2:0.25305280900447447}
103 -> {0:3.125,1:0.03761210458127495,2:0.8249152283326323}
104 -> {0:3.625,1:0.7549644739393445,2:0.1152736727230218}
105 -> {0:4.0,1:0.12274350577015558,2:0.862849667838315}
106 -> {0:4.0,1:0.5113672636264076,2:0.5790585002437059}
107 -> {0:5.0,1:0.4732039618109546,2:0.5447453232014403}
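For reference, the gist of that initialization (average rating in feature 0, random values elsewhere, as stated above) can be sketched as follows. This is a simplified, hypothetical sketch; InitMSketch and initialRow are my own names, not the actual initializeM MapReduce code.

import java.util.Random;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class InitMSketch {
  // sketch: build one row of M for a single item
  static Vector initialRow(double averageRating, int numFeatures, Random random) {
    Vector row = new DenseVector(numFeatures);
    row.setQuick(0, averageRating);          // feature 0: the item's average rating
    for (int f = 1; f < numFeatures; f++) {
      row.setQuick(f, random.nextDouble());  // remaining features: random in (0,1)
    }
    return row;
  }

  public static void main(String[] args) {
    // item 101's ratings are 5.0, 2.0, 2.5, 5.0, 4.0 -> average 3.7
    System.out.println(initialRow(3.7, 3, new Random()));
  }
}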
user1Ratings:
{101:5.0,102:3.0,103:2.5}

user1_featureVectors: the rows of UorM for the items that appear in user1Ratings:
[{0:3.7,1:0.8671164945911651,2:0.34569609436188886},                 --> item 101
 {0:2.833333333333333,1:0.26849761474873923,2:0.25305280900447447},  --> item 102
 {0:3.125,1:0.03761210458127495,2:0.8249152283326323}]               --> item 103

Then we step into the solve function.
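Inside solve, four intermediates are built in order: MiIi (the feature vectors packed column by column), RiIiMaybeTransposed (the user's ratings as a column), Ai, and Vi. The traces below follow that order.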
user1_MiIi: user1_featureVectors transposed (rows and columns swapped); the columns correspond to items 101, 102 and 103:
[[3.7, 2.833333333333333, 3.125],
 [0.8671164945911651, 0.26849761474873923, 0.03761210458127495],
 [0.34569609436188886, 0.25305280900447447, 0.8249152283326323]]
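The packing itself is straightforward: write each feature vector into one column of a numFeatures x numItems matrix. A minimal sketch of the idea (createMiIi here is my reconstruction for illustration, not necessarily the verbatim Mahout source; the values in main are the feature vectors above, truncated):

import java.util.Arrays;
import java.util.List;
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

public class MiIiSketch {
  // sketch: pack the feature vectors as the *columns* of a
  // numFeatures x numItems matrix, i.e. the transpose of the row list
  static Matrix createMiIi(List<Vector> featureVectors, int numFeatures) {
    Matrix miIi = new DenseMatrix(numFeatures, featureVectors.size());
    int column = 0;
    for (Vector featureVector : featureVectors) {
      for (int f = 0; f < numFeatures; f++) {
        miIi.setQuick(f, column, featureVector.getQuick(f));
      }
      column++;
    }
    return miIi;
  }

  public static void main(String[] args) {
    List<Vector> featureVectors = Arrays.<Vector>asList(
        new DenseVector(new double[]{3.7, 0.867, 0.346}),    // item 101
        new DenseVector(new double[]{2.833, 0.268, 0.253}),  // item 102
        new DenseVector(new double[]{3.125, 0.038, 0.825})); // item 103
    System.out.println(createMiIi(featureVectors, 3));
  }
}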
RiIiMaybeTransposed: user1Ratings with the item IDs dropped, written as a column; dropping the item IDs this way lines the entries up with the columns of user1_MiIi:

[[5.0],   --> item 101
 [3.0],   --> item 102
 [2.5]]   --> item 103

Ai: MiIi multiplied by (the transpose of MiIi), with each diagonal entry (row = col) then updated to its original value + lambda * (the number of items rated by user1).
The transpose of MiIi: this is in fact the same as user1_featureVectors:
[[3.7, 0.8671164945911651, 0.34569609436188886],
 [2.833333333333333, 0.26849761474873923, 0.25305280900447447],
 [3.125, 0.03761210458127495, 0.8249152283326323]]

MiIi multiplied by (the transpose of MiIi), using the matrix multiplication formula (AB)ij = ai1*b1j + ai2*b2j + ... + ain*bnj:
[[31.483402777777776, 4.08661209859189, 4.573918596524476],
 [4.08661209859189, 0.8253966547288653, 0.3987296589988406],
 [4.573918596524476, 0.3987296589988406, 0.864026647737198]]
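As a spot check, entry (0,0) is the dot product of the first row of MiIi with itself: 3.7*3.7 + 2.8333*2.8333 + 3.125*3.125 = 13.69 + 8.0278 + 9.7656 ≈ 31.4834, which matches the (0,0) value above.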
The updated Ai (lambda*nui = 0.065*3 = 0.195 is added to each diagonal entry):

[[31.678402777777777, 4.08661209859189, 4.573918596524476],
 [4.08661209859189, 1.0203966547288652, 0.3987296589988406],
 [4.573918596524476, 0.3987296589988406, 1.059026647737198]]
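The diagonal update itself is just a short loop. A sketch of the idea (DiagonalUpdateSketch and the method name are my own labels for illustration, not necessarily the Mahout source; E denotes the identity matrix):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class DiagonalUpdateSketch {
  // sketch: Ai = MiIi*MiIi^T + lambda*nui*E, i.e. add lambda*nui
  // to every diagonal entry of the product matrix
  static Matrix addLambdaTimesNuiTimesE(Matrix product, double lambda, int nui) {
    for (int d = 0; d < product.numRows(); d++) {
      product.setQuick(d, d, product.getQuick(d, d) + lambda * nui);
    }
    return product;
  }

  public static void main(String[] args) {
    Matrix product = new DenseMatrix(new double[][]{
        {31.483402777777776, 4.08661209859189, 4.573918596524476},
        {4.08661209859189, 0.8253966547288653, 0.3987296589988406},
        {4.573918596524476, 0.3987296589988406, 0.864026647737198}});
    // lambda = 0.065, nui = 3 rated items -> adds 0.195 on the diagonal
    System.out.println(addLambdaTimesNuiTimesE(product, 0.065, 3));
  }
}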
Vi: MiIi multiplied by RiIiMaybeTransposed:

[[34.8125],
 [5.235105578655231],
 [4.549926969654448]]
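Again as a spot check, the first entry of Vi is the dot product of the first row of MiIi with the ratings column: 3.7*5.0 + 2.8333*3.0 + 3.125*2.5 = 18.5 + 8.5 + 7.8125 = 34.8125, which matches.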
With this, the values of Ai and Vi are fully initialized. The next post analyzes new QRDecomposition(Ai).solve(Vi).viewColumn(0) in detail.

http://blog.csdn.net/fansy1990