mahout源码分析之logistic regression（1）--实战

文章由LinuxBoy分享于2019-03-27 03:03:18热评（56）

mahout源码分析之logistic regression（1）--实战

版本：mahout0.9

Mahout里面使用逻辑回归（logistic regression）的主要两个类是org.apache.mahout.classifier.sgd.TrainLogistic、org.apache.mahout.classifier.sgd.RunLogistic，一个是建立模型，一个是进行模型评估。

首先是原始数据，格式如下：（可以在https://github.com/dirkweissenborn/mahout-rbmClassifier/blob/master/examples/src/main/resources/donut.csv#L1下载）

"x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,0.0124828536260896,0.000182782669907495,0.923406490600458,0.0778750292332978,0.644866125183976,1
0.711011884035543,0.909141522599384,22,2,3,9,0.505537899239772,0.64641042683833,0.826538308114327,1.15415605849213,0.953966686673604,0.46035073663368,1
0.75118898646906,0.836567111080512,23,2,3,9,0.564284893392414,0.62842000028592,0.699844531341594,1.12433510339845,0.872783737128441,0.419968245447719,1

进入mahout的bin目录，运行：

./mahout trainlogistic --input /data/mahout-data/donut.csv --output /data/mahout-output/model2 --target color --categories 2 --predictors x y a b c --types numeric --features 20 --passes 100 --rate 50

这里各个参数说明如下：

input：输入数据；output：输出模型文件；--target 预测的变量（输入数据要求第一行为变量名称）；categories 预测变量的取值个数；predictors参与建模的变量；types 预测变量的类型（number、word、text其中一个，如果全部是一样的话，使用一个就可以）；pass训练的时候对输入数据测试的次数（这里也不是很清楚）；feature内部随机向量维度（用于建模，好像是这样理解，越大越好，但是时间会长）；rate学习速率（如果输入数据比较大，此值可以设置大点）。

得到下面的输出：

Running on hadoop, using /opt/hadoop2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout-distribution-0.9/examples/target/mahout-examples-0.9-job.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop2/share/hadoop/mapreduce/lib/mahout-core-0.9-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20
color ~ 
7.068*Intercept Term + 0.581*a + -1.369*b + -25.059*c + 0.581*x + 2.319*y
      Intercept Term 7.06759
                   a 0.58123
                   b -1.36893
                   c -25.05945
                   x 0.58123
                   y 2.31879
    0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -1.368933989     0.000000000     0.000000000     0.000000000     0.000000000     0.581234210     0.000000000     0.000000000     7.067587159     0.000000000     0.000000000     0.000000000     2.318786209     0.000000000   -25.059452292 
14/04/11 10:33:18 INFO driver.MahoutDriver: Program took 1758 ms (Minutes: 0.0293)

我这里有slf jar包的冲突，暂时不理这个。看后面的公式即可（公式变量前的值，每次训练不一定相同），应该是由这个公式算得最后的预测结果的，但是暂时不清楚Intercept是什么。

然后使用模型评估命令（测试数据：https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut-test.csv）：

 ./mahout runlogistic --input /data/mahout-data/donut-test.csv --model /data/mahout-output/model2 --scores --auc --confusion

input就是测试数据；model是模型文件；scores打印预测值和原始值对比；auc打印auc值（评判主要标准，越大越好，最好接近1）；confusion打印模糊矩阵；

得到下面的结果：

Running on hadoop, using /opt/hadoop2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout-distribution-0.9/examples/target/mahout-examples-0.9-job.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop2/share/hadoop/mapreduce/lib/mahout-core-0.9-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
"target","model-output","log-likelihood"
0,0.009,-0.009241
0,0.000,-0.000481
1,0.985,-0.015038
1,0.991,-0.009407
0,0.001,-0.000883
1,0.974,-0.026000
1,0.823,-0.194875
0,0.041,-0.042015
0,0.051,-0.052565
0,0.613,-0.950008
0,0.147,-0.158538
1,0.910,-0.094177
1,0.252,-1.377220
1,0.924,-0.078521
1,0.998,-0.001777
0,0.023,-0.023756
1,0.990,-0.009928
0,0.003,-0.003118
1,0.961,-0.039284
0,0.000,-0.000046
0,0.167,-0.183160
0,0.049,-0.049822
0,0.006,-0.005792
0,0.706,-1.222487
0,0.000,-0.000421
1,0.999,-0.001045
1,0.969,-0.031452
0,0.034,-0.034088
0,0.370,-0.461632
0,0.011,-0.011489
0,0.465,-0.624971
0,0.053,-0.054646
0,0.340,-0.414959
0,0.053,-0.054123
0,0.007,-0.006800
0,0.248,-0.285650
1,0.482,-0.728835
0,0.781,-1.516960
0,0.024,-0.023975
0,0.022,-0.022281
AUC = 0.97
confusion: [[24.0, 2.0], [3.0, 11.0]]
entropy: [[-0.2, -2.8], [-4.1, -0.1]]
14/04/11 10:43:39 INFO driver.MahoutDriver: Program took 414 ms (Minutes: 0.0069)

可以看到auc=0.97 说明模型还是比较好的；模糊矩阵中说明有2个应该被分为1的被分为了0，有3个应该是0的结果被分为了1。

本来打算使用上面得到的公式带入测试数据，看能否得到第一行的输出，比如0.009，但是不知道哪个Interceptor值是什么，所以也是没有得到0.009的。大概浏览了下源码，好像要归一化的。具体下次在分析。

总结：

目前遇到的问题有：1）如何使用上面的公式（Interceptor是什么？）；2）如何把这个在hadoop上面运行起来（从上面的结果来看，似乎mahout并没有运行在hadoop上面）。

分享，成长，快乐

转载请注明blog地址：http://blog.csdn.net/fansy1990

推荐文章：

mahout源码分析之logistic regression（1）--实战