Python: Counting Website Visitor Source IPs
Reprints must credit the source: http://blog.csdn.net/l1028386804/article/details/79057671

1. Scenario Description
For data-source preparation, see the post "Python: Automatically Uploading Local Log Files to HDFS (Based on Hadoop 2.5.2)".
Counting visitors' source IPs gives a clearer picture of how users are distributed, and it also helps security staff trace the origin of attacks. The idea is simple: define a regular expression that matches IP addresses, use each matched string as a key with an initial value of 1, and sum the values per key in the reducer step.

2. Implementing the MapReduce Job
【/usr/local/python/source/ipstat.py】
# -*- coding:UTF-8 -*-
'''
Created on 2018-01-14
@author: liuyazhuang
'''
from mrjob.job import MRJob
import re

# Regular expression matching IPv4 addresses
IP_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

class MRCount(MRJob):
    def mapper(self, key, line):
        # For every IP matched in the line, emit key:value, where the key
        # is the IP address and the value starts at 1
        for ip in IP_RE.findall(line):
            yield ip, 1

    def reducer(self, ip, occurrences):
        yield ip, sum(occurrences)

if __name__ == '__main__':
    MRCount.run()
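The mapper/reducer pair above is essentially a word count keyed on IP addresses, so the logic can be smoke-tested without a Hadoop cluster. The sketch below (the sample log lines are made up for illustration) reproduces the same match-then-sum flow in plain Python:

```python
import re
from collections import Counter

# Same pattern as in ipstat.py
IP_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# Hypothetical access-log lines, standing in for the HDFS input
lines = [
    '10.2.2.2 - - [14/Jan/2018:10:00:01] "GET / HTTP/1.1" 200 512',
    '10.2.2.2 - - [14/Jan/2018:10:00:02] "GET /a HTTP/1.1" 200 128',
    '8.8.3.167 - - [14/Jan/2018:10:00:03] "GET /b HTTP/1.1" 404 64',
]

# "mapper": emit (ip, 1); "reducer": sum per key -- Counter does both here
counts = Counter()
for line in lines:
    for ip in IP_RE.findall(line):
        counts[ip] += 1

print(dict(counts))  # {'10.2.2.2': 2, '8.8.3.167': 1}
```

mrjob can also run the real script locally: `python ipstat.py access.log` uses the default inline runner, which is a convenient sanity check before submitting with `-r hadoop`.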
3. Submitting the MapReduce Job
Run the command:
python ipstat.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/ipstat hdfs://liuyazhuang121:9000/user/root/website.com/20180114
The log output is as follows:
[root@liuyazhuang121 source]# python ipstat.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/ipstat hdfs://liuyazhuang121:9000/user/root/website.com/20180114
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Found hadoop binary: /usr/local/hadoop-2.5.2/bin/hadoop
Using Hadoop version 2.5.2
Looking for Hadoop streaming jar in /usr/local/hadoop-2.5.2...
Found Hadoop streaming jar: /usr/local/hadoop-2.5.2/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar
Creating temp directory /tmp/ipstat.root.20180114.091040.605990
Copying local files to hdfs:///user/root/tmp/mrjob/ipstat.root.20180114.091040.605990/files/...
Running step 1 of 1...
packageJobJar: [/usr/local/hadoop-2.5.2/tmp/hadoop-unjar4828642106994965791/] [] /tmp/streamjob4775985125407933464.jar tmpDir=null
Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
Total input paths to process : 1
number of splits:2
Submitting tokens for job: job_1515893542122_0010
Submitted application application_1515893542122_0010
The url to track the job: http://liuyazhuang121:8088/proxy/application_1515893542122_0010/
Running job: job_1515893542122_0010
Job job_1515893542122_0010 running in uber mode : false
map 0% reduce 0%
map 100% reduce 0%
map 100% reduce 100%
Job job_1515893542122_0010 completed successfully
Output directory: hdfs://liuyazhuang121:9000/output/ipstat
Counters: 49
File Input Format Counters
Bytes Read=2355499
File Output Format Counters
Bytes Written=303
File System Counters
FILE: Number of bytes read=176261
FILE: Number of bytes written=657303
FILE: Number of large read operations=0
FILE: Number of read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2355749
HDFS: Number of bytes written=303
HDFS: Number of large read operations=0
HDFS: Number of read operations=9
HDFS: Number of write operations=2
Job Counters
Data-local map tasks=2
Launched map tasks=2
Launched reduce tasks=1
Total megabyte-seconds taken by all map tasks=7339008
Total megabyte-seconds taken by all reduce tasks=3062784
Total time spent by all map tasks (ms)=7167
Total time spent by all maps in occupied slots (ms)=7167
Total time spent by all reduce tasks (ms)=2991
Total time spent by all reduces in occupied slots (ms)=2991
Total vcore-seconds taken by all map tasks=7167
Total vcore-seconds taken by all reduce tasks=2991
Map-Reduce Framework
CPU time spent (ms)=3780
Combine input records=0
Combine output records=0
Failed Shuffles=0
GC time elapsed (ms)=77
Input split bytes=250
Map input records=7555
Map output bytes=154577
Map output materialized bytes=176267
Map output records=10839
Merged Map outputs=2
Physical memory (bytes) snapshot=656932864
Reduce input groups=19
Reduce input records=10839
Reduce output records=19
Reduce shuffle bytes=176267
Shuffled Maps =2
Spilled Records=21678
Total committed heap usage (bytes)=468189184
Virtual memory (bytes) snapshot=2660089856
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Streaming final output from hdfs://liuyazhuang121:9000/output/ipstat...
"10.2.2.105" 6
"10.2.2.113" 94
"10.2.2.116" 125
"10.2.2.144" 176
"10.2.2.186" 64
"10.2.2.190" 41
"10.2.2.2" 2925
"10.2.2.209" 921
"10.2.2.230" 424
"10.2.2.234" 1889
"10.2.2.24" 733
"10.2.2.250" 2018
"10.2.2.44" 40
"10.2.2.54" 1138
"10.2.2.86" 109
"10.2.2.95" 86
"10.2.2.97" 43
"8.8.3.167" 6
"9.0.6.0" 1
Removing HDFS temp directory hdfs:///user/root/tmp/mrjob/ipstat.root.20180114.091040.605990...
Removing temp directory /tmp/ipstat.root.20180114.091040.605990...
As shown above, the per-IP counts are streamed back at the end of the job log.
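One caveat about the figures above: the `\d{1,3}` pattern in the mapper accepts any dotted quadruple of 1-3 digit numbers, including octets above 255 and version-like strings that happen to appear in a log line. If that matters, each octet can be constrained to 0-255 (a sketch; the `STRICT` pattern is an illustrative alternative, not part of the original script):

```python
import re

# Pattern from ipstat.py: accepts any 1-3 digit groups, e.g. 999.999.999.999
LOOSE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# Illustrative stricter pattern: each octet limited to 0-255
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
STRICT = re.compile(r"\b" + OCTET + r"(?:\." + OCTET + r"){3}\b")

print(bool(LOOSE.search("999.999.999.999")))      # True  (false positive)
print(bool(STRICT.fullmatch("999.999.999.999")))  # False
print(bool(STRICT.fullmatch("10.2.2.250")))       # True
```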
4. Verifying the Results
Enter the command:
hadoop fs -ls /output/ipstat
The output files are listed as follows:
[root@liuyazhuang121 source]# hadoop fs -ls /output/ipstat
Found 2 items
-rw-r--r-- 1 root supergroup 0 2018-01-14 17:11 /output/ipstat/_SUCCESS
-rw-r--r-- 1 root supergroup 303 2018-01-14 17:11 /output/ipstat/part-00000
Now execute: hadoop fs -cat /output/ipstat/part-00000
The output is as follows:
[root@liuyazhuang121 source]# hadoop fs -cat /output/ipstat/part-00000
"10.2.2.105" 6
"10.2.2.113" 94
"10.2.2.116" 125
"10.2.2.144" 176
"10.2.2.186" 64
"10.2.2.190" 41
"10.2.2.2" 2925
"10.2.2.209" 921
"10.2.2.230" 424
"10.2.2.234" 1889
"10.2.2.24" 733
"10.2.2.250" 2018
"10.2.2.44" 40
"10.2.2.54" 1138
"10.2.2.86" 109
"10.2.2.95" 86
"10.2.2.97" 43
"8.8.3.167" 6
"9.0.6.0" 1
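Each line of the streaming output is a JSON-encoded key, a tab, then the count, which makes post-processing straightforward, for example ranking IPs by traffic. A sketch (the sample lines are copied from the output above):

```python
import json

# Lines in mrjob's streaming output format: "<ip>"\t<count>
sample = [
    '"10.2.2.2"\t2925',
    '"8.8.3.167"\t6',
    '"10.2.2.250"\t2018',
]

pairs = []
for line in sample:
    key, value = line.rstrip("\n").split("\t")
    pairs.append((json.loads(key), int(value)))

# Sort descending by hit count
pairs.sort(key=lambda p: p[1], reverse=True)
print(pairs)  # [('10.2.2.2', 2925), ('10.2.2.250', 2018), ('8.8.3.167', 6)]
```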