Python: Counting Website Visitor Source IPs


Please credit the source when reposting: http://blog.csdn.net/l1028386804/article/details/79057671

1. Scenario

For preparing the data source, see the earlier post "Python: Automatically Uploading Local Log Files to HDFS (Based on Hadoop 2.5.2)".

Counting visitors' source IPs gives a clearer picture of where users are distributed, and it also helps security staff trace the origin of attacks. The approach is simple: define a regex that matches IP addresses, emit each matched string as a key with an initial value of 1, and sum the values per key in the reducer.
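The principle can be sketched in plain Python without Hadoop at all. The sample log lines below are hypothetical stand-ins for the real access log on HDFS:

```python
import re
from collections import Counter

# Same dotted-quad regex the MapReduce job uses
IP_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# Hypothetical access-log lines standing in for the HDFS input
lines = [
    '10.2.2.2 - - [14/Jan/2018:10:00:01] "GET / HTTP/1.1" 200 612',
    '10.2.2.2 - - [14/Jan/2018:10:00:02] "GET /a.html HTTP/1.1" 200 120',
    '10.2.2.54 - - [14/Jan/2018:10:00:03] "GET /b.html HTTP/1.1" 404 0',
]

counts = Counter()
for line in lines:
    for ip in IP_RE.findall(line):  # map step: emit (ip, 1)
        counts[ip] += 1             # reduce step: sum per key

print(dict(counts))  # {'10.2.2.2': 2, '10.2.2.54': 1}
```

The MapReduce job below does exactly this, except the map and reduce steps run on separate cluster nodes and the shuffle phase groups the `(ip, 1)` pairs by key.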

2. Implementing the MapReduce Job

【/usr/local/python/source/ipstat.py】

# -*- coding:UTF-8 -*-
'''
Created on Jan 14, 2018

@author: liuyazhuang
'''

from mrjob.job import MRJob
import re

# Regex matching dotted-quad IP addresses
IP_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

class MRCount(MRJob):

    def mapper(self, key, line):
        # Emit (ip, 1) for every IP address found in the line
        for ip in IP_RE.findall(line):
            yield ip, 1

    def reducer(self, ip, occurrences):
        # Sum the 1s emitted for each IP
        yield ip, sum(occurrences)


if __name__ == '__main__':
    MRCount.run()
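One caveat worth noting: `\d{1,3}` also matches out-of-range octets, so a string like `999.999.999.999` would be counted as an IP. If the logs may contain such noise, a stricter pattern (a sketch, not part of the original job) limits each octet to 0–255:

```python
import re

# The pattern used by the job: accepts any 1-3 digit octets
IP_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
print(IP_RE.findall("from 999.999.999.999"))  # ['999.999.999.999']

# Each octet restricted to 0-255; \b avoids matching inside longer numbers
STRICT_IP_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
    r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b"
)
print(STRICT_IP_RE.findall("from 999.999.999.999"))  # []
print(STRICT_IP_RE.findall("from 10.2.2.2 ok"))      # ['10.2.2.2']
```

For typical web-server logs the simple pattern is usually good enough, since the source-IP field is written by the server itself.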

3. Submitting the MapReduce Job

Run the following command:

 python ipstat.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/ipstat hdfs://liuyazhuang121:9000/user/root/website.com/20180114
The log output is as follows:

[root@liuyazhuang121 source]# python ipstat.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/ipstat hdfs://liuyazhuang121:9000/user/root/website.com/20180114
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Found hadoop binary: /usr/local/hadoop-2.5.2/bin/hadoop
Using Hadoop version 2.5.2
Looking for Hadoop streaming jar in /usr/local/hadoop-2.5.2...
Found Hadoop streaming jar: /usr/local/hadoop-2.5.2/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar
Creating temp directory /tmp/ipstat.root.20180114.091040.605990
Copying local files to hdfs:///user/root/tmp/mrjob/ipstat.root.20180114.091040.605990/files/...
Running step 1 of 1...
  packageJobJar: [/usr/local/hadoop-2.5.2/tmp/hadoop-unjar4828642106994965791/] [] /tmp/streamjob4775985125407933464.jar tmpDir=null
  Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
  Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
  Total input paths to process : 1
  number of splits:2
  Submitting tokens for job: job_1515893542122_0010
  Submitted application application_1515893542122_0010
  The url to track the job: http://liuyazhuang121:8088/proxy/application_1515893542122_0010/
  Running job: job_1515893542122_0010
  Job job_1515893542122_0010 running in uber mode : false
   map 0% reduce 0%
   map 100% reduce 0%
   map 100% reduce 100%
  Job job_1515893542122_0010 completed successfully
  Output directory: hdfs://liuyazhuang121:9000/output/ipstat
Counters: 49
        File Input Format Counters 
                Bytes Read=2355499
        File Output Format Counters 
                Bytes Written=303
        File System Counters
                FILE: Number of bytes read=176261
                FILE: Number of bytes written=657303
                FILE: Number of large read operations=0
                FILE: Number of read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2355749
                HDFS: Number of bytes written=303
                HDFS: Number of large read operations=0
                HDFS: Number of read operations=9
                HDFS: Number of write operations=2
        Job Counters 
                Data-local map tasks=2
                Launched map tasks=2
                Launched reduce tasks=1
                Total megabyte-seconds taken by all map tasks=7339008
                Total megabyte-seconds taken by all reduce tasks=3062784
                Total time spent by all map tasks (ms)=7167
                Total time spent by all maps in occupied slots (ms)=7167
                Total time spent by all reduce tasks (ms)=2991
                Total time spent by all reduces in occupied slots (ms)=2991
                Total vcore-seconds taken by all map tasks=7167
                Total vcore-seconds taken by all reduce tasks=2991
        Map-Reduce Framework
                CPU time spent (ms)=3780
                Combine input records=0
                Combine output records=0
                Failed Shuffles=0
                GC time elapsed (ms)=77
                Input split bytes=250
                Map input records=7555
                Map output bytes=154577
                Map output materialized bytes=176267
                Map output records=10839
                Merged Map outputs=2
                Physical memory (bytes) snapshot=656932864
                Reduce input groups=19
                Reduce input records=10839
                Reduce output records=19
                Reduce shuffle bytes=176267
                Shuffled Maps =2
                Spilled Records=21678
                Total committed heap usage (bytes)=468189184
                Virtual memory (bytes) snapshot=2660089856
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
Streaming final output from hdfs://liuyazhuang121:9000/output/ipstat...
"10.2.2.105"    6
"10.2.2.113"    94
"10.2.2.116"    125
"10.2.2.144"    176
"10.2.2.186"    64
"10.2.2.190"    41
"10.2.2.2"      2925
"10.2.2.209"    921
"10.2.2.230"    424
"10.2.2.234"    1889
"10.2.2.24"     733
"10.2.2.250"    2018
"10.2.2.44"     40
"10.2.2.54"     1138
"10.2.2.86"     109
"10.2.2.95"     86
"10.2.2.97"     43
"8.8.3.167"     6
"9.0.6.0"       1
Removing HDFS temp directory hdfs:///user/root/tmp/mrjob/ipstat.root.20180114.091040.605990...
Removing temp directory /tmp/ipstat.root.20180114.091040.605990...
As the log shows, the aggregated per-IP counts are printed at the end of the run.
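The framework counters also give a quick consistency check: since every `(ip, 1)` pair emitted by the mappers reaches exactly one reducer group, the 19 per-IP totals streamed at the end should add up to `Map output records`:

```python
# The 19 counts from the streamed output above
counts = [6, 94, 125, 176, 64, 41, 2925, 921, 424, 1889,
          733, 2018, 40, 1138, 109, 86, 43, 6, 1]

print(len(counts))  # 19    -> matches "Reduce output records=19"
print(sum(counts))  # 10839 -> matches "Map output records=10839"
```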

4. Verifying the Results

Run the command:

 hadoop fs -ls /output/ipstat
The output files are listed as follows:

[root@liuyazhuang121 source]# hadoop fs -ls /output/ipstat
Found 2 items
-rw-r--r--   1 root supergroup          0 2018-01-14 17:11 /output/ipstat/_SUCCESS
-rw-r--r--   1 root supergroup        303 2018-01-14 17:11 /output/ipstat/part-00000
Then print the result file:
 hadoop fs -cat /output/ipstat/part-00000
The output is as follows:

[root@liuyazhuang121 source]# hadoop fs -cat /output/ipstat/part-00000
"10.2.2.105"    6
"10.2.2.113"    94
"10.2.2.116"    125
"10.2.2.144"    176
"10.2.2.186"    64
"10.2.2.190"    41
"10.2.2.2"      2925
"10.2.2.209"    921
"10.2.2.230"    424
"10.2.2.234"    1889
"10.2.2.24"     733
"10.2.2.250"    2018
"10.2.2.44"     40
"10.2.2.54"     1138
"10.2.2.86"     109
"10.2.2.95"     86
"10.2.2.97"     43
"8.8.3.167"     6
"9.0.6.0"       1
