网络智能和大数据公开课Homework3 Map-Reduce编程


hw3data.zip
Each file in this archive contains entries that look like:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.
that represent bibliographic information about publications, formatted as follows:
paper-id:::author1::author2::…. ::authorN:::title
Your task is to compute how many times every term occurs across titles, for each author.
For example, the author Alberto Pettorossi the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2.
Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that ‘terms’ must exclude common stop-words, such as prepositions etc. For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single letter words, such as "a" can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only alphabets and numbers can be part of a title term: Thus, “program” and “program.” should both be counted as the term ‘program’, and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms.
The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python.
These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively.
I strongly recommend mincemeat.py which is much faster than Octo,py even though the latter was covered first in the lecture video as an example. Both are very similar.
Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3 where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author.

输入范例如下:
conf/fc/KravitzG99:::David W. Kravitz::David M. Goldschlag:::Conditional Access Concepts and Principles.
conf/fc/Moskowitz01:::Scott Moskowitz:::A Solution to the Napster Phenomenon: Why Value Cannot Be Created Absent the Transfer of Subjective Data.
conf/fc/BellareNPS01:::Mihir Bellare::Chanathip Namprempre::David Pointcheval::Michael Semanko:::The Power of RSA Inversion Oracles and the Security of Chaum's RSA-Based Blind Signature Scheme.
conf/fc/Kocher98:::Paul C. Kocher:::On Certificate Revocation and Validation.
conf/ep/BertiDM98:::Laure Berti::Jean-Luc Damoiseaux::Elisabeth Murisasco:::Combining the Power of Query Languages and Search Engines for On-line Document and Information Retrieval : The QIRi@D Environment.
conf/ep/LouS98:::Qun Lou::Peter Stucki:::Funfamentals of 3D Halftoning.
conf/ep/Mather98:::Laura A. Mather:::A Linear Algebra Approach to Language Identification.
conf/ep/BallimCLV98:::Afzal Ballim::Giovanni Coray::A. Linden::Christine Vanoirbeek:::The Use of Automatic Alignment on Structured Multilingual Documents.
conf/ep/ErdenechimegMN98:::Myatav Erdenechimeg::Richard Moore::Yumbayar Namsrai:::On the Specification of the Display of Documents in Multi-lingual Computing.
conf/ep/VercoustreP98:::Anne-Marie Vercoustre::François Paradis:::Reuse of Linked Documents through Virtual Document Prescriptions.
conf/ep/CruzBMW98:::Isabel F. Cruz::Slava Borisov::Michael A. Marks::Timothy R. Webb:::Measuring Structural Similarity Among Web Documents: Preliminary Results.
conf/er/Hohenstein89:::Uwe Hohenstein:::Automatic Transformation of an Entity-Relationship Query Language into SQL.
conf/er/NakanishiHT01:::Yoshihiro Nakanishi::Tatsuo Hirose::Katsumi Tanaka:::Modeling and Structuring Multiple Perspective Video for Browsing.
conf/er/Sciore91:::Edward Sciore:::Abbreviation Techniques in Entity-Relationship Query Languages.
conf/er/Chen79:::Peter P. Chen:::Recent Literature on the Entity-Relationship Approach.



进行处理时,需要开两个客户端。使用的命令分别是:

python mincemeat.py -p pwd localhost
python hw3.py
hw3.py的code为:
import glob
import mincemeat
import operator



all_filepaths = glob.glob('hw3data/*')

def file_contents(filename):
        f = open(filename)
        try:
                return f.read()
        finally:
                f.close()

datasource = dict((filename,file_contents(filename)) for filename in all_filepaths)



def my_mapper(key,value):
   from stopwords import allStopWords
   import re
   for line in value.splitlines():
        allThree=line.split(':::')
        for author in allThree[1].split('::'):
                for word in re.sub(r'([^\s\t0-9a-zA-Z-])+', '',allThree[2]).split():
                        tmpWord=word.strip().lower()
                        if len(tmpWord)<=1 or tmpWord in allStopWords:
                                continue
                        yield (author,tmpWord),1
                
        

def my_reducer(key,value):
   result=sum(value)
   return result


s = mincemeat.Server()
s.datasource = datasource
s.mapfn = my_mapper
s.reducefn = my_reducer

results = s.run_server(password="pwd")
print results

resList=[(x[0],x[1],results[x]) for x in results.keys()]
sorted_results = sorted(resList, key=operator.itemgetter(0,2))



with open('output.txt','w') as f:
        for (a,b,c) in sorted_results:
                f.write(a+' *** '+b+' *** '+str(c)+'\n')
                

输出的结果范例如下:
Stephen L. Bloom *** scalar *** 1
Stephen L. Bloom *** concatenation *** 1
Stephen L. Bloom *** point *** 1
Stephen L. Bloom *** varieties *** 1
Stephen L. Bloom *** observation *** 1
Stephen L. Bloom *** equivalence *** 1
Stephen L. Bloom *** axioms *** 1
Stephen L. Bloom *** languages *** 1
Stephen L. Bloom *** logical *** 1
Stephen L. Bloom *** algebras *** 1
Stephen L. Bloom *** equations *** 1
Stephen L. Bloom *** number *** 1
Stephen L. Bloom *** vector *** 1
Stephen L. Bloom *** polynomial *** 1
Stephen L. Bloom *** solving *** 1
Stephen L. Bloom *** equational *** 1
Stephen L. Bloom *** axiomatizing *** 1
Stephen L. Bloom *** characterization *** 1
Stephen L. Bloom *** regular *** 2
Stephen L. Bloom *** sets *** 2
Stephen L. Bloom *** iteration *** 3
Stephen L. Lieman *** unacceptable *** 1
Stephen L. Lieman *** correcting *** 1
Stephen L. Lieman *** never *** 1
Stephen L. Lieman *** powerful *** 1
Stephen L. Lieman *** accept *** 1



相关内容

    暂无相关文章