Automatically Downloading All Your Renren Friends' Albums with Python
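A while ago I wrote some Python code that automatically grabbed my own Renren albums, but its only use seemed to be backing up my own photos. So today I modified that Renren-specific crawler and added the ability to automatically fetch all of my friends, visit their pages, and download all of their albums (well suited to those of you with lots of pretty friends..)...
The article I posted yesterday had lots of tags and ended up too long, so, tragically, Tencent refused to accept it when I tried to edit it, XXXXX (ten thousand words omitted...)
Renren is something very similar to Facebook.... Why so similar? Chinese characteristics....
On to the main topic. I'm writing this down because I'm afraid I'll forget it later...
OK, first up: terminology.
What on earth is a crawler?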
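According to Baidu Baike: "A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. ....... A traditional crawler starts from the URLs of one or several seed pages, obtains the URLs on those pages, and while fetching pages keeps extracting new URLs from the current page and adding them to a queue, until some stop condition of the system is met."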
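I've specialized mine for Renren (in other words, it's useless on any other site). To visit Renren you first need an account, which means you have to log in first; the server can then use the session or a cookie to tell whether you are logged in on other pages, and for Renren the cookie is enough. Of course, after logging in with one browser we may still have to log in again in another, because browsers normally don't share cookies, unless one specifically reads another browser's cookies. So to crawl Renren, the crawler first has to log in..... and then keep the cookie.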
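As a rough illustration of "log in, then keep the cookie", here is a minimal sketch in Python 3 (the listings in this article are Python 2). It only builds the cookie-aware opener and the login request without sending anything. The helper names `build_session` and `make_login_request` are made up for this example, the account values are placeholders, and only two of the form fields used by the real listing are shown.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session():
    """Return (opener, jar); every request made through opener shares jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def make_login_request(login_url, email, password):
    """Build (but do not send) a form-encoded POST request."""
    form = urllib.parse.urlencode({"email": email, "password": password})
    return urllib.request.Request(login_url, data=form.encode("utf-8"))


opener, jar = build_session()
request = make_login_request("http://www.renren.com/PLogin.do",
                             "user@example.com", "secret")
# opener.open(request) would send the POST; any Set-Cookie headers in the
# response would land in jar and be replayed on later opener.open() calls.
```

Because the jar lives inside the opener, every later page fetch made through the same opener carries the login cookie automatically.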
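Browsers and servers talk mainly over the HTTP protocol, whose main methods are GET and POST (according to Computer Systems: A Programmer's Perspective, GET accounts for 99% of HTTP requests). GET sends fairly short data to the server, mostly by writing the parameters into the URL, while POST can send longer data; publishing this article, for example, uses POST. We can run "telnet www.google.com 80" and then type "GET /" to receive the same thing we get by entering "http://www.google.com/" in a browser. A crawler works the same way: it just keeps doing GET, POST...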
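To make the difference concrete, the following Python 3 sketch builds the raw text of a GET and of a POST request for the same parameters. The helper names `build_get` and `build_post`, the host `example.com`, and the path `/search` are all invented for illustration; nothing is sent over the network.

```python
from urllib.parse import urlencode


def build_get(host, path, params):
    """GET: the parameters travel in the URL's query string."""
    query = urlencode(params)
    return ("GET {}?{} HTTP/1.1\r\n"
            "Host: {}\r\n"
            "Connection: close\r\n\r\n").format(path, query, host)


def build_post(host, path, params):
    """POST: the parameters travel in the request body."""
    body = urlencode(params)
    return ("POST {} HTTP/1.1\r\n"
            "Host: {}\r\n"
            "Content-Type: application/x-www-form-urlencoded\r\n"
            "Content-Length: {}\r\n"
            "Connection: close\r\n\r\n{}").format(
                path, host, len(body.encode("utf-8")), body)


get_text = build_get("example.com", "/search", {"q": "renren"})
post_text = build_post("example.com", "/search", {"q": "renren"})
```

This is the same text you would type into the telnet session by hand; libraries such as urllib simply generate it for you.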
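There are two ways to grab every visible album of every friend. One is to download by hand, friend by friend and album by album; the other is to hand the job to the computer and let it crawl by itself.... Since I'm lazy, I chose the second.
And here comes the "to do X, first do Y" sentence pattern again~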
To get all friends, visit http://friend.renren.com/myfriendlistx.do while logged in. In a browser, JavaScript splits the friends across many pages. One of the scripts in the page has a variable named friends that holds the information for all friends, as records like {"id":254905709,"vip":false,"selected":true,"mo":true,"name":"\u5b89\u8feaAndy","head":"http:\/\/hdn.xnimg.cn\/photos\/hdn321\/20110612\/1600\/h_tiny_zFLc_715e000281932f76.jpg","groups":["\u534e\u5357\u7406\u5de5\u5927\u5b66"]}. From these we can get the id of every friend.
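Since the friends variable holds a JSON array, one way to read it is to cut the array out with a regular expression and hand it to a JSON parser, which also decodes the \uXXXX escapes. The Python 3 sketch below runs on a small sample string rather than the live page. The name `parse_friends` is made up, the first record is a shortened copy of the one above, and the second record is invented so the list has more than one entry.

```python
import json
import re

# Sample page fragment (see the assumptions stated above).
SAMPLE_PAGE = r'''
<script>
var friends=[{"id":254905709,"vip":false,"name":"\u5b89\u8feaAndy","groups":["\u534e\u5357\u7406\u5de5\u5927\u5b66"]},{"id":1,"vip":false,"name":"Example","groups":[]}];
</script>
'''

# Capture the array literal assigned to the friends variable.
FRIENDS_PATTERN = re.compile(r'var friends=(\[.*?\]);', re.S)


def parse_friends(page):
    """Return a list of (id, name) pairs found in the friends variable."""
    found = FRIENDS_PATTERN.search(page)
    if found is None:
        return []
    return [(f["id"], f["name"]) for f in json.loads(found.group(1))]


friends = parse_friends(SAMPLE_PAGE)
```

The listing in this article pulls the id and name out with a second regular expression instead; a JSON parser is just a little more forgiving about field order and escapes.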
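To get all of someone's albums, visit http://www.renren.com/profile.do?id=(the person's id)&v=photo_ajax&undefined. How did I find this? When we open someone's home page and click Albums, the page is not reloaded; AJAX just replaces part of it. It does a GET on that path, which returns a fragment of HTML that replaces the current content. So we can GET that path ourselves and obtain a page containing all the album ids.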
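Getting all the photos in an album relies on a Renren bug that I found quite by accident: you can open the reorder-photos page of someone else's album. The reorder page lists every photo in the album, so a regular expression gives us the id of each photo. The reorder page is http://photo.renren.com/photo/(the person's id)/album-(album id)/reorder.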
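A small Python 3 sketch of that step: pull the photo ids out of a reorder page and turn each into the URL of its photo page. The sample HTML is invented (only the `<li pid="...">` shape is taken from the listing in this article), and `photo_page_urls` is a made-up helper name.

```python
import re

# Invented sample of what a reorder page might contain.
SAMPLE_REORDER_HTML = '''
<ul>
  <li pid="1001" class="photo"><img src="a.jpg"/></li>
  <li pid="1002" class="photo"><img src="b.jpg"/></li>
</ul>
'''

PID_PATTERN = re.compile(r'<li\s+pid="(\d+)"')


def photo_page_urls(user_id, reorder_html):
    """Return the photo page URL for every photo id on a reorder page."""
    return ["http://photo.renren.com/photo/%s/photo-%s" % (user_id, pid)
            for pid in PID_PATTERN.findall(reorder_html)]


urls = photo_page_urls("42", SAMPLE_REORDER_HTML)
```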
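After three "to do X, first do Y" sentences, we have the ids of all friends, the ids of all their albums, and the ids of all the photos in those albums. Why is everything an id? My guess is that using an integer as the primary key of a database tuple performs better, and a 32-bit integer takes only 4 bytes yet can identify 4,294,967,296 things. Passing ids between client and server is also convenient.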
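What can we do with these ids? Nothing yet. If we visit http://photo.renren.com/photo/(the person's id)/photo-(photo id), the page source contains a piece of code returned by AJAX with the line "largeurl":"http:\/\/fmn.rrimg.com\/fmn049\/20110621\/1520\/p_large_S5jA_37eb000165dc5c3f.jpg". This is the real address of the photo; delete the "\" characters from it and it can be downloaded.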
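Here is a Python 3 sketch of that last step, run on a sample string built from the line quoted above. Instead of deleting every backslash, it lets the JSON parser decode the string literal, which turns `\/` into `/`; that is a swapped-in variation, not what the listing in this article does. The name `large_urls` is made up.

```python
import json
import re

SAMPLE_PHOTO_PAGE = (
    r'{"largeurl":"http:\/\/fmn.rrimg.com\/fmn049\/20110621\/1520'
    r'\/p_large_S5jA_37eb000165dc5c3f.jpg"}'
)

# Capture the whole JSON string literal that follows "largeurl":
LARGE_URL_PATTERN = re.compile(r'"largeurl"\s*:\s*("(?:[^"\\]|\\.)*")')


def large_urls(html):
    """Return every largeurl value in the page, with JSON escapes decoded."""
    return [json.loads(literal) for literal in LARGE_URL_PATTERN.findall(html)]


photo_urls = large_urls(SAMPLE_PHOTO_PAGE)
```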
Related file download:
Free download at http://linux.bkjia.com/
The user name and password are both www.bkjia.com
The files are in the directory /pub/2011/08/25/Python自动下载人人所有好友的相册/
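OK, so that's how we can write an incomplete crawler.......... For Renren's news feed, you could extract the URLs from a page, filter them, and put them into a priority queue; then take the best one from the priority queue, visit it, and repeat the previous step until the queue is empty or some other condition holds.... er, the legendary Chinese pseudocode....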
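That pseudocode might look roughly like the Python 3 sketch below. It is not part of the program in this article: `crawl`, `fetch_links`, and `score` are hypothetical names, and a tiny in-memory link graph stands in for real pages so the example runs without a network.

```python
import heapq


def crawl(start_urls, fetch_links, score, max_pages=100):
    """Visit pages best-first and return the URLs in the order visited.

    fetch_links(url) returns the (already filtered) links found on a page;
    score(url) returns a priority, where higher means "visit sooner".
    """
    # heapq is a min-heap, so store negated scores to pop the best URL first.
    frontier = [(-score(url), url) for url in start_urls]
    heapq.heapify(frontier)
    seen = set(start_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Invented link graph and scores, for demonstration only.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
SCORES = {"a": 1, "b": 1, "c": 5, "d": 3}
order = crawl(["a"], lambda url: GRAPH.get(url, []), SCORES.get)
```

The `seen` set keeps a URL from being queued twice, and `max_pages` plays the role of the "some other condition" that stops the loop.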
The program works correctly under Ubuntu 11.04; on Windows the output may contain garbled characters....
# -*- coding: utf-8 -*-
# Filename: main.py
# Author: 华亮
#
from Renren import SuperRenren
import time  # only used by the commented-out PostMsg examples below


def main():
    renren = SuperRenren()
    # Replace the two placeholders with a real Renren account and password
    if renren.Create('your-renren-account', 'your-renren-password'):
        #renren.PostMsg(time.asctime())
        #renren.PostGroupMsg('387635422', '%s' % time.asctime())
        #renren.DownloadAlbum('333982368', 'sss')
        renren.DownloadAllFriendsAlbums()


if __name__ == '__main__':
    main()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# -*- coding: utf-8 -*-
# Filename: Renren.py
# Author: 华亮
#
from HTMLParser import HTMLParser
from Queue import Empty
from Queue import Queue
from re import match
from sys import exit
from urllib import urlencode
import os
import re
import socket
import threading
import time
import urllib
import urllib2
import shelve

# Lock that serializes console output across threads
GlobalPrintMutex = threading.Lock()
# Lock that serializes writes to config.cfg
GlobalWriteConfigMutex = threading.Lock()
# Lock that protects the shelve file holding the last-update data
GlobalShelveMutex = threading.Lock()
# Path separator for the current platform
Delimiter = '/' if os.name == 'posix' else '\\'
ConfigFilename = 'config.cfg'            # per album: ids of the photos already downloaded
LastUpdatedFileName = 'lastupdated.cfg'  # last update time of every friend
UpdateThreashold = 10 * 60               # update interval, in seconds
# Thread-safe print
def MutexPrint(content):
    with GlobalPrintMutex:
        print content


# Thread-safe write followed by a flush
def MutexWriteFile(file, content):
    with GlobalWriteConfigMutex:
        file.write(content)
        file.flush()


# Turn literal \uXXXX escapes in a byte string into real characters
def Str2Uni(s):
    pat = re.compile(r'\\u([0-9a-fA-F]{4})')
    # Substitute in place so that plain text between the escapes
    # (the "Andy" in "\u5b89\u8feaAndy") is kept rather than dropped
    return pat.sub(lambda m: unichr(int(m.group(1), 16)), s.decode('utf-8'))
#------------------------------------------------------------------------------
# Worker thread that downloads the files taken from a queue
class Downloader(threading.Thread):
    def __init__(self, urlQueue, failedQueue, file=None):
        threading.Thread.__init__(self)
        self.queue = urlQueue
        self.failedQueue = failedQueue
        self.file = file    # open config file that records finished photo ids

    def run(self):
        while True:
            try:
                # get_nowait() raises Empty instead of blocking forever when
                # another thread takes the last item first
                pid, url, filename = self.queue.get_nowait()
            except Empty:
                break
            try:
                isfile = os.path.isfile(filename)
                MutexPrint(("\tDownloading %s" if not isfile else "\tExists %s") % filename.decode('utf-8'))
                if not isfile: urllib.urlretrieve(url, filename.decode('utf-8'))
                MutexWriteFile(self.file, pid + '\n')
            except Exception, e:
                # Record the failure and carry on with the next photo
                self.failedQueue.put(pid)
                MutexPrint('\tError occurred when downloading photo with id = %s' % pid)
                MutexPrint(e)
#------------------------------------------------------------------------------
# Parser for the Renren album list
class RenrenAlbums(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # Per-instance state; a class-level list would be shared by all parsers
        self.in_key_div = False
        self.in_ul = False
        self.in_li = False
        self.in_a = False
        self.albumsUrl = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and 'class' in attrs and attrs['class'] == 'big-album album-list clearfix':
            self.in_key_div = True
        elif self.in_key_div:
            if tag == 'ul':
                self.in_ul = True
            elif self.in_ul and tag == 'li':
                self.in_li = True
            # A link inside a list item of the album div is an album url
            if self.in_li and tag == 'a' and 'href' in attrs:
                self.in_a = True
                self.albumsUrl.append(attrs['href'])

    def handle_data(self, data):
        pass

    def handle_endtag(self, tag):
        if self.in_key_div and tag == 'div':
            self.in_key_div = False
        elif self.in_ul and tag == 'ul':
            self.in_ul = False
        elif self.in_li and tag == 'li':
            self.in_li = False
        elif self.in_a and tag == 'a':
            self.in_a = False
class RenrenRequester:
    '''
    Renren requester: logs in, then sends requests with the login cookie
    '''
    LoginUrl = 'http://www.renren.com/PLogin.do'
    __isLogin = False   # stays False until Create() succeeds

    # Log in with a user name and password
    def Create(self, username, password):
        loginData = {'email':username,
                     'password':password,
                     'origURL':'',
                     'formName':'',
                     'method':'',
                     'isplogin':'true',
                     'submit':'登录'}
        postData = urlencode(loginData)
        # The cookie processor keeps the login cookie for later requests
        cookieFile = urllib2.HTTPCookieProcessor()
        self.opener = urllib2.build_opener(cookieFile)
        req = urllib2.Request(self.LoginUrl, postData)
        result = self.opener.open(req)
        # A successful login ends up on one of these two pages. (The original
        # test "x == a or b" was always true, so it never rejected anything.)
        if result.geturl() not in ('http://www.renren.com/home', 'http://guide.renren.com/guide'):
            return False
        rawHtml = result.read()
        # Get the user id
        useridPattern = re.compile(r'user : {"id" : (\d+?)}')
        useridMatch = useridPattern.search(rawHtml)
        if useridMatch is None: return False
        self.userid = useridMatch.group(1)
        # Find the requestToken
        pos = rawHtml.find("get_check:'")
        if pos == -1: return False
        rawHtml = rawHtml[pos + 11:]
        # The token is an integer that may be negative
        token = match(r'-?\d+', rawHtml)
        if token is None: return False
        self.requestToken = token.group()
        self.__isLogin = True
        return self.__isLogin

    def GetRequestToken(self):
        return self.requestToken

    def GetUserId(self):
        return self.userid

    # GET url, or POST to it when data (a dict) is given
    def Request(self, url, data = None):
        if self.__isLogin:
            if data:
                encodeData = urlencode(data)
                request = urllib2.Request(url, encodeData)
            else:
                request = urllib2.Request(url)
            result = self.opener.open(request)
            return result
        else:
            return None
class RenrenPostMsg:
    '''
    RenrenPostMsg
    Posts a Renren status
    '''
    newStatusUrl = 'http://status.renren.com/doing/updateNew.do'

    def Handle(self, requester, param):
        requestToken, msg = param
        statusData = {'content':msg,
                      'isAtHome':'1',
                      'requestToken':requestToken}
        # Request() url-encodes the dict itself
        requester.Request(self.newStatusUrl, statusData)
        return True


class RenrenPostGroupMsg:
    '''
    RenrenPostGroupMsg
    Posts a status to a Renren group
    '''
    newGroupStatusUrl = 'http://qun.renren.com/qun/ugc/create/status'

    def Handle(self, requester, param):
        requestToken, groupId, msg = param
        statusData = {'minigroupId':groupId,
                      'content':msg,
                      'requestToken':requestToken}
        requester.Request(self.newGroupStatusUrl, statusData)
class RenrenFriendList:
    '''
    RenrenFriendList
    Renren friend list
    '''
    # Return a list of (friend id, friend name) pairs
    def Handler(self, requester, param):
        friendUrl = 'http://friend.renren.com/myfriendlistx.do'
        rawHtml = requester.Request(friendUrl).read()
        # The page's javascript stores all friends in "var friends=[...];"
        friendInfoPack = re.search(r'var friends=\[(.*?)\];', rawHtml).group(1)
        friendIdPattern = re.compile(r'"id":(\d+).*?"name":"(.*?)"')
        friendIdList = []
        for id, name in friendIdPattern.findall(friendInfoPack):
            friendIdList.append((id, Str2Uni(name)))
        return friendIdList
class RenrenAlbumDownloader:
    '''
    AlbumDownloader
    Album downloader. Records the ids of downloaded photos in config.cfg
    so that they are not downloaded again.
    '''
    threadNumber = 10   # number of download threads

    def Handler(self, requester, param):
        self.requester = requester
        userid, path = param
        self.__DownloadOneAlbum(userid, path)

    # Parse the person's name out of the html. When the pattern is missing
    # (for example on a captcha page) this raises AttributeError, which the
    # callers rely on.
    def __GetPeopleNameFromHtml(self, rawHtml):
        peopleNamePattern = re.compile(r'<h2>(.*?)<span>')
        peopleName = peopleNamePattern.search(rawHtml).group(1).strip()
        return peopleName

    def __GetAlbumsNameFromHtml(self, rawHtml):
        albumUrlPattern = re.compile(r'<a href="(.*?)" stats="album_album"><img.*?/>(.*?)</a>')
        albums = []
        # Point each album url at its reorder page, which lists the ids of
        # all photos in that album
        for album_url, album_name in albumUrlPattern.findall(rawHtml):
            albums.append((album_name.strip(), album_url + '/reorder'))
        return albums

    def __GetAlbumPhotos(self, userid, albumUrl):
        # Pattern that matches a photo id
        pidPattern = re.compile(r'<li pid="(\d+)".*?>.*?</li>', re.S)
        # Fetch the reorder page of the album
        result = self.requester.Request(albumUrl)
        rawHtml = result.read()
        photohtmlurl = []   # (photo id, photo page url) pairs
        for pid in pidPattern.findall(rawHtml):
            photohtmlurl.append((pid, 'http://photo.renren.com/photo/%s/photo-%s' % (userid, pid)))
        return photohtmlurl

    def __GetRealPhotoUrls(self, photohtmlurl):
        # Visit each photo page, extract the image address and unescape it
        imgPattern = re.compile(r'"largeurl":"(.*?)"')
        imgUrl = []     # (photo id, real image url) pairs
        for pid, url in photohtmlurl:
            result = self.requester.Request(url)
            rawHtml = result.read()
            for img in imgPattern.findall(rawHtml):
                imgUrl.append((pid, img.replace('\\', '')))
        return imgUrl

    def __DownloadAlbum(self, savepath, album_name, imgUrl, file):
        # Download every image of the album.
        # Put the files to download into a queue
        queue = Queue()
        failedQueue = Queue()
        for pid, url in imgUrl:
            imgname = url.split('/')[-1]
            queue.put((pid, url, savepath + Delimiter + imgname))
        # Start the download threads
        threads = []
        for i in range(self.threadNumber):
            downloader = Downloader(queue, failedQueue, file)
            threads.append(downloader)
            downloader.start()
        # Wait for all threads to finish
        for t in threads:
            t.join()
        # Return the queue of photo ids that failed
        return failedQueue

    # Download all albums of one person
    def __DownloadOneAlbum(self, userid, path='albums'):
        if not os.path.exists(path.decode('utf-8')): os.mkdir(path.decode('utf-8'))
        albumsUrl = 'http://www.renren.com/profile.do?id=%s&v=photo_ajax&undefined' % userid
        try:
            # Fetch the album names and urls
            result = self.requester.Request(albumsUrl)
            rawHtml = result.read()
            # Get the person's name
            peopleName = self.__GetPeopleNameFromHtml(rawHtml).strip()
            albums = self.__GetAlbumsNameFromHtml(rawHtml)
            # Create a folder named after the person
            path += Delimiter + peopleName
            if not os.path.exists(path.decode('utf-8')): os.mkdir(path.decode('utf-8'))
            # Start downloading the albums
            MutexPrint('Enter %s' % peopleName.decode('utf-8'))
            for album_name, albumUrl in albums:
                MutexPrint('Downloading Album: %s' % album_name.decode('utf-8'))
                # (photo id, photo page url) pairs for this album
                photohtmlurl = self.__GetAlbumPhotos(userid, albumUrl)
                # Create a folder named after the album
                album_name = album_name.replace('\\', '')
                album_name = album_name.replace('/', '')
                savepath = path + Delimiter + album_name
                if not os.path.exists(savepath.decode('utf-8')): os.mkdir(savepath.decode('utf-8'))
                finishedIdSet = set()
                totalIdSet = set(pid for pid, url in photohtmlurl)
                configFile = savepath + Delimiter + ConfigFilename
                if os.path.isfile(configFile):
                    # Read the ids of finished photos so that their pages
                    # are not fetched again
                    file = open(configFile.decode('utf-8'), 'r')
                    finishedIdSet = set(line.strip() for line in file)
                    file.close()
                # Computed even when no config file exists yet; otherwise a
                # first run would find nothing to download
                newDownloadIdSet = totalIdSet - finishedIdSet
                newDownloadPhotoHtmlUrl = ((pid, url) for pid, url in photohtmlurl if pid in newDownloadIdSet)
                imgUrl = self.__GetRealPhotoUrls(newDownloadPhotoHtmlUrl)
                # Download the photos
                failedQueue = Queue()
                file = None
                try:
                    # Rewrite the config file with the finished ids; the
                    # download threads append each new id as they go
                    file = open(configFile.decode('utf-8'), 'w')
                    for id in finishedIdSet:
                        file.write(id + '\n')
                    file.flush()
                    failedQueue = self.__DownloadAlbum(savepath, album_name, imgUrl, file)
                except Exception, e:
                    print 'Error when downloading.', e
                finally:
                    # Drop the ids of the photos that failed to download
                    while not failedQueue.empty():
                        totalIdSet.discard(failedQueue.get())
                    if file is not None: file.close()
        except AttributeError, e:
            # Passed up so that the caller can ask for the captcha
            raise
        except Exception, e:
            print 'Error! Please contact QQ: 414112390'
            print e
class AutoRenrenDownloader:
    '''
    AutoRenrenDownloader
    Automatically downloads the albums of all friends. Downloads can be
    resumed: if one run does not finish, the next run carries on from there.
    '''
    def handler(self, requester, param):
        self.requester = requester
        path, threadnumber = param
        self.__DownloadFriendsAlbums(path, threadnumber)

    #------------------------------------------------------------------------------
    # Worker thread that downloads friends' albums
    class FriendDownloader(threading.Thread):
        def __init__(self, requester, queue, file):
            threading.Thread.__init__(self)
            self.file = file
            self.requester = requester
            self.queue = queue

        def run(self):
            try:
                while True:
                    # get_nowait() raises Empty once the queue is drained
                    id, path = self.queue.get_nowait()
                    downloader = RenrenAlbumDownloader()
                    downloader.Handler(self.requester, (id, path))
                    # This friend is done, so take the id off the task list
                    with GlobalShelveMutex:
                        self.file['TaskList'].remove(id)
            except Empty:
                pass
            except AttributeError, e:
                print 'Renren may have decided that you visited 100 friends; open any friend\'s home page on Renren and enter the captcha'
                #print e
            except ValueError, e:
                print id
                print e

    def __DownloadFriendsAlbums(self, path='albums', threadnumber=10):
        if not os.path.exists(path.decode('utf-8')): os.mkdir(path.decode('utf-8'))
        friendsList = RenrenFriendList().Handler(self.requester, None)
        friendNames = dict(friendsList)
        db = shelve.open(LastUpdatedFileName, writeback = True)
        if not db.has_key('TaskList'): db['TaskList'] = []
        # An empty task list means the last run finished: start over
        if len(db['TaskList']) == 0:
            db['TaskList'] = [id for id, realName in friendsList]
        updateList = db['TaskList']
        i = 1
        print "Friends to update in this run:"
        # Queue every friend that is still pending
        queue = Queue()
        for id in updateList:
            print "%s:\t%s\t" % (i, id),
            print '' if os.name == 'nt' else friendNames.get(id, '')
            i += 1
            queue.put((id, path))
        # Download the friends' albums
        DownloadersList = []
        try:
            for i in range(threadnumber):
                friendDownloader = self.FriendDownloader(self.requester, queue, db)
                friendDownloader.start()
                DownloadersList.append(friendDownloader)
            for downloader in DownloadersList:
                downloader.join()
        except Exception, e:
            print '-' * 100 + "\nPlease Goto Renren.com\n" + '-' * 100
            print e
        finally:
            db.close()
class SuperRenren:
    '''
    SuperRenren
    Renren controller
    '''
    # Create: log in and remember the user id and request token
    def Create(self, username, password):
        self.requester = RenrenRequester()
        if self.requester.Create(username, password):
            self.userid = self.requester.userid
            self.requestToken = self.requester.requestToken
            return True
        return False

    # Post a personal status
    def PostMsg(self, msg):
        poster = RenrenPostMsg()
        poster.Handle(self.requester, (self.requestToken, msg))

    # Post a group status
    def PostGroupMsg(self, groupId, msg):
        poster = RenrenPostGroupMsg()
        poster.Handle(self.requester, (self.requestToken, groupId, msg))

    # Download one person's albums
    def DownloadAlbum(self, userId, path = 'albums'):
        downloader = RenrenAlbumDownloader()
        downloader.Handler(self.requester, (userId, path))

    # Automatically download the albums of all friends
    def DownloadAllFriendsAlbums(self, path = 'albums', threadnumber = 10):
        downloader = AutoRenrenDownloader()
        downloader.handler(self.requester, (path, threadnumber))