Writing a Python Web Crawler to Scrape Your Code from an OJ - 白红宇

Published: 2019-06-08

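Our OJ (online judge) is being upgraded and submitted code may be lost, so it has to be backed up beforehand. At first I naively copied and pasted everything by hand. When that became unbearable, a hint from 大潇 and the original code from 聪神 got me started on a web crawler.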

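I had not touched Python for a while. The original crawler was written for Python 2.7. I tried porting it to 3.0, but that meant replacing a lot of packages, which felt tedious, so I simply improved the 2.7 version instead.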
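For reference, the package replacements mentioned above are mostly renamed modules. The following is a minimal Python 3 sketch of the cookie-aware opener and login request, not the author's port: the helper names build_opener_with_cookies and build_login_request are hypothetical, the credentials are placeholders, and the login URL and form field names are taken from the listing in this post.

```python
# Python 3 counterparts of the Python 2.7 modules used in this post:
#   urllib2          -> urllib.request
#   urllib.urlencode -> urllib.parse.urlencode
#   cookielib        -> http.cookiejar
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://acm.njupt.edu.cn/acmhome/login.do"  # URL taken from the post


def build_opener_with_cookies():
    """Return an opener that keeps the session cookie between requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_login_request(user, password):
    """Build the login POST; in Python 3 the form body must be bytes."""
    form = urllib.parse.urlencode({"userName": user, "password": password})
    return urllib.request.Request(LOGIN_URL, data=form.encode("ascii"))


if __name__ == "__main__":
    req = build_login_request("someone", "secret")  # placeholder credentials
    print(req.get_method(), req.full_url)
    print(req.data)
```

Nothing is sent over the network here. Actually logging in would mean passing the request to the opener's open() method, and the response body would then be bytes that need decoding before any regular expression written for text can be applied.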

First, here is the original code, to which I have added some comments:

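# -*- coding: cp936 -*-
import urllib2
import urllib
import re
import thread
import time
import cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)

# Regular expressions that filter the markup out of the scraped pages.
# Several literals in this listing were damaged when the page was extracted:
# HTML entities were decoded and HTML tags were stripped. Entities are
# restored where the replacement in replace_char makes them unambiguous;
# everything else is marked as lost.
class Tool:
    A = re.compile("&nbsp\;")                  # A-J match pieces of markup
    B = re.compile("\...")                     # literal lost; replaced with "\n\t", so presumably a line-break tag
    C = re.compile("&lt\;")
    D = re.compile("&gt\;")
    E = re.compile("&quot\;")
    F = re.compile("&amp;")
    G = re.compile("Times\ New\ Roman\"\>")
    H = re.compile("\...")                     # literal lost (a stripped HTML tag)
    I = re.compile("'")                        # shown as extracted; the original entity was decoded
    J = re.compile(r'语言.*?face=')

    def replace_char(self, x):                 # replace each piece of markup with its target text
        x = self.A.sub(" ", x)
        x = self.B.sub("\n\t", x)
        x = self.C.sub("<", x)
        x = self.D.sub(">", x)
        x = self.E.sub("\"", x)
        x = self.F.sub("&", x)
        x = self.G.sub("", x)
        x = self.H.sub("", x)
        x = self.I.sub("\'", x)
        x = self.J.sub("", x)
        return x

class HTML_Model:
    def __init__(self, u, p):
        self.userName = u                      # login credentials: username and password
        self.passWord = p
        self.mytool = Tool()
        self.page = 1                          # start from the first page of the submission list
        self.postdata = urllib.urlencode({
            'userName': self.userName,
            'password': self.passWord})

    def GetPage(self):
        myUrl = "http://acm.njupt.edu.cn/acmhome/login.do"
        # The request carries the URL and the login form
        req = urllib2.Request(
            url = myUrl,
            data = self.postdata
            )
        # Open the URL, which logs in
        myResponse = urllib2.urlopen(req)
        # Read the page
        myPage = myResponse.read()
        flag = True
        # Keep fetching the next page while flag is True
        while flag:
            # URL of the next page
            myUrl = "http://acm.njupt.edu.cn/acmhome/showstatus.do?problemId=null&contestId=null&userName="+self.userName+"&result=1&language=&page="+str(self.page)
            #print(myUrl)
            myResponse = urllib2.urlopen(myUrl)
            # Read the next page
            myPage = myResponse.read()
            # Use a regular expression to check whether there is another page,
            # and update flag. The idea: if the current page lists submitted
            # code, it contains a tag such as "G++". So if my submissions fill
            # only 84 pages, flag becomes False on page 85 and page 86 is
            # never requested.
            st = "\...
            # [lost in extraction: the rest of this pattern and the code that updates flag]
            # Find the links to all solutions on the current page and collect
            # them in the list myItem
            myItem = re.findall(r'...
            # [lost in extraction: the rest of this call and the loop over myItem]
                # For each solution link, visit the page it points to
                url = 'http://acm.njupt.edu.cn/acmhome/solutionCode.do?id='+item[37:len(item)-2]
                #print(url)
                myResponse = urllib2.urlopen(url)
                myPage = myResponse.read()
                # The HTML tags inside the next pattern were stripped; it is shown as it survived
                mytem = re.findall(r'语言.*?.*?Times New Roman\"\>.*?\',myPage,re.S)
                #print(mytem)
                sName = re.findall(r'源码--.*?...
                # [lost in extraction: the rest of this call and the loop over sName]
                    # sname contains the problem number
                    f = open(sname[2:len(sname)-8]+'.txt','w+')
                    # Write the code, cleaned by the filter above, to the file
                    f.write(self.mytool.replace_char(mytem[0]))
                    f.close()
                    print('done!')
            self.page = self.page+1

print u'plz input the name'
u = raw_input()
print u'plz input password'
p = raw_input()
#u = "B08020129"
#p = *******"
myModel = HTML_Model(u,p)
myModel.GetPage()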

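This code has two problems:

First, the tag-matching patterns do not work across lines, so the scraped code still contains tags that span several lines, and the pure code has to be extracted by hand.

Second, the code page does not carry the problem's title, so only the problem number is used as the file name. If the upgraded OJ changes the order of its problems, the saved code can no longer be matched to the right problem.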

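The first problem is easy to fix:

By default, "." in a regular expression does not match a newline. Passing re.DOTALL as the second argument when compiling the pattern lets it match across lines.

For example:

J = re.compile(r'语言.*?face=',re.DOTALL)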
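The effect of the flag can be seen on a small made-up string. In this sketch the page fragment is hypothetical and the English word "Language" stands in for the 语言 label that the real pattern matches.

```python
import re

# Hypothetical fragment: a tag whose attributes wrap onto a second line.
page = 'Language: G++<font\n face="Times New Roman">int main() {}'

pattern = r'Language.*?face='

without_flag = re.compile(pattern)
with_flag = re.compile(pattern, re.DOTALL)

# Without the flag "." stops at the newline, so nothing matches
# and the text comes back unchanged.
print(without_flag.sub("", page) == page)

# With re.DOTALL the match runs across the line break and is removed.
print(with_flag.sub("", page))
```

re.S, which the listing passes to re.findall, is the short name for the same flag.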

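For the second problem, the problem number can be used to fetch the problem's own page (rather than the code page used so far), and the title can then be extracted from that page.

On the problem page I found that only the title is wrapped in <strong></strong> tags, so it can be matched like this:

sName2=re.findall(r'<strong>([^<]+)</strong>',myPage2,re.S)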
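As a small illustration of that match, the sketch below runs the same pattern on a made-up page. The sample HTML and the helper name extract_title are assumptions; the real problem page may be laid out differently.

```python
import re

# Hypothetical stand-in for a problem page; the real markup may differ.
sample_page = """
<html><body>
<center><strong>A + B Problem</strong></center>
<p>Compute the sum of two integers.</p>
</body></html>
"""


def extract_title(html):
    """Return the text of the first <strong> element, or None if there is none."""
    titles = re.findall(r'<strong>([^<]+)</strong>', html, re.S)
    return titles[0] if titles else None


print(extract_title(sample_page))
print(extract_title("<p>no title</p>"))
```

Returning None when nothing matches avoids the IndexError that indexing the result with [0] directly would raise on a page without a title.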

The file name also should not contain spaces, so they are stripped out as well:

sname2=sname2.replace(" ","")

Even so, creating the file still raises an exception from time to time, although running the script again often makes the error go away.
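The post does not say which exception is raised. One possible cause, and this is only an assumption, is a title containing characters that the file system rejects, such as "/" or "?". A minimal sketch of a stricter clean-up follows; safe_filename is a hypothetical helper, not part of the original script.

```python
import re

# Characters that Windows rejects in file names, plus any whitespace.
_UNSAFE = re.compile(r'[\\/:*?"<>|\s]+')


def safe_filename(problem_id, title):
    """Build '<id>.<title>.txt' with unsafe characters removed from the title."""
    cleaned = _UNSAFE.sub("", title)
    return "{0}.{1}.txt".format(problem_id, cleaned or "untitled")


print(safe_filename("1001", "A + B Problem"))
print(safe_filename("1002", "What is N/M?"))
```

If the failures turn out to be intermittent network errors instead, this clean-up would not help, and retrying the request would be the more relevant fix.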

Below is the improved code. The changed parts were set in bold in the original post.

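# -*- coding: cp936 -*-
import urllib2
import urllib
import re
import thread
import time
import cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)

# Several literals in this listing were damaged when the page was extracted:
# HTML entities were decoded and HTML tags were stripped. Entities are
# restored where the replacement in replace_char makes them unambiguous;
# everything else is marked as lost.
class Tool:
    A = re.compile("&nbsp\;")
    B = re.compile("\...")                     # literal lost; replaced with "\n\t", so presumably a line-break tag
    C = re.compile("&lt\;")
    D = re.compile("&gt\;")
    E = re.compile("&quot\;")
    F = re.compile("&amp;")
    G = re.compile("\"Times\ New\ Roman\"\>")
    H = re.compile("\...")                     # literal lost (a stripped HTML tag)
    I = re.compile("'")                        # shown as extracted; the original entity was decoded
    J = re.compile(r'语言.*?face=',re.DOTALL)

    def replace_char(self, x):
        x = self.A.sub(" ", x)
        x = self.B.sub("\n\t", x)
        x = self.C.sub("<", x)
        x = self.D.sub(">", x)
        x = self.E.sub("\"", x)
        x = self.F.sub("&", x)
        x = self.G.sub("", x)
        x = self.H.sub("", x)
        x = self.I.sub("\'", x)
        x = self.J.sub("", x)
        return x

class HTML_Model:
    def __init__(self, u, p):
        self.userName = u
        self.passWord = p
        self.mytool = Tool()
        self.page = 81
        self.postdata = urllib.urlencode({
            'userName': self.userName,
            'password': self.passWord})

    def GetPage(self):
        myUrl = "http://acm.njupt.edu.cn/acmhome/login.do"
        req = urllib2.Request(
            url = myUrl,
            data = self.postdata
            )
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        flag = True
        while flag:
            myUrl = "http://acm.njupt.edu.cn/acmhome/showstatus.do?problemId=null&contestId=null&userName="+self.userName+"&result=1&language=&page="+str(self.page)
            #print(myUrl)
            myResponse = urllib2.urlopen(myUrl)
            myPage = myResponse.read()
            st = "\...
            # [lost in extraction: the rest of this pattern, the code that
            #  updates flag, the re.findall call that builds myItem, the loop
            #  over myItem, the request for each solution page, and the start
            #  of the re.findall call whose tail survives on the next line]
                ....*?Times New Roman\"\>.*?\',myPage,re.S)
                #print(mytem)
                sName = re.findall(r'源码--.*?...
                # [lost in extraction: the rest of this call, the loop over
                #  sName, and the request that reads the problem page into
                #  myPage2; the next line is restored from the text above]
                    sName2 = re.findall(r'<strong>([^<]+)</strong>',myPage2,re.S)
                    sname2 = sName2[0]
                    sname2 = sname2.replace(" ","")
                    # print(sName)
                    print(sname[8:len(sname)-8]+'.'+sname2[0:len(sname2)])
                    f = open(sname[8:len(sname)-8]+'.'+sname2[0:len(sname2)]+'.txt','w+')
                    f.write(self.mytool.replace_char(mytem[0]))
                    f.close()
                    print('done!')
            print(self.page)
            self.page = self.page+1

#print u'plz input the name'
#u=raw_input()
#print u'plz input password'
#p=raw_input()
u = "LTianchao"
p = "******"
myModel = HTML_Model(u,p)
myModel.GetPage()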

As far as scraping web pages with Python goes, this is only a very simple demo. There is a lot more to learn in depth (if I find the time).

Reposted from: https://www.cnblogs.com/mfrbuaa/p/4374113.html
