Scraping Static Web Pages with Python

Date: 2023-05-26

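Fan fiction on lofter is posted one piece at a time, and I was too lazy to hunt each one down, so I spent a little time writing a crawler that scrapes the text and saves it as local text files. It works mainly through the article search endpoint on a lofter author's page.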

Example: for the author 我是走高冷路线的, the article search URL is: http://sanliubixian.lofter.com/search?q=

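Appending an article title to this URL searches that author's articles for it. There is another useful property: her articles are titled with sequence numbers, such as 征服欲1, 征服欲2, and so on, so we can fetch them in a loop.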
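The numbered-title loop can be sketched roughly as follows. This is only an illustration of the idea: crawl_series, fetch, and max_parts are hypothetical names, and the real network request is replaced by a stand-in function so the stopping rule is easy to see.

```python
def crawl_series(name, fetch, max_parts=29):
    """Try name1, name2, ... and stop at the first title that fetch() cannot find."""
    saved = []
    for num in range(1, max_parts + 1):
        title = "%s%d" % (name, num)
        if not fetch(title):
            # The first missing number is treated as the end of the series.
            break
        saved.append(title)
    return saved


# Stand-in for a real search request (an assumption): only three parts exist.
existing = {"story1", "story2", "story3"}
print(crawl_series("story", lambda title: title in existing))
```

In the real crawler, fetch would request the search URL for the title and report whether an article was found.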

1. Preparation

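I hit a lot of pitfalls along the way and won't go through each one here. My local Python version is 2.7. Keep this in mind, because 2.7 and 3.x differ in several ways, and the difference that matters most here is the urllib module. The following blog post is a useful reference.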

python 2.xx: "import urllib.request" raises "No module named request" (典笛安's blog, CSDN)
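One common way to cope with the urllib difference is to try the Python 3 import locations first and fall back to the Python 2 ones. The sketch below is not the code used in this post; build_search_url and fetch_html are hypothetical helper names, and the short User-Agent default is a placeholder.

```python
# -*- coding: utf-8 -*-
try:
    # Python 3: quote lives in urllib.parse, Request/urlopen in urllib.request
    from urllib.parse import quote
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2: quote lives in urllib, Request/urlopen in urllib2
    from urllib import quote
    from urllib2 import Request, urlopen

SEARCH_URL = 'http://sanliubixian.lofter.com/search?q='


def build_search_url(keyword):
    """Percent-encode the keyword as UTF-8 and append it to the search URL."""
    if not isinstance(keyword, bytes):
        keyword = keyword.encode('utf-8')
    return SEARCH_URL + quote(keyword)


def fetch_html(keyword, user_agent='Mozilla/5.0'):
    """Download the search result page for the keyword (performs a network request)."""
    request = Request(build_search_url(keyword), headers={'User-Agent': user_agent})
    return urlopen(request).read()


print(build_search_url(u'征服欲1'))
```

Only build_search_url is called here, so running the snippet does not touch the network.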

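Second, install the web module: pip install web.py. The script also imports bs4, which comes from the beautifulsoup4 package (pip install beautifulsoup4).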

Third, there is the encoding issue. I recommend using an editor or IDE suited to Python; I use Sublime Text.

That's all there is to it. It's just a single .py file, so here's the code.

index.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Note: this script targets Python 2.7 (urllib2, print statements).
import re
import urllib
import urllib2
import json

import web
from bs4 import BeautifulSoup

urls = (
    '/', 'hello'
)
app = web.application(urls, globals())


# Search for one article title and save the article text to a local file.
def gettext(i):
    url = 'http://sanliubixian.lofter.com/search?q='
    keyword = i.encode(encoding='utf-8')
    key_code = urllib.quote(keyword)  # URL-encode the search keyword
    url_all = url + key_code
    header = {
        'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }  # request headers
    request = urllib2.Request(url_all, headers=header)
    reponse = urllib2.urlopen(request).read()
    html_doc = reponse
    # Create a BeautifulSoup parser object.
    # The string being replaced was lost in the original listing; '&nbsp;' is assumed here.
    soup = BeautifulSoup(html_doc.replace('&nbsp;', ' '), "html.parser", from_encoding="utf-8")
    # Extract the title; if there is none, the search found no article.
    title = soup.find('h2')
    print title
    if title is None:
        print "All articles have been fetched!"
        return "false"
    else:
        p_nodes = soup.find_all('p')
        # Write the file into the current directory
        fh = open("./" + title.get_text() + ".txt", "wb")
        fh.write(title.get_text().encode(encoding='utf-8'))
        fh.write('\r\n')
        for p_node in p_nodes:
            # print p_node.get_text()
            fh.write(p_node.get_text().encode(encoding='utf-8'))
            fh.write('\r\n')
        fh.close()
        print "Fetched: " + title.get_text().encode(encoding='utf-8')
        return "true"


class hello:
    def __init__(self):
        web.header('content-type', 'text/json')
        web.header('Access-Control-Allow-Origin', '*')
        web.header('Access-Control-Allow-Methods', 'GET, POST')

    def GET(self):
        i = web.input(name=None)
        # Try name1, name2, ... and stop at the first title that is not found
        for num in range(1, 30):
            s = i.name + str(num)
            result = gettext(s)
            if result == "false":
                break
        # Disabled JSON response:
        # t = {'msg': 'Starting to scrape...', 'title': title.get_text()}
        # s = {}
        # s['data'] = t
        # return json.dumps(s, ensure_ascii=False)

    def POST(self):
        a = int(web.input().a)
        b = int(web.input().b)
        return a + b


if __name__ == "__main__":
    app.run()
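As an alternative to encoding each string by hand before writing, the file-saving step could be sketched with io.open, which handles the UTF-8 encoding itself and behaves the same on Python 2.7 and 3.x. This is a sketch under assumptions, not the code used above: save_article is a hypothetical helper, and it assumes the title and paragraphs are already Unicode strings and that the title contains no characters that are invalid in a file name.

```python
# -*- coding: utf-8 -*-
import io
import os


def save_article(title, paragraphs, directory="."):
    """Write the title and paragraphs to <directory>/<title>.txt as UTF-8 text."""
    path = os.path.join(directory, title + u".txt")
    # newline="" keeps the explicit \r\n line endings exactly as written
    with io.open(path, "w", encoding="utf-8", newline="") as fh:
        fh.write(title + u"\r\n")
        for paragraph in paragraphs:
            fh.write(paragraph + u"\r\n")
    return path
```

With BeautifulSoup, the inputs would be title.get_text() and the get_text() of each p node, which are Unicode strings already.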

Writing this up took effort and plenty of trial and error, so please credit the source if you repost. Thanks!

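This is my first time working with Python. I found the material and read the documentation on my own, so I ask the professionals to forgive anything that looks unprofessional.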
