Python crawler: scraping the entire Romance of the Three Kingdoms with bs4 data parsing (without the shicimingju site)
Requirement: use bs4 to scrape every chapter of the Romance of the Three Kingdoms novel and store the content on local disk.
The shicimingju site is no longer reachable, so I found another site to scrape instead; the approach is much the same.
First, fetch the page data of the index page:
url = 'http://sanguo.5000yan.com/baihuawen/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
Note the headers here: this is the UA (User-Agent) spoofing, which makes the request look like it comes from a normal browser.
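As a quick sanity check (a small sketch of my own, not part of the original walkthrough), you can confirm that the spoofed UA gets a normal response and see what encoding requests thinks the page uses:

```python
import requests

url = 'http://sanguo.5000yan.com/baihuawen/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

resp = requests.get(url=url, headers=headers)
print(resp.status_code)        # 200 means the request was accepted
print(resp.apparent_encoding)  # the encoding requests guesses from the response body
```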
![screenshot](https://img-blog.csdnimg.cn/20210204210631610.PNG?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3p6dzIzMzMzMw==,size_16,color_FFFFFF,t_70#pic_center)
Then parse the chapter titles:
soup = BeautifulSoup(page_text, 'lxml')
# parse the chapter titles
li_list = soup.select('.list > ul > li > a')
# print(li_list)
fp = open('./sanguo.txt', 'w', encoding='utf-8')  # prepare for writing to local disk later
for a in li_list:
    title = a.string
    detail_url = 'http://sanguo.5000yan.com/' + a['href']
    # request the detail page and parse out the chapter content
    detail_page_text = requests.get(url=detail_url, headers=headers).text
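To see what the selector actually matched (a small sketch continuing from the snippet above), each element of li_list is an `<a>` tag: its text is the chapter title and its href is the link to the detail page:

```python
# assumes soup and li_list from the snippet above
for a in li_list[:3]:            # peek at the first few chapter links
    print(a.string, a['href'])   # chapter title and its href
```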
Why soup.select('.list > ul > li > a') picks out the chapter links can be seen from the part marked in the figure above.
At the same time we obtain the new URL for each chapter, i.e. detail_url = 'http://sanguo.5000yan.com/' + a['href'].
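String concatenation only works if a['href'] is a relative path. A slightly more robust alternative (just a sketch, not what this post uses) is urljoin, which handles relative and absolute hrefs alike:

```python
from urllib.parse import urljoin

detail_url = urljoin('http://sanguo.5000yan.com/', a['href'])  # safe for both relative and absolute hrefs
```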
Then we parse out the content of each chapter:
detail_soup = BeautifulSoup(detail_page_text, 'lxml')
div_tag = detail_soup.find('div', class_='grap')
# the chapter content has now been parsed out
content = div_tag.text
![screenshot](https://img-blog.csdnimg.cn/2021020421194695.PNG?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3p6dzIzMzMzMw==,size_16,color_FFFFFF,t_70#pic_center)
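.text keeps whatever whitespace the HTML happened to contain. If the result looks cramped or padded with blank lines, get_text() with a separator (a minor variation on the line above, assuming the chapter paragraphs all sit inside the grap div) gives cleaner output:

```python
content = div_tag.get_text('\n', strip=True)  # join the text pieces with newlines, stripping surrounding whitespace
```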
Finally, store it on local disk:
fp.write(title + ':' + content + '\n')
print(title, '爬取成功')  # chapter scraped successfully
If you stop here, the text you scrape will come out garbled (mojibake). To fix that, add the following two lines right after each response is received:
page_text=page_text.encode("ISO-8859-1")
page_text=page_text.decode('utf-8')
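These two lines undo requests' default ISO-8859-1 decoding and re-decode the raw bytes as UTF-8. An equivalent and slightly more direct fix (a sketch, assuming the site really serves UTF-8) is to set the encoding on the response before reading .text, or to decode .content yourself:

```python
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'     # tell requests the real encoding before reading .text
page_text = response.text

# or, equivalently, decode the raw bytes directly:
page_text = response.content.decode('utf-8')
```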
The complete code is as follows:
import lxml
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # crawl the page data of the index page
    url = 'http://sanguo.5000yan.com/baihuawen/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).text
    page_text = page_text.encode("ISO-8859-1")
    page_text = page_text.decode('utf-8')
    # parse the chapter titles and detail-page URLs out of the index page
    # instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(page_text, 'lxml')
    # parse the chapter titles
    li_list = soup.select('.list > ul > li > a')
    # print(li_list)
    fp = open('./sanguo.txt', 'w', encoding='utf-8')
    for a in li_list:
        title = a.string
        detail_url = 'http://sanguo.5000yan.com/' + a['href']
        # request the detail page and parse out the chapter content
        detail_page_text = requests.get(url=detail_url, headers=headers).text
        detail_page_text = detail_page_text.encode("ISO-8859-1")
        detail_page_text = detail_page_text.decode('utf-8')
        # parse the chapter content out of the detail page
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_soup.find('div', class_='grap')
        # the chapter content has now been parsed out
        content = div_tag.text
        fp.write(title + ':' + content + '\n')
        print(title, '爬取成功')  # chapter scraped successfully
    fp.close()  # flush and close the output file
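If the site is slow or occasionally drops a connection, a small optional hardening of the loop above (my own sketch, reusing fp, li_list and headers from the complete code; not part of the original post) adds a timeout, skips chapters that fail, and waits briefly between requests:

```python
import time

for a in li_list:
    title = a.string
    detail_url = 'http://sanguo.5000yan.com/' + a['href']
    try:
        resp = requests.get(url=detail_url, headers=headers, timeout=10)
        resp.encoding = 'utf-8'
        detail_page_text = resp.text
    except requests.RequestException as e:
        print(title, '请求失败,跳过:', e)   # request failed, skip this chapter
        continue
    detail_soup = BeautifulSoup(detail_page_text, 'lxml')
    div_tag = detail_soup.find('div', class_='grap')
    if div_tag is None:
        print(title, '未找到正文,跳过')      # content div not found, skip
        continue
    fp.write(title + ':' + div_tag.text + '\n')
    print(title, '爬取成功')
    time.sleep(1)  # be polite to the server between requests
```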