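This blog post covers the topic in detail: https://blog.csdn.net/weixin_42488570/article/details/80794087
Requirements: extract the user ID, user level, user gender, the text of the posted joke, the number of "funny" votes, and the number of comments, as shown in the figure below: ![](https://img-blog.csdnimg.cn/20210304203807571.png)
User ID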
user = re.findall('<h2.*?>(.*?)</h2>', text, flags=re.DOTALL)
![](https://img-blog.csdnimg.cn/20210304203926667.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NlcmVhc3Vlc3Vl,size_16,color_FFFFFF,t_70)
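To see why the `re.DOTALL` flag matters for this pattern, the sketch below runs it on a small hand-written snippet. The HTML in `sample_html` is an assumption made up for illustration, not copied from the site; the only point is that when the user name sits on its own line inside `<h2>`, `.` has to be allowed to match newlines.

```python
import re

# Hypothetical markup, written for illustration only.
sample_html = """
<a href="/users/1/">
<h2>
some_user
</h2>
</a>
"""

pattern = r'<h2.*?>(.*?)</h2>'

# Without re.DOTALL, "." stops at newlines, so nothing matches.
without_flag = re.findall(pattern, sample_html)

# With re.DOTALL, "." also matches "\n" and the name is captured,
# including the surrounding line breaks.
with_flag = re.findall(pattern, sample_html, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['\nsome_user\n']
```

Note that the captured name still carries the line breaks around it, which is why the raw output later needs cleaning.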
Text content
text = re.findall('<div class="content">.*?<span>(.*?)</span>', text, re.S)
![](https://img-blog.csdnimg.cn/2021030420503336.png)
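The content pattern can be tried out the same way. In the sketch below, `sample_html` is again hypothetical markup written for illustration. The non-greedy `.*?` between `<div class="content">` and `<span>` skips whatever sits between the two tags, and `re.S` is simply the short alias of `re.DOTALL`.

```python
import re

# Hypothetical markup, written for illustration only.
sample_html = """
<div class="content">
<span>
First line of the joke<br/>second line
</span>
</div>
<div class="content">
<span>
Another joke
</span>
</div>
"""

pattern = r'<div class="content">.*?<span>(.*?)</span>'
contents = re.findall(pattern, sample_html, re.S)  # re.S is an alias of re.DOTALL

# repr() makes the captured line breaks and leftover tags visible.
for content in contents:
    print(repr(content))
```

Each post yields one string, with its line breaks and any inline tags such as `<br/>` left in place.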
import requests
from lxml import etree
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'
    # 'referer': 'https://dytt8.net/html/gndy/dyzz/list_23_2.html'
}
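

def judgment_sex(class_name):
    # The pattern captures everything after "articleGender", including the
    # separating space, so test for membership instead of strict equality.
    if 'womenIcon' in class_name:
        return '女'  # female
    else:
        return '男'  # male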


def parse_page(url):
    response = requests.get(url, headers=headers)
    text = response.text
    # re.DOTALL (alias re.S) lets "." match newlines inside the tags
    users = re.findall(r'<h2.*?>(.*?)</h2>', text, flags=re.DOTALL)
    sexs = re.findall(r'<div class="articleGender(.*?)">', text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', text, re.S)
    laughs = re.findall(r'<i class="number.*?>(\d+)</i>', text, flags=re.DOTALL)
    info_lists = []
    for value in zip(users, sexs, contents, laughs):
        user, sex, content, laugh = value
        info = {
            'user': user,
            'sex': judgment_sex(sex),
            'content': content,
            'laugh': laugh
        }
        info_lists.append(info)
    print(info_lists)
    # Save to a local file (optional).
    # The file is opened once, as UTF-8, so Chinese text can be written
    # without raising UnicodeEncodeError.
    with open('C:\\Users\\wei\\Desktop\\qiushi.txt', 'a+', encoding='utf-8') as f:
        for info_list in info_lists:
            f.write(info_list['user'] + '\n')
            f.write(info_list['sex'] + '\n')
            f.write(info_list['content'] + '\n')
            f.write(info_list['laugh'] + '\n')


def spider():
    url = 'https://www.qiushibaike.com/text/page/2/'
    parse_page(url)


if __name__ == '__main__':
    spider()
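The requirements also list the user level and the comment count, which the code above does not extract. The following is only a sketch of how they might be pulled out with the same `re.findall` approach. The markup in `sample_html`, including the `stats-vote` and `stats-comments` class names and the idea that the number inside the gender element is the level, is an assumption for illustration and should be checked against the real page source.

```python
import re

# Hypothetical markup for one post, written for illustration only.
sample_html = """
<div class="articleGender womenIcon">25</div>
<span class="stats-vote"><i class="number">120</i> funny</span>
<span class="stats-comments"><a href="#"><i class="number">8</i> comments</a></span>
"""

# Gender class plus the number inside the same element (assumed to be the level).
gender_level = re.findall(
    r'<div class="articleGender (.*?)">(\d+)</div>', sample_html, re.S)

# Anchor each counter on its enclosing span so the two numbers are not mixed up.
laughs = re.findall(
    r'<span class="stats-vote">.*?<i class="number">(\d+)</i>', sample_html, re.S)
comments = re.findall(
    r'<span class="stats-comments">.*?<i class="number">(\d+)</i>', sample_html, re.S)

print(gender_level)  # [('womenIcon', '25')]
print(laughs)        # ['120']
print(comments)      # ['8']
```

Anchoring on the enclosing element keeps the vote count and the comment count in separate lists, which matters when the lists are later combined with `zip`.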
Result
![](https://img-blog.csdnimg.cn/20210305142942138.png)
![](https://img-blog.csdnimg.cn/20210305143001818.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NlcmVhc3Vlc3Vl,size_16,color_FFFFFF,t_70)
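As we can see, the output contains the results along with extra whitespace.
Optimization: strip out the extraneous characters.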
![](https://img-blog.csdnimg.cn/20210305143844172.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NlcmVhc3Vlc3Vl,size_16,color_FFFFFF,t_70)
The modified code is as follows:
![](https://img-blog.csdnimg.cn/20210305143900165.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NlcmVhc3Vlc3Vl,size_16,color_FFFFFF,t_70)
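The modified code is only available as a screenshot, so the following is a minimal sketch of one way to do this kind of cleanup, not necessarily the author's version. `clean_text` is a hypothetical helper name, and the raw strings are made-up examples of what the patterns can capture.

```python
import re


def clean_text(raw):
    """Remove HTML tags and collapse whitespace in a captured string."""
    without_tags = re.sub(r'<.*?>', ' ', raw)  # e.g. <br/> becomes a space
    return ' '.join(without_tags.split())      # trims ends, collapses runs


raw_user = '\nsome_user\n'
raw_content = '\n\n\nFirst line<br/>second line\n\n'

print(clean_text(raw_user))     # some_user
print(clean_text(raw_content))  # First line second line
```

Such a helper could be applied to `user` and `content` when building each `info` dictionary, so that both the printed list and the saved file hold the cleaned values.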