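It has been a long time since I last wrote a web crawler, so I used this freelance job as a chance to write one.
Contents
- 1. Project requirements
- 2. Approach
- 3. Problems encountered when processing the verses
- 4. Directory structure
- 5. Implementation steps
- 6. Takeaways
- 7. Shortcomings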
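1. Project requirements
Implement a classical Chinese poetry fill-in-the-blank game in Python.
Composition of the poem library:
Junior high school classical poems
Note: the poem library is taken from gushiwen.cn (古诗文网): https://so.gushiwen.cn/gushi/chuzhong.aspx
Game features:
1) The following game modes
- Verse matching: given the first half of a line, supply the second half, or given the second half, supply the first. Alternatively, show the whole poem with a famous line blanked out for the player to fill in
- Guess the famous line: show 20 individual characters in shuffled order, from which the player assembles a famous line
- Fill in the word: show a famous line with one character or one word blanked out, and the player fills it in
- First-character relay: given the first character, complete the line or the whole poem
- Guess the author
- Guess the title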
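The "fill in the word" mode from the list above is not part of the code later in this post. A minimal sketch of how it might work is shown here, assuming a single character is blanked out; the names `blank_one_char`, `check_fill` and `PUNCTUATION` are made up for this illustration.

```python
import random

# Characters that should never be chosen as the blank (ASCII and full-width).
PUNCTUATION = ",,。!?;、"


def blank_one_char(line, rng=random):
    """Replace one non-punctuation character of `line` with '__'.

    Returns (masked_line, answer). `rng` can be a random.Random instance
    so that the choice is reproducible.
    """
    positions = [i for i, ch in enumerate(line) if ch not in PUNCTUATION]
    if not positions:
        raise ValueError("line contains no character to blank out")
    i = rng.choice(positions)
    return line[:i] + "__" + line[i + 1:], line[i]


def check_fill(line, masked, guess):
    """A guess is correct if putting it back into the blank restores the line."""
    return masked.replace("__", guess, 1) == line
```

Blanking a whole word instead of one character would need word segmentation, which this sketch does not attempt.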
2. Approach
- Fetch the poems (stored as a dictionary)
- Implement an API for each game mode
- Implement the game logic
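To make the first step concrete, the dictionary built by the scraper in section 5 maps each grade heading to a list of poem records with the keys `title`, `author` and `content`. The sketch below shows that shape with one sample entry; the entry itself (and its grade) is only an illustration, and `iter_poems` is a hypothetical helper, not part of the original code.

```python
# Shape of the poem store: grade heading -> list of poem records.
poem_store = {
    "七年级上册": [
        {
            "title": "闻王昌龄左迁龙标遥有此寄",
            "author": "李白",
            # The scraper keeps the raw text nodes, so content is a list of strings.
            "content": ["杨花落尽子规啼,闻道龙标过五溪。", "我寄愁心与明月,随君直到夜郎西。"],
        },
    ],
}


def iter_poems(store):
    """Yield (grade, poem) pairs from the nested store."""
    for grade, poems in store.items():
        for poem in poems:
            yield grade, poem
```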
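3. Problems encountered when processing the verses
- Different grades study different poems, so the poems need to be stored by grade
- Some verses contain \n, which has to be removed
- Some verses contain annotations (such as '(随君 一作:随风)' in the line below), which get in the way of long-sentence completion and have to be removed
我寄愁心与明月,随君直到夜郎西。(随君 一作:随风)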
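The last two problems can be handled with a small cleaning function. This is a minimal sketch, assuming annotations are always enclosed in a single pair of parentheses (ASCII or full-width) and are never nested; `clean_verse` and `ANNOTATION` are names chosen for this example.

```python
import re

# Matches a parenthesised note in either ASCII or full-width parentheses.
ANNOTATION = re.compile(r"[((][^()()]*[))]")


def clean_verse(text):
    """Drop line breaks and parenthesised variant-reading notes."""
    text = text.replace("\n", "").replace("\r", "")
    return ANNOTATION.sub("", text).strip()
```

Applied to the line above, `clean_verse` keeps the verse and drops the '(随君 一作:随风)' note.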
4. Directory structure
5. Implementation steps
# getPoem.py
import requests
from lxml import etree
from urllib.parse import urljoin


class getPoem:
    def __init__(self):
        self.url = 'https://so.gushiwen.cn/gushi/chuzhong.aspx'
        self.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, '
                                      'like Gecko) Chrome/84.0.4147.135 Safari/537.36'}
        self.pl = dict()  # grade heading -> list of poem dicts

    def get_html(self):
        """Download the index page and return its HTML text."""
        try:
            response = requests.get(url=self.url, headers=self.headers, timeout=10)
        except requests.RequestException as e:
            raise ConnectionError('Failed to fetch the page!') from e
        if response.status_code != 200:
            # Raise instead of returning None, which would only fail later in parse()
            raise ConnectionError('Page fetch error: status code {}'.format(response.status_code))
        return response.text

    def parse(self):
        """Return one element per grade section of the index page."""
        source = etree.HTML(self.get_html())
        return source.xpath('//div[@class="typecont"]')

    def get_poem_list(self):
        """Fill self.pl with the poems of every grade."""
        for chapter in self.parse():
            grade = chapter.xpath('./div/strong/text()')[0]
            sector = list()
            for title in chapter.xpath('./span'):
                head = title.xpath('./a/text()')[0]
                # urljoin leaves absolute links unchanged and resolves relative ones against the index URL
                poem_link = urljoin(self.url, title.xpath('./a/@href')[0])
                poem = requests.get(poem_link, headers=self.headers, timeout=10)
                page = etree.HTML(poem.text)
                author = page.xpath('//p[@class="source"]/a[2]/text()')[0]
                content = [word.strip('\n') for word in page.xpath('//div[@class="left"]/div[@class="sons"]['
                                                                   '1]/div[@class="cont"]/div['
                                                                   '@class="contson"]//text()')]
                sector.append({'title': head, 'author': author, 'content': content})
            self.pl.update({grade: sector})
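# play_items.py
import random
import re


class play_items:
    # Parenthesised notes such as '(随君 一作:随风)', in ASCII or full-width parentheses
    NOTE = re.compile(r'[((][^()()]*[))]')

    def __init__(self, poem):
        # Work on a copy so that the shared poem list is not modified
        self.poem = dict(poem)
        self.clean_poem()

    def clean_poem(self):
        """Join the scraped text fragments, then strip notes and line breaks."""
        content = self.poem['content']
        if not isinstance(content, str):
            # content is a list of text nodes; taking only content[0] would drop most of the poem
            content = ''.join(content)
        self.poem['content'] = self.NOTE.sub('', content).replace('\n', '').strip()

    def sentences(self):
        """Split the poem into sentences on '。' and drop empty pieces."""
        return [s.strip() for s in self.poem['content'].split('。') if s.strip()]

    def fill_sentence(self):
        """
        Sentence completion: show the first clause and ask for the second
        :return:
        """
        # Keep only sentences that contain a comma, so there is a second clause to ask for
        content = [s for s in self.sentences() if re.search('[,,]', s)]
        if not content:
            print('This poem has no sentence suitable for this game.')
            return
        parts = re.split('[,,]', random.choice(content))
        target, answer = parts[0], parts[1]
        print('Complete the verse after the comma:')
        response = input('{},'.format(target))
        if response.strip() == answer:
            print('Congratulations, that is correct!')
        else:
            print('Sorry, that is wrong! The correct answer is: {}. The line comes from {}'.format(answer, self.poem['title']))

    def recombine(self):
        """
        Reassemble shuffled characters into a verse that exists or has been studied
        :return:
        """
        content = self.sentences()
        if not content:
            print('This poem has no sentence suitable for this game.')
            return
        sentence = random.choice(content)
        res = list(re.sub('[,,。]', '', sentence))
        random.shuffle(res)
        print('Reassemble the characters below into a verse that exists or that you have studied: ')
        print(' '.join(res))
        response = input('Write your answer here (use Chinese punctuation): ')
        if response.strip() == sentence:
            print('Congratulations, that is correct!')
        else:
            print('Sorry, that is wrong! The correct answer is: {}'.format(sentence))

    def guess_title(self):
        """
        Guess the title from the poem text
        :return:
        """
        title = self.poem['title']
        print('Guess the title of the poem below:')
        print(self.poem['content'])
        response = input('Your answer: ')
        if response.strip() == title:
            print('Congratulations, that is correct!')
        else:
            print('Sorry, that is wrong! The correct answer is: {}'.format(title))

    def guess_author(self):
        """
        Guess the author from the poem text
        :return:
        """
        author = self.poem['author']
        print('Guess the author of the poem below:')
        print(self.poem['content'])
        response = input('Your answer: ')
        if response.strip() == author:
            print('Congratulations, that is correct!')
        else:
            print('Sorry, that is wrong! The author is: {}'.format(author))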
# main script (entry point)
from getPoem import getPoem
from play_items import play_items
import random
import time

print("{:*^40}".format('Initializing poem data......'))
instance_api = getPoem()
instance_api.get_poem_list()  # fills instance_api.pl and returns nothing
poem_list = instance_api.pl
print("{:*^40}".format('Initialization complete......'))
active = True
while active:
    player_grade = input('Enter your grade (e.g. 七上): ')
    poem_valid = []
    if player_grade == '七上':
        poem_valid = poem_list['七年级上册'] + poem_list['七年级上册(课外古诗词诵读)']
    elif player_grade == '七下':
        poem_valid = poem_list['七年级下册'] + poem_list['七年级下册(课外古诗词诵读)']
    elif player_grade == '八上':
        poem_valid = poem_list['八年级上册'] + poem_list['八年级上册(课外古诗词诵读)']
    elif player_grade == '八下':
        poem_valid = poem_list['八年级下册'] + poem_list['八年级下册(课外古诗词诵读)']
    elif player_grade == '九上':
        poem_valid = poem_list['九年级上册'] + poem_list['九年级上册(课外古诗词诵读)']
    elif player_grade == '九下':
        poem_valid = poem_list['九年级下册'] + poem_list['九年级下册(课外古诗词诵读)']
    else:
        print('Please choose a grade again:')
    random.shuffle(poem_valid)
    if poem_valid:
        while True:
            print('{}'.format('='*40))
            print('The following games are currently supported:\n')
            print('1. Sentence completion')
            print('2. Verse reassembly')
            print('3. Guess the title')
            print('4. Guess the author')
            print('5. Leave the game menu and choose a grade again')
            print('0. Quit the game')
            print('{}'.format('='*40))
            try:
                # int() instead of eval(): eval would execute whatever the player types
                game_type = int(input('Which game do you want to play (enter its number): '))
            except ValueError:
                print('Please enter a number from 0 to 5.')
                continue
            print('{}'.format('='*40))
            if game_type == 0:
                active = False
                break
            if game_type == 5:
                break
            # Pick a poem only after the exit options have been handled
            poem = random.choice(poem_valid)
            game = play_items(poem)
            if game_type == 1:
                game.fill_sentence()
            elif game_type == 2:
                game.recombine()
            elif game_type == 3:
                game.guess_title()
            elif game_type == 4:
                game.guess_author()
            else:
                continue
            time.sleep(3)
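The grade selection in the script above repeats the same pattern six times. One possible way to shorten it is a lookup table, sketched below. It assumes the extracurricular sections always use the suffix seen in the script, and it returns an empty list instead of raising `KeyError` when a heading is missing; `GRADE_NAMES`, `EXTRA_SUFFIX` and `poems_for_grade` are names invented for this example.

```python
# Shorthand typed by the player -> grade heading used as a key in the poem store.
GRADE_NAMES = {
    "七上": "七年级上册", "七下": "七年级下册",
    "八上": "八年级上册", "八下": "八年级下册",
    "九上": "九年级上册", "九下": "九年级下册",
}
# Suffix of the extracurricular section, copied from the keys used in the script.
EXTRA_SUFFIX = "(课外古诗词诵读)"


def poems_for_grade(poem_list, shorthand):
    """Return the textbook and extracurricular poems for a grade, or [] if unknown."""
    name = GRADE_NAMES.get(shorthand)
    if name is None:
        return []
    return poem_list.get(name, []) + poem_list.get(name + EXTRA_SUFFIX, [])
```

With this helper, the whole if/elif chain could become `poem_valid = poems_for_grade(poem_list, player_grade)`, followed by the existing "choose again" message when the result is empty.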
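6. Takeaways
- Used regular expressions for a more complex match, namely the third verse-processing problem above
- Got a small refresher on XPath parsing (I had not written a crawler for half a year)
7. Shortcomings
- The poems are not stored in a database, so every start-up is slow (everything has to be scraped from the web first) and puts some load on the site (mainly because my laptop had been reformatted and MySQL was not set up)
- Because the lines of a ci (词) are of unequal length, splitting on '。' is bound to cause problems
- This is a half-finished product. If you want to borrow from it, please apply the verse cleaning described above and fix the previous two points
(I'll write it properly the next time someone hires me for this, \吃瓜)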
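For the first shortcoming, a database is not strictly required: a local JSON file is one lightweight alternative that already avoids scraping on every start. This is a rough sketch, not what the project does; `load_or_scrape` and the `scrape` callable are hypothetical names.

```python
import json
import os


def load_or_scrape(cache_path, scrape):
    """Return the poem store from `cache_path`, calling `scrape()` only on a cache miss.

    `scrape` is any zero-argument callable that returns a JSON-serialisable dict.
    """
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    data = scrape()
    with open(cache_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Chinese text readable in the file
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data
```

With the classes from section 5, `scrape` could be a small function that creates a `getPoem` instance, calls `get_poem_list()` and returns its `pl` attribute. Deleting the cache file forces a fresh scrape.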