A Simple Guide to Using the BeautifulSoup Library

2023-05-16

What is BeautifulSoup

BeautifulSoup is a web page parsing library, used to parse HTML and extract data from web pages.

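Installing BeautifulSoup

BeautifulSoup is currently at version 4 (Beautiful Soup 4). Because it is a third-party library rather than part of the Python standard library, it has to be installed separately. To use it on your own computer, run one command in the terminal: pip install beautifulsoup4. (On a Mac, or on any system where pip does not point to Python 3, use pip3 install beautifulsoup4 instead.)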
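A quick way to confirm that the installation worked is to import the package and print its version. This is only a small check, assuming the package was installed into the same Python environment that runs the script:

```python
# The package is installed as "beautifulsoup4" but imported as "bs4".
import bs4

# Print the installed version string to confirm the import works.
print(bs4.__version__)
```

If the import raises ModuleNotFoundError, the package was most likely installed for a different Python interpreter.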

Parsing data

from bs4 import BeautifulSoup
# First argument: the markup to parse (usually a str). Second argument: the parser name.
soup = BeautifulSoup(html_text, 'html.parser')
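As a minimal sketch of the parsing step, the snippet below parses a short inline HTML string instead of a downloaded page. The HTML content and the variable name `html_text` are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html_text = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html_text, "html.parser")

print(type(soup))       # the parsed document object
print(soup.title.text)  # text inside the <title> tag
print(soup.p.text)      # text inside the first <p> tag
```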

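Extracting data

find() and find_all(); Tag objects
find() and find_all() are two methods of the BeautifulSoup object. They match HTML tags and attributes and extract the matching data from the BeautifulSoup object.
find() extracts only the first match. It searches the document from top to bottom and returns as soon as it finds the first element that satisfies the conditions, regardless of whether later elements would also match.
find_all(), as the name suggests, extracts every match. It searches the document from top to bottom, all the way to the end, collects every element that satisfies the conditions, and returns them together.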
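The difference between the two methods can be seen on a small inline document. This is a sketch with made-up HTML, not a real page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with three <div> elements.
html_text = """
<div class="books">First</div>
<div class="books">Second</div>
<div class="other">Third</div>
"""
soup = BeautifulSoup(html_text, "html.parser")

# find() stops at the first matching element.
first_div = soup.find("div")

# find_all() keeps going and collects every matching element.
all_divs = soup.find_all("div")

print(first_div.text)
print(len(all_divs))
```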

find() and find_all()

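find()	find_all()
Purpose: extract the first element that matches	Purpose: extract all elements that match
Usage: BeautifulSoup object.find(tag, attribute)	Usage: BeautifulSoup object.find_all(tag, attribute)
Example: soup.find('div', class_='books')	Example: soup.find_all('div', class_='books')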

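Arguments in the parentheses: the tag and the attribute can be used on their own or together, depending on what needs to be extracted from the page. If one argument is enough to locate the content, use one; if both are needed to locate it precisely, use both.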
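The following sketch shows the three ways of passing arguments: tag only, attribute only, and both together. The HTML and the class names are invented for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mixing tags and class values.
html_text = """
<div class="books">A div with class books</div>
<p class="books">A paragraph with class books</p>
<div class="news">A div with class news</div>
"""
soup = BeautifulSoup(html_text, "html.parser")

# Tag only: every <div>, whatever its class.
by_tag = soup.find_all("div")

# Attribute only: every element with class "books", whatever its tag.
by_class = soup.find_all(class_="books")

# Tag and attribute together: only <div> elements with class "books".
by_both = soup.find_all("div", class_="books")

print(len(by_tag), len(by_class), len(by_both))
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.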

import requests
from bs4 import BeautifulSoup
res = requests.get(url)  # url is the address of the page to fetch
soup = BeautifulSoup(res.text, 'html.parser')
# Check the type of soup
print(type(soup))  # prints <class 'bs4.BeautifulSoup'>
item = soup.find('div')
# Print the type of item
print(type(item))  # prints <class 'bs4.element.Tag'> when a div exists
# Print item
print(item)  # the content of the first div block in the page

import requests
from bs4 import BeautifulSoup
res = requests.get(url)  # url is the address of the page to fetch
soup = BeautifulSoup(res.text, 'html.parser')
# Check the type of soup
print(type(soup))  # prints <class 'bs4.BeautifulSoup'>
items = soup.find_all('div')
# Print the type of items
print(type(items))  # prints <class 'bs4.element.ResultSet'>
# Print items
print(items)  # a list of all div blocks in the page
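Two details are worth checking in practice: find() returns None when nothing matches, and the ResultSet returned by find_all() can be used like a list (length, indexing, iteration). A small sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical document that contains <p> tags but no <div>.
soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

# find() returns None when there is no match.
missing = soup.find("div")
print(missing)

# find_all() returns a list-like ResultSet.
paragraphs = soup.find_all("p")
print(len(paragraphs))
print(paragraphs[0].text)

# With no match, find_all() returns an empty ResultSet rather than None.
empty = soup.find_all("div")
print(len(empty))
```

Because of the None case, it is safer to test the result of find() before reading .text or an attribute from it.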

Tag objects

Tag.find() and Tag.find_all()	Extract Tags nested inside a Tag
Tag.text	Extract the text inside a Tag
Tag['attribute name']	Extract the value of an attribute of a Tag
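The three operations in the table can be tried on a single element. The HTML, the URL and the class names below are placeholders chosen for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one book block containing a heading and a link.
html_text = (
    '<div class="books"><h2>Fiction</h2>'
    '<a class="title" href="https://example.com/book/1">Sample Book</a></div>'
)
soup = BeautifulSoup(html_text, "html.parser")

book = soup.find("div", class_="books")

# Tag.find(): search inside this Tag only.
link = book.find("a")

# Tag.text: the text contained in the Tag.
link_text = link.text

# Tag['attribute name']: the attribute value; raises KeyError if absent.
link_href = link["href"]

# Tag.get() returns None instead of raising when the attribute is absent.
link_id = link.get("id")

print(link_text, link_href, link_id)
```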

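import requests
from bs4 import BeautifulSoup
res = requests.get(url)  # url is the address of the page to fetch
soup = BeautifulSoup(res.text, 'html.parser')
# Check the type of soup
print(type(soup))
# Collect every element whose class is 'books'
items = soup.find_all(class_='books')
# Print the type of items
print(type(items))

# Process each matched element in turn
for item in items:
    kind = item.find('h2')
    title = item.find(class_="title")
    brief = item.find(class_="info")
    # Print the extracted data (title['href'] assumes the title element has an href attribute)
    print(kind.text, '\n', title['href'], '\n', title.text, '\n', brief.text)
    # Print the types of the extracted objects
    print(type(kind), type(title), type(brief))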
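The same loop can be tried without a network connection by feeding it an inline document. The sketch below assumes a page structure like the one the loop expects (a block with class "books" containing an h2, a link with class "title" and a paragraph with class "info"); the HTML itself is invented, and the None check is an addition that skips incomplete blocks:

```python
from bs4 import BeautifulSoup

# Hypothetical page content with two book blocks.
html_text = """
<div class="books">
  <h2>Science</h2>
  <a class="title" href="https://example.com/a">Book A</a>
  <p class="info">A short description of book A.</p>
</div>
<div class="books">
  <h2>History</h2>
  <a class="title" href="https://example.com/b">Book B</a>
  <p class="info">A short description of book B.</p>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")

records = []
for item in soup.find_all(class_="books"):
    kind = item.find("h2")
    title = item.find(class_="title")
    brief = item.find(class_="info")
    # Skip blocks that lack any of the expected elements.
    if kind is None or title is None or brief is None:
        continue
    records.append({
        "kind": kind.text.strip(),
        "link": title.get("href"),
        "title": title.text.strip(),
        "brief": brief.text.strip(),
    })

for record in records:
    print(record)
```

Collecting the results into a list of dictionaries, rather than printing them directly, makes it easier to save or process them later.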
