BeautifulSoup - scraping a forum page

2024-05-27

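I am trying to scrape a forum discussion and export it as a csv file with columns such as "thread title", "user", and "post", where the last one is the actual forum post written by each person.

I am a complete beginner with Python and BeautifulSoup, so I am having a really hard time with this!

My current problem is that all the text in the csv file is split into one character per row. Can anyone out there help me?

Here is the code I have been using: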

from bs4 import BeautifulSoup
import csv
import urllib2

f = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")

soup = BeautifulSoup(f)

b = soup.get_text().encode("utf-8").strip() #the posts contain non-ascii words, so I had to do this

writer = csv.writer(open('silkroad.csv', 'w'))
writer.writerows(b)

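Ok, here we go. I'm not quite sure what I'm helping you do here, but hopefully you have a good reason to be analyzing Silk Road posts.

You have a few issues here, the biggest being that you aren't parsing the data at all. What you are essentially doing with .get_text() is going to the page, highlighting the whole thing, and then copying and pasting the whole thing into a csv file. On top of that, writerows() expects a sequence of rows, so when it is handed one long string it treats every character as a separate row, which is why your file has one character per line.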
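
The one-character-per-row symptom can be reproduced in isolation. The snippet below is a minimal sketch written for Python 3 (the code in this thread is Python 2); the string "abc" is only a stand-in for the page text, and the in-memory buffers replace a real file so the result can be inspected directly.

```python
import csv
import io

text = "abc"  # stands in for the page text returned by get_text()

# writerows() expects an iterable of rows; iterating a string yields
# single characters, so each character becomes its own row.
wrong = io.StringIO()
csv.writer(wrong).writerows(text)

# writerow() with a list writes one row whose single field is the whole string.
right = io.StringIO()
csv.writer(right).writerow([text])

print(wrong.getvalue().splitlines())  # ['a', 'b', 'c']
print(right.getvalue().splitlines())  # ['abc']
```

Wrapping the text in a list fixes the row splitting, but it still leaves the whole page in a single cell, which is why the data needs to be parsed into fields first.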

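So here is what you should try to do:

1. Read the page source
2. Use soup to break it into the parts you want
3. Save the parts (author, date, time, post, etc.) in parallel arrays
4. Write the data to the csv file row by row

I wrote some code to show you what that looks like, and it should do the job: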

from bs4 import BeautifulSoup
import csv
import urllib2

# get page source and create a BeautifulSoup object based on it
print "Reading page..."
page = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")
soup = BeautifulSoup(page)

# if you look at the HTML all the titles, dates, 
# and authors are stored inside of <dt ...> tags
metaData = soup.find_all("dt")

# likewise the post data is stored
# under <dd ...>
postData = soup.find_all("dd")

# define where we will store info
titles = []
authors = []
times = []
posts = []

# now we iterate through the metaData and parse it
# into titles, authors, and dates
print "Parsing data..."
for html in metaData:
    text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
    titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
    authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
    times.append(text.split("Post by:")[1].split(" on ", 1)[1].strip()) # get date (taken after "Post by:" so " on " inside a title is ignored)

# now we go through the actual post data and extract it
for post in postData:
    posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

# now we write data to csv file
# ***csv files MUST be opened with the 'b' flag***
csvfile = open('silkroad.csv', 'wb')
writer = csv.writer(csvfile)

# create template
writer.writerow(["Time", "Author", "Title", "Post"])

# iterate through and write all the data
for time, author, title, post in zip(times, authors, titles, posts):
    writer.writerow([time, author, title, post])


# close file
csvfile.close()

# done
print "Operation completed successfully."
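
For readers on Python 3, the string-splitting and row-writing steps can be tried without fetching any page. This is only a sketch under stated assumptions: parse_meta is a hypothetical helper, and the sample strings are invented stand-ins for the text of the <dt> and <dd> tags, following the "Title: ... Post by: ... on ..." layout that the code above relies on.

```python
import csv
import io


def parse_meta(text):
    """Split 'Title: ... Post by: ... on ...' into (title, author, time)."""
    text = text.replace("\n", " ")
    title = text.split("Title:")[1].split("Post by:")[0].strip()
    # Only look for " on " after the "Post by:" marker.
    rest = text.split("Post by:")[1]
    author, _, when = rest.partition(" on ")
    return title, author.strip(), when.strip()


# Invented sample data standing in for the text of <dt> and <dd> tags.
meta_texts = [
    "Title: Shipping question\nPost by: alice on May 01, 2013, 10:15 am",
    "Title: Re: Shipping question\nPost by: bob on May 01, 2013, 11:02 am",
]
post_texts = ["How long does delivery take?", "About a week, in my experience."]

# Pair each metadata entry with its post and build one row per post.
rows = []
for meta, post in zip(meta_texts, post_texts):
    title, author, when = parse_meta(meta)
    rows.append([when, author, title, post])

# Write to an in-memory buffer instead of a file so the result can be printed.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Time", "Author", "Title", "Post"])
writer.writerows(rows)
print(buffer.getvalue())
```

Because the date is split off only after the "Post by:" marker, a title that happens to contain " on " does not confuse the parser. The csv module also quotes the date field automatically, since it contains commas.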

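EDIT: Added a solution that reads files from a directory and uses the data from them

Okay, so you already have your HTML files in a directory. You need to get a list of the files in the directory, iterate through them, and append the data from each file to the csv file.

This is the basic logic of our new program.

If we had a function called processData() that took a file path as an argument and appended the data from that file to the csv file, here is what it would look like: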

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter
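
The same header-once-then-append loop can be sketched for Python 3, where the csv file is opened in text mode with newline="" rather than with the 'b' flag. Several things here are assumptions made for illustration: the helper names append_rows, process_directory and fake_process are hypothetical, the parsing step is passed in as a function that returns rows (instead of writing to the csv itself, as processData() does), and only files ending in .html are processed.

```python
import csv
import os
import tempfile


def append_rows(csv_path, rows):
    """Append rows to the csv file (text mode with newline='' in Python 3)."""
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def process_directory(html_dir, csv_path, process_file):
    """Write the header once, then append the rows from every .html file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(["Time", "Author", "Title", "Post"])
    names = sorted(n for n in os.listdir(html_dir) if n.endswith(".html"))
    for count, name in enumerate(names, start=1):
        path = os.path.join(html_dir, name)
        append_rows(csv_path, process_file(path))
        print("Processed '%s' (%d/%d)..." % (path, count, len(names)))
    return len(names)


# Hypothetical placeholder for the real parsing step: one fake row per file.
def fake_process(path):
    return [["time", "author", os.path.basename(path), "post"]]


# Demo in a throwaway directory with two HTML files and one unrelated file.
work_dir = tempfile.mkdtemp()
for name in ("a.html", "b.html", "notes.txt"):
    open(os.path.join(work_dir, name), "w").close()

out_path = os.path.join(work_dir, "out.csv")
processed = process_directory(work_dir, out_path, fake_process)

# Read the result back to check what was written.
with open(out_path, newline="", encoding="utf-8") as handle:
    written = list(csv.reader(handle))
```

Filtering on the file extension keeps stray files, including the csv itself if it lives in the same directory, from being fed to the parser.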

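It just so happens that our processData() function is more or less what we did before, with a few changes.

So this is very similar to our last program, with just a few small changes:

- We write the column headers first
- After that, we open the csv with the 'ab' flag to append
- We import os to get the list of files

Here is what that looks like: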

from bs4 import BeautifulSoup
import csv
import urllib2
import os # added this import to process files/dirs

# ** define our data processing function
def processData( pageFile ):
    ''' take the data from an html file and append to our csv file '''
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page)

    # if you look at the HTML all the titles, dates, 
    # and authors are stored inside of <dt ...> tags
    metaData = soup.find_all("dt")

    # likewise the post data is stored
    # under <dd ...>
    postData = soup.find_all("dd")

    # define where we will store info
    titles = []
    authors = []
    times = []
    posts = []

    # now we iterate through the metaData and parse it
    # into titles, authors, and dates
    for html in metaData:
        text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
        titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
        authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
        times.append(text.split("Post by:")[1].split(" on ", 1)[1].strip()) # get date (taken after "Post by:" so " on " inside a title is ignored)

    # now we go through the actual post data and extract it
    for post in postData:
        posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

    # now we write data to csv file
    # ***csv files MUST be opened with the 'b' flag***
    csvfile = open('silkroad.csv', 'ab')
    writer = csv.writer(csvfile)

    # iterate through and write all the data
    for time, author, title, post in zip(times, authors, titles, posts):
        writer.writerow([time, author, title, post])

    # close file
    csvfile.close()
# ** start our process of going through files

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter


python

beautifulsoup

screenscraping
