How to get the text of a specific section through the Wikipedia API

2024-01-06

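I want to extract only a specific section from a Wikipedia page.

Example: I want to extract the text of the "Parts" section of the Wikipedia article "House".

https://en.wikipedia.org/wiki/House

The resulting text would be:

Many houses have several large rooms ..... sections of the home (including in more recent eras a garage).

We can get the whole text of an article like this:

But how do I get the text of a specific section?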

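Do you want the plain wikitext or the parser-generated HTML?

The examples below give you the "Layout" section (the third section of the article; you can use any other section index as well).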

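To retrieve the parsed HTML of a specific section, use the parse API (action=parse), either through the API sandbox or as a regular API request outside of the sandbox.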
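As a minimal sketch of such a request, the snippet below only assembles the URL; it does not contact the server. The helper name `build_section_html_url` is made up for illustration, and section index 3 is taken from the "Layout" example above.

```python
from urllib.parse import urlencode

# Public endpoint of the English Wikipedia Action API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_section_html_url(page: str, section: int) -> str:
    """Return a parse-API URL asking for the rendered HTML of one section."""
    params = {
        "action": "parse",
        "format": "json",
        "page": page,
        "prop": "text",      # parser-generated HTML
        "section": section,  # section index, e.g. 3 for "Layout"
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_section_html_url("House", 3))
# https://en.wikipedia.org/w/api.php?action=parse&format=json&page=House&prop=text&section=3
```

Opening the printed URL in a browser should return JSON whose `parse.text` entry holds the HTML of that section.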

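If you want the wikitext of a specific section, simply use the wikitext property instead of the text property (prop=wikitext rather than prop=text).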
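Once the response has been decoded from JSON, the wikitext can be read out of it roughly as follows. This is a sketch, not a definitive client: `extract_wikitext` is a hypothetical helper, the sample dictionary is an abbreviated stand-in for a real response, and the plain-string branch assumes the newer `formatversion=2` output shape.

```python
def extract_wikitext(response: dict) -> str:
    """Pull the section wikitext out of a decoded action=parse response."""
    content = response["parse"]["wikitext"]
    # Legacy JSON format: the content sits under a "*" key.
    if isinstance(content, dict):
        return content["*"]
    # Assumption: with formatversion=2 the value is already a plain string.
    return content


# Abbreviated stand-in for a real response (not the full section text).
sample = {
    "parse": {
        "title": "House",
        "pageid": 13590,
        "wikitext": {"*": "=== Layout ===\nIdeally, [[architect]]s of houses design ..."},
    }
}

print(extract_wikitext(sample).splitlines()[0])  # === Layout ===
```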

To find out which section has which index, query this information with the "sections" property (prop=sections), without passing any section index.
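A sketch of that lookup using only the Python standard library is shown below. The names `sections_params` and `fetch_json` and the `USER_AGENT` string are placeholders chosen for this example; replace the user agent with something that identifies your own client.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
# Placeholder value: identify your own tool and contact address here.
USER_AGENT = "section-fetch-example/0.1 (contact: you@example.org)"


def sections_params(page: str) -> dict:
    """Parameters that list all sections of a page (no section index given)."""
    return {"action": "parse", "format": "json", "page": page, "prop": "sections"}


def fetch_json(params: dict) -> dict:
    """Send a GET request to the API and decode the JSON reply."""
    request = Request(
        f"{API_ENDPOINT}?{urlencode(params)}",
        headers={"User-Agent": USER_AGENT},
    )
    with urlopen(request, timeout=10) as response:
        return json.load(response)
```

With network access, `fetch_json(sections_params("House"))["parse"]["sections"]` should yield the list of section entries, each carrying a `line` (heading text) and an `index`.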

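So, as a full example of retrieving the text of the Layout section using only the API, you would:

Retrieve the sections of the article:

Response: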

{
    "parse": {
        "title": "House",
        "pageid": 13590,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Etymology",
                "number": "1",
                "index": "1",
                "fromtitle": "House",
                "byteoffset": 3549,
                "anchor": "Etymology"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Elements",
                "number": "2",
                "index": "2",
                "fromtitle": "House",
                "byteoffset": 4960,
                "anchor": "Elements"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "Layout",
                "number": "2.1",
                "index": "3",
                "fromtitle": "House",
                "byteoffset": 4976,
                "anchor": "Layout"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "Parts",
                "number": "2.2",
                "index": "4",
                "fromtitle": "House",
                "byteoffset": 6432,
                "anchor": "Parts"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "History of the interior",
                "number": "2.3",
                "index": "5",
                "fromtitle": "House",
                "byteoffset": 7539,
                "anchor": "History_of_the_interior"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "Communal rooms",
                "number": "2.3.1",
                "index": "6",
                "fromtitle": "House",
                "byteoffset": 8786,
                "anchor": "Communal_rooms"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "Interconnecting rooms",
                "number": "2.3.2",
                "index": "7",
                "fromtitle": "House",
                "byteoffset": 9736,
                "anchor": "Interconnecting_rooms"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "Corridor",
                "number": "2.3.3",
                "index": "8",
                "fromtitle": "House",
                "byteoffset": 11126,
                "anchor": "Corridor"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "Employment-free house",
                "number": "2.3.4",
                "index": "9",
                "fromtitle": "House",
                "byteoffset": 13092,
                "anchor": "Employment-free_house"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "Work location, technology and doctors",
                "number": "2.4",
                "index": "10",
                "fromtitle": "House",
                "byteoffset": 15969,
                "anchor": "Work_location,_technology_and_doctors"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "Technology and privacy",
                "number": "2.4.1",
                "index": "11",
                "fromtitle": "House",
                "byteoffset": 17291,
                "anchor": "Technology_and_privacy"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Construction",
                "number": "3",
                "index": "12",
                "fromtitle": "House",
                "byteoffset": 18782,
                "anchor": "Construction"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "Energy efficiency",
                "number": "3.1",
                "index": "13",
                "fromtitle": "House",
                "byteoffset": 21899,
                "anchor": "Energy_efficiency"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "Earthquake protection",
                "number": "3.2",
                "index": "14",
                "fromtitle": "House",
                "byteoffset": 23057,
                "anchor": "Earthquake_protection"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Found materials",
                "number": "4",
                "index": "15",
                "fromtitle": "House",
                "byteoffset": 25172,
                "anchor": "Found_materials"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Legal issues",
                "number": "5",
                "index": "16",
                "fromtitle": "House",
                "byteoffset": 26235,
                "anchor": "Legal_issues"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "United Kingdom",
                "number": "5.1",
                "index": "17",
                "fromtitle": "House",
                "byteoffset": 26644,
                "anchor": "United_Kingdom"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Identifying houses",
                "number": "6",
                "index": "18",
                "fromtitle": "House",
                "byteoffset": 26922,
                "anchor": "Identifying_houses"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Animal houses",
                "number": "7",
                "index": "19",
                "fromtitle": "House",
                "byteoffset": 27397,
                "anchor": "Animal_houses"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Houses and symbolism",
                "number": "8",
                "index": "20",
                "fromtitle": "House",
                "byteoffset": 27826,
                "anchor": "Houses_and_symbolism"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "See also",
                "number": "9",
                "index": "21",
                "fromtitle": "House",
                "byteoffset": 28620,
                "anchor": "See_also"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "References",
                "number": "10",
                "index": "22",
                "fromtitle": "House",
                "byteoffset": 29690,
                "anchor": "References"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "External links",
                "number": "11",
                "index": "23",
                "fromtitle": "House",
                "byteoffset": 29720,
                "anchor": "External_links"
            }
        ]
    }
}

Iterate over the result, find the section you want, and retrieve its index.
Use the index in the next API request to get the section content:
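These two steps can be sketched as follows. The function names are illustrative only, and the `sections` list is a shortened copy of three entries from the response above, reduced to the two keys the lookup needs.

```python
from typing import Optional
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def find_section_index(sections: list, title: str) -> Optional[str]:
    """Return the "index" of the first section whose heading equals title."""
    for section in sections:
        if section["line"] == title:
            return section["index"]
    return None  # no section with that heading


def section_wikitext_url(page: str, index: str) -> str:
    """Build the follow-up request for the wikitext of one section."""
    params = {
        "action": "parse",
        "format": "json",
        "page": page,
        "prop": "wikitext",
        "section": index,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# Shortened copy of three entries from the sections response.
sections = [
    {"line": "Elements", "index": "2"},
    {"line": "Layout", "index": "3"},
    {"line": "Parts", "index": "4"},
]

index = find_section_index(sections, "Layout")
print(index)  # 3
print(section_wikitext_url("House", index))
```

Matching on `line` compares the heading text exactly; matching on the `anchor` field instead would be an equally reasonable choice.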

Response:

{
    "parse": {
        "title": "House",
        "pageid": 13590,
        "wikitext": {
            "*": "=== Layout ===\n[[File:Gingerbread House Essex CT.jpg|thumb|Example of an early [[Victorian architecture|Victorian]] \"Gingerbread House\" in [[Connecticut]], United States, built in 1855]]\n\nIdeally, [[architect]]s of houses design [[room]]s to meet the needs of the people who will live in the house. [[Feng shui]], originally a [[China|Chinese]] method of moving houses according to such factors as rain and micro-climates, has recently expanded its scope to address the design of interior spaces, with a view to promoting harmonious effects on the people living inside the house, although no actual effect has ever been demonstrated. Feng shui can also mean the \"aura\" in or around a dwelling, making it comparable to the [[real estate|real-estate]] sales concept of \"indoor-outdoor flow\".\n\nThe [[square footage]] of a house in the United States reports the area of \"living space\", excluding the garage and other non-living spaces. The \"square metres\" figure of a house in Europe <!-- including Malta ? --> reports the area of the walls enclosing the home, and thus includes any attached garage and non-living spaces.<ref>{{Cite book|title=Land Management: Challenges and Strategies (First Edition)|last=Iyyer|first=Chaitanya|publisher=Global India Publications Pvt Ltd|year=2009|isbn=978-9380228488|location=|pages=}}</ref>{{Citation needed|date=February 2007}} The number of floors or levels making up the house can affect the square footage of a home."
        }
    }
}
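Putting both requests together, the whole lookup might be wrapped in one function. This is a sketch under stated assumptions: `get_section_wikitext` is a hypothetical name, the `fetch` argument stands for any callable that sends the parameters to the API and returns the decoded JSON, and `fake_fetch` is a canned stand-in so the example runs without network access. It assumes the legacy response shape with the `"*"` key shown above.

```python
from typing import Callable


def get_section_wikitext(page: str, title: str, fetch: Callable[[dict], dict]) -> str:
    """Look up a section by heading, then request that section's wikitext."""
    listing = fetch(
        {"action": "parse", "format": "json", "page": page, "prop": "sections"}
    )
    for section in listing["parse"]["sections"]:
        if section["line"] == title:
            index = section["index"]
            break
    else:
        raise KeyError(f"no section titled {title!r} on page {page!r}")

    content = fetch(
        {
            "action": "parse",
            "format": "json",
            "page": page,
            "prop": "wikitext",
            "section": index,
        }
    )
    return content["parse"]["wikitext"]["*"]


def fake_fetch(params: dict) -> dict:
    """Canned replies imitating the API, for demonstration only."""
    if params["prop"] == "sections":
        return {
            "parse": {
                "sections": [
                    {"line": "Layout", "index": "3"},
                    {"line": "Parts", "index": "4"},
                ]
            }
        }
    return {"parse": {"wikitext": {"*": f"(wikitext of section {params['section']})"}}}


print(get_section_wikitext("House", "Parts", fake_fetch))  # (wikitext of section 4)
```

Passing the fetcher in as an argument keeps the lookup logic separate from the HTTP layer, so a real HTTP client can be swapped in for `fake_fetch` without changing the function.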

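Background: The notion of sections within a page is not integrated into revisions. A revision is "just" the content of the whole page plus additional metadata (for example in multiple other slots), whereas a section is part of the content (which is only one slot of a revision). That is why the revisions query API only gives you the whole text. The page has to be parsed to know what a section is, because sections are a wikitext concept, so the parser is involved.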


mediawiki

wikipedia

wikipediaapi
