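I've seen a few similar questions on SO about this topic, but they either seem to be worded incorrectly (example) or are about a different language (example).
In my scenario, I consider anything surrounded by whitespace to be a word. Emojis, numbers, strings of letters that aren't really words: I don't care. I just want some context around the string that was found, without having to read the whole file to decide whether it is a valid match.
I tried the following, but it takes a while to run on a long text file: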
text <- "He served both as Attorney General and Lord Chancellor of England. After his death, he remained extremely influential through his works, especially as philosophical advocate and practitioner of the scientific method during the scientific revolution. Bacon has been called the father of empiricism.[6] His works argued for the possibility of scientific knowledge based only upon inductive and careful observation of events in nature. Most importantly, he argued this could be achieved by use of a skeptical and methodical approach whereby scientists aim to avoid misleading themselves. While his own practical ideas about such a method, the Baconian method, did not have a long lasting influence, the general idea of the importance and possibility of a skeptical methodology makes Bacon the father of scientific method. This marked a new turn in the rhetorical and theoretical framework for science, the practical details of which are still central in debates about science and methodology today. Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621;[3][b] as he died without heirs, both titles became extinct upon his death. Bacon died of pneumonia in 1626, with one account by John Aubrey stating he contracted the condition while studying the effects of freezing on the preservation of meat."
stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}")
I assume there's a faster or more efficient way to do this?
Try this:
stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}")
# alternately, if you like " " more than \\s:
# stringr::str_extract(text, "(?:[^ ]+ ){3}Verulam(?: [^ ]+){3}")
#[1] "and created Baron Verulam in 1618[4] and"
Change the number inside the {} to suit your needs.
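If the amount of context needs to vary, the pattern can be built from a parameter rather than edited by hand. The following is only a sketch: `context_pattern()` and `extract_context()` are hypothetical helper names, the sample text is a shortened excerpt, and it uses base R's `regexpr()` with `perl = TRUE` so that it runs without extra packages. The same pattern string should also work with `stringr::str_extract()`.

```r
# Hypothetical helper: build the pattern for `n` words of context on each side.
# \\Q...\\E quotes the keyword literally (assumes the keyword contains no "\\E").
context_pattern <- function(keyword, n = 3) {
  sprintf("(?:\\S+\\s){%d}\\Q%s\\E(?:\\s\\S+){%d}", n, keyword, n)
}

# Hypothetical helper: return the first match, or NA if there is none.
extract_context <- function(text, keyword, n = 3) {
  m <- regexpr(context_pattern(keyword, n), text, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(text, m)
}

# Shortened excerpt of the question's text, for illustration only
sample_text <- "Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621"

extract_context(sample_text, "Verulam", 3)
#[1] "and created Baron Verulam in 1618[4] and"
```

Note that the pattern requires exactly `n` words on each side, so a keyword that sits fewer than `n` words from the start or end of the text returns NA.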
You can also use non-capturing groups, (?:), although I'm not yet sure whether that improves speed.
stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}")
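One way to check whether the non-capturing version is actually faster is to time both patterns on a long input. The sketch below is an assumption-laden illustration rather than a proper benchmark: the filler text is made up, `first_match()` is a hypothetical wrapper, and it uses base R's PCRE engine (`perl = TRUE`), so timings may differ from stringr's ICU engine.

```r
# Made-up long input: repeated filler, with the keyword only near the end
filler <- paste(rep("the quick brown fox jumps over the lazy dog", 2000), collapse = " ")
long_text <- paste(filler, "and created Baron Verulam in 1618[4] and Viscount")

capturing     <- "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}"
non_capturing <- "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}"

# Hypothetical wrapper: first match of `pattern` in `text` using PCRE
first_match <- function(pattern, text) {
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

# Both patterns should extract the same context
stopifnot(identical(first_match(capturing, long_text),
                    first_match(non_capturing, long_text)))

# Time 20 runs of each pattern; compare the "elapsed" values
system.time(for (i in 1:20) first_match(capturing, long_text))
system.time(for (i in 1:20) first_match(non_capturing, long_text))
```

Timings depend on the machine, the regex engine, and the input, so it is worth running this on your own file before drawing a conclusion.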