This is a UTF-8-safe solution that works not only with properly formatted documents but also with document fragments.
The mb_convert_encoding call is needed because loadHtml() seems to have a bug with UTF-8 encoding (see here https://stackoverflow.com/questions/3872423/php-problem-with-russian-language/3872663#3872663 and here https://stackoverflow.com/questions/2236889/why-does-dom-change-encoding/2238149#2238149).
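A side note on that conversion: the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2. The following is only a minimal sketch of one possible substitute, assuming mb_encode_numericentity is acceptable for your input. The helper name toNumericEntities is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): turn every non-ASCII
// character into a numeric entity so loadHTML() only ever sees plain ASCII.
function toNumericEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask]
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$dom = new DOMDocument();
$dom->loadHTML(toNumericEntities('<p>replace itŐŰ</p>'));

// The DOM stores text as UTF-8, so the entities are decoded again on load
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

ASCII markup such as tags and existing entities passes through untouched, because the map only covers code points from 0x80 upwards.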
The mb_substr call trims the body tag from the output, so you get back your original content without any additional markup.
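The offsets 6 and -7 passed to mb_substr in this answer are simply the lengths of the opening and closing body tags. A small sketch of that arithmetic, using a hand-written string as a stand-in for the real saveXML() output (the variable names are illustrative):

```php
<?php
// Stand-in for what saveXML() returns for the body element
$wrapped = '<body><p>replace itŐŰ</p></body>';

$open  = mb_strlen('<body>', 'UTF-8');   // 6 characters
$close = mb_strlen('</body>', 'UTF-8');  // 7 characters

// Skip the opening tag and stop right before the closing tag
$inner = mb_substr($wrapped, $open, -$close, 'UTF-8');
echo $inner, PHP_EOL; // <p>replace itŐŰ</p>
```

The fixed offsets only hold while the body element carries no attributes, which is the case when loadHtml() generates the body around a fragment.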
<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';
$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
// visit every text node that is not nested inside a link
foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
    $replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
    // appendXML() parses the string, so the replacement may contain markup
    $newNode = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
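One caveat with the loop above: appendXML() expects well-formed XML, so a text node whose decoded content contains a bare & or < would be rejected. Here is a minimal sketch of one way around that, assuming it is acceptable to escape the text before the replacement is applied. The function name replaceOutsideLinks and its parameters are hypothetical, and mb_encode_numericentity is used here as a stand-in for the 'HTML-ENTITIES' conversion.

```php
<?php
// Hypothetical wrapper around the approach above (names are illustrative)
function replaceOutsideLinks(string $html, string $search, string $replacement): string
{
    $dom = new DOMDocument();
    // Stand-in for the 'HTML-ENTITIES' conversion: non-ASCII becomes numeric entities
    $dom->loadHTML(mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8'));
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        // Re-escape &, < and > so the text is valid input for appendXML()
        $escaped  = htmlspecialchars($node->nodeValue, ENT_NOQUOTES, 'UTF-8');
        $replaced = str_ireplace($search, $replacement, $escaped);
        if ($replaced === $escaped) {
            continue; // nothing matched, keep the node as it is
        }
        $fragment = $dom->createDocumentFragment();
        $fragment->appendXML($replaced);
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Same trimming of the body tag as in the code above
    return mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, 'UTF-8');
}

$out = replaceOutsideLinks(
    '<p>Tom &amp; Jerry: match this text, not <a href="/">match this text</a></p>',
    'match this text',
    'MATCH'
);
echo $out, PHP_EOL;
```

Only the text is escaped, not the replacement, so the replacement can still carry markup as in the original code.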
References:
1. Find and replace keywords by hyperlinks in an html fragment, via php dom https://stackoverflow.com/questions/3151064/find-and-replace-keywords-by-hyperlinks-in-an-html-fragment-via-php-dom/3151554#3151554
2. Regex / DOMDocument - match and replace text not in a link https://stackoverflow.com/questions/4044812/regex-domdocument-match-and-replace-text-not-in-a-link/4156573#4156573
3. php problem with russian language https://stackoverflow.com/questions/3872423/php-problem-with-russian-language/3872663#3872663
4. Why does DOM change encoding? https://stackoverflow.com/questions/2236889/why-does-dom-change-encoding/2238149#2238149
I read dozens of answers on this subject, so I am sorry if I forgot somebody (please comment, and in that case I will add your answer as well).
Thanks to Gordon and stillstanding for commenting on my other answer: https://stackoverflow.com/questions/4044812/regex-domdocument-match-and-replace-text-not-in-a-link/4192155#4192155