Selenium driver.page_source() extracts only part of the HTML DOM

2024-05-22

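I have a web page. When I right-click it and choose View Page Source, I get: SECTION-A

But when I right-click it and choose Inspect, I get a much longer output. I tried getting the page source with JS, but I hit the same problem: the output is still SECTION-A... How can I solve this?

Note: I'm looking for a general solution, not one specific to this website.

What I've tried: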

time.sleep(3)
html1 = driver.execute_script("return document.documentElement.outerHTML")
html2 = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
html3 = driver.page_source  # page_source is a property in the Python bindings, not a method

I'm using Chrome. Is there any flag or solution for this issue?


SECTION-A:

<head><script language="javascript" type="text/javascript">
var framePara = new Array(
0,
"main.htm",
1,
0,0 );
</script>
<script language="javascript" type="text/javascript">
var indexPara = new Array(
"192.168.0.1",
1742822853,
"tplinklogin.net",
0,0 );
</script>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<title>TL-WR845N</title>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="wed, 26 Feb 1997 08:21:57 GMT">
<link href="../dynaform/css_main.css" rel="stylesheet" type="text/css">
<script language="javascript" src="../dynaform/common.js" type="text/javascript"></script>
<script language="javascript" type="text/javascript"><!--
//--></script>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<script language="javascript" src="../localiztion/char_set.js" type="text/javascript">
</script><script type="text/javascript">
var startUrl = "";
var startHelpUrl = "";
if(framePara[0] == 1)
{
    startUrl = "../userRpm/WzdStartRpm.htm";
    startHelpUrl = "../help/WzdStartHelpRpm.htm";
}
else
{
    startUrl = "../userRpm/StatusRpm.htm";
    /*changed by ZQQ, 2015.7.25, corresponding to function StatusRpmHtm*/
    if (framePara[2] == 0x08 || framePara[2] == 0x07 || framePara[2] == 0x06 || framePara[2] == 0x03)
    {
        startHelpUrl = "../help/StatusHelpRpm_AP.htm";
    }
    else if (framePara[2] == 0x04)
    {
        startHelpUrl = "../help/StatusHelpRpm_APC.htm";
    }
    else
    {
        startHelpUrl = "../help/StatusHelpRpm.htm";
    }
}
document.write("<FRAMESET rows=90,*>");
document.write("<FRAME name=topFrame marginWidth=0 marginHeight=0 src=\"../frames/top.htm\" noResize scrolling=no frameSpacing=0 frameBorder=0 id=\"topFrame\">");
document.write("<FRAMESET cols=182,55%,*>");
document.write("<FRAME name=bottomLeftFrame marginWidth=0 marginHeight=0 src=\"../userRpm/MenuRpm.htm\" noResize frameBorder=1 scrolling=auto style=\"overflow-x:hidden\" id=\"bottomLeftFrame\">");
document.write("<FRAME name=mainFrame marginWidth=0 marginHeight=0 src=" +startUrl+" frameBorder=1 id=\"mainFrame\">");
document.write("<FRAME name=helpFrame marginWidth=0 marginHeight=0 src="+startHelpUrl+" frameBorder=1 id=\"helpFrame\">");
document.write("</FRAMESET>");
</script></head>

        
    
<frameset rows="90,*"><frame name="topFrame" marginwidth="0" marginheight="0" src="../frames/top.htm" noresize="" scrolling="no" framespacing="0" frameborder="0" id="topFrame"><frameset cols="182,55%,*"><frame name="bottomLeftFrame" marginwidth="0" marginheight="0" src="../userRpm/MenuRpm.htm" noresize="" frameborder="1" scrolling="auto" style="overflow-x:hidden" id="bottomLeftFrame"><frame name="mainFrame" marginwidth="0" marginheight="0" src="../userRpm/StatusRpm.htm" frameborder="1" id="mainFrame"><frame name="helpFrame" marginwidth="0" marginheight="0" src="../help/StatusHelpRpm.htm" frameborder="1" id="helpFrame"></frameset>

<noframes>
    <body id="t_noFrame">Please upgrade to a version 4 or higher browser so that you can use this setup tool.</body>
</noframes>


</frameset>

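The WebElements you see through View Source and through the Inspector tool can differ significantly. These are two different browser features, both of which let us look into the DOM Tree https://javascript.info/dom-nodes. The core differences between them are:

  • View Source shows the HTML that was delivered from the AUT (Application under Test) to the browser.
  • Inspect Element is a developer tool, e.g. Chrome DevTools https://developers.google.com/web/tools/chrome-devtools, that shows the state of the HTML DOM https://www.w3schools.com/js/js_htmldom.asp after the browser has applied its error correction and after any JavaScript has manipulated the DOM. In short, View Source shows you the JavaScript as delivered but not the HTML that the JavaScript generates. HTML errors may also appear corrected in the Inspect Element tool.

Hence you see a larger output using Inspect.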
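The difference can be reproduced offline with a small sketch. It uses only Python's standard `html.parser`; the `DELIVERED` and `RENDERED` strings are shortened, made-up stand-ins for the router page in SECTION-A, and `TagCounter` and `count_tags` are hypothetical helper names. The delivered markup mentions `FRAME` only inside a script string, so a parser finds no frame element until the script has run and written it into the document.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags with a given name; script contents are treated as text."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    counter = TagCounter(tag.lower())
    counter.feed(html)
    counter.close()
    return counter.count


# What the server delivers (View Source): frames exist only as script text.
DELIVERED = """<html><head><script type="text/javascript">
document.write("<FRAMESET rows=90,*>");
document.write("<FRAME name=mainFrame src=main.htm>");
document.write("</FRAMESET>");
</script></head></html>"""

# What the DOM looks like after the script ran (Inspect): real frame elements.
RENDERED = DELIVERED.replace(
    "</head>",
    '</head><frameset rows="90,*"><frame name="mainFrame" src="main.htm"></frameset>',
)

print("frame elements in delivered HTML:", count_tags(DELIVERED, "frame"))
print("frame elements in rendered DOM:", count_tags(RENDERED, "frame"))
```

The same markup yields different element counts depending on whether the script has executed, which is the gap between View Source and Inspect described above.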

You can find a relevant detailed discussion in Get the displayed web elements by viewing source code https://stackoverflow.com/a/71699106/7429447


Solution

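page_source https://www.selenium.dev/selenium/docs/api/py/webdriver_remote/selenium.webdriver.remote.webdriver.html#selenium.webdriver.remote.webdriver.WebDriver.page_source is one of the most effective and proven ways to extract the page source with Selenium. However, there is a catch: you need to induce WebDriverWait https://stackoverflow.com/a/59130336/7429447 for visibility_of_element_located() https://stackoverflow.com/a/50474905/7429447 of a static element within the web page, so that the source is read only after that element has rendered. As an example, to extract the page source of https://example.com, you can induce WebDriverWait for the <h1> tag with innerText Example Domain to be visible, as follows: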

  • Using XPATH:

    driver.get("https://example.com")
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[text()='Example Domain']")))
    print(driver.page_source)  # property access; calling it as page_source() raises TypeError
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
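For a frameset page like the one in SECTION-A, waiting alone may not be enough. This is an observation added here, not something the answer above states: `page_source` serializes only the document of the current browsing context, and each `<frame>` or `<iframe>` holds a separate document. A minimal sketch of one way to gather them, assuming an already-created `driver`, follows. `collect_frame_sources` and `max_depth` are hypothetical names, and the locator strategy is passed as the plain string `"css selector"` (the value of `By.CSS_SELECTOR`) so the helper itself needs no imports.

```python
def collect_frame_sources(driver, depth=0, max_depth=5):
    """Return the source of the current document followed by every nested frame's source."""
    # page_source is a property; it covers only the current browsing context
    sources = [driver.page_source]
    if depth >= max_depth:
        return sources

    # "css selector" is the string value of By.CSS_SELECTOR
    frames = driver.find_elements("css selector", "frame, iframe")
    for frame in frames:
        # Enter the frame so page_source and find_elements refer to its document
        driver.switch_to.frame(frame)
        try:
            sources.extend(collect_frame_sources(driver, depth + 1, max_depth))
        finally:
            # Always step back out, even if reading the frame failed
            driver.switch_to.parent_frame()
    return sources
```

Calling `collect_frame_sources(driver)` after the wait shown above would return a list whose first item is the top-level source and whose remaining items are the frame documents in document order. The depth limit is only a guard against unexpectedly deep nesting.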
    