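A Python implementation for judging page similarity:
In two days I got a small part of this working. Most of the work turned out to be XPath parsing plus string and file handling, and performance still needs improvement. The steps are as follows.
The target pages are written with the Vue.js framework, so the DOM tree is generated dynamically and there are Ajax requests as well. That means the page cannot be fetched directly with a plain requests.get(url), which returns only the initial HTML before any JavaScript has run. I looked at many approaches online, including headless browsers and others I no longer remember. Since this is meant to support testing, I use selenium to log in and obtain the cookies, then add the cookies to the browser session to fetch the dynamic DOM tree.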
import time

import lxml
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://localhost:9527/#/login?redirect=%2Fuser%2Fuser')
# Fill in the username and password (left blank here), then submit the login form.
driver.find_element(By.XPATH, "//*[@id='app']/div/form/div[2]/div/div/input").send_keys('')
driver.find_element(By.XPATH, "//*[@id='app']/div/form/div[3]/div/div/input").send_keys('')
driver.find_element(By.XPATH, "//*[@id='app']/div/form/button").click()
time.sleep(2)  # crude wait for the login request to finish

# Cookies set by the login.
cookie_list = driver.get_cookies()
# Add the admin token cookie to the browser session.
driver.add_cookie(
    {
        'domain': 'localhost',
        'httpOnly': False,
        'name': 'X-Litemall-Admin-Token',
        'path': '/',
        'secure': False,
        'value': '1fpzctbhesj4eqfg5e2k6gkg77m2u220'
    }
)
time.sleep(5)
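The token value above is hard-coded. As a minimal sketch, the value could instead be looked up in the list of dicts that driver.get_cookies() returns, assuming the login sets a cookie named X-Litemall-Admin-Token as in the script above. The helper name find_cookie_value and the sample data (including the sidebarStatus cookie) are made up for illustration.

```python
def find_cookie_value(cookie_list, name):
    """Return the value of the first cookie called `name`, or None if absent."""
    for cookie in cookie_list:
        if cookie.get('name') == name:
            return cookie.get('value')
    return None


# Hypothetical sample shaped like the list of dicts from driver.get_cookies().
sample_cookies = [
    {'name': 'sidebarStatus', 'value': '0', 'domain': 'localhost', 'path': '/'},
    {'name': 'X-Litemall-Admin-Token', 'value': 'example-token',
     'domain': 'localhost', 'path': '/'},
]

token = find_cookie_value(sample_cookies, 'X-Litemall-Admin-Token')
print(token)
```

In the real script the call would presumably be find_cookie_value(cookie_list, 'X-Litemall-Admin-Token'), with a check for None in case the login failed.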
def get_xpath(html):
    # Parse the rendered HTML and write the absolute XPath of every element,
    # one path per line, in document order.
    page = etree.HTML(html)
    tree = page.getroottree()
    with open('D:\\code\\python\\bok-choy-master\\tests\\demo\\pathx', 'w') as f:
        for element in page.xpath('//*'):
            f.write(tree.getpath(element) + '\n')
- Read the XPath of every element back from the file, then pick out all the leaf nodes
def get_leaf_xpath():
    result = []
    with open('D:\\code\\python\\bok-choy-master\\tests\\demo\\pathx', 'r') as f:
        for line in f:
            result.append(line.strip('\n'))
    print(len(result))
    leaf_node_xpath_list = []
    # Paths are in document order, so a node with children is always followed
    # directly by its first child, whose path extends the parent's with '/'.
    for i, one_xpath in enumerate(result):
        two_xpath = result[i + 1] if i + 1 < len(result) else ''
        if not two_xpath.startswith(one_xpath + '/'):
            leaf_node_xpath_list.append(one_xpath)
    return leaf_node_xpath_list
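To see the leaf test on a small input, here is a self-contained sketch that works on an in-memory list instead of the file. It assumes the paths are in document order, as written by get_xpath. The function name leaf_paths and the sample paths are made up for illustration. The comparison is against the parent path plus '/', because a plain substring test would also match unrelated siblings such as /html/body/p and /html/body/pre.

```python
def leaf_paths(paths):
    """Return the paths that have no child path directly after them."""
    leaves = []
    for i, path in enumerate(paths):
        next_path = paths[i + 1] if i + 1 < len(paths) else ''
        # In document order a parent is immediately followed by its first child.
        if not next_path.startswith(path + '/'):
            leaves.append(path)
    return leaves


sample = [
    '/html',
    '/html/body',
    '/html/body/p',
    '/html/body/pre',
    '/html/body/div',
    '/html/body/div/span',
]
print(leaf_paths(sample))
# ['/html/body/p', '/html/body/pre', '/html/body/div/span']
```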
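Reference: [Using lxml in Python to parse HTML and get the XPath of every leaf node on a page](https://blog.csdn.net/Together_CZ/article/details/74015599)
Later on I would like to implement this with the longest common subsequence (LCS) algorithm =_=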
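One possible shape for the LCS idea, as a rough sketch only: treat each page as its list of leaf XPaths, compute the LCS length with the standard dynamic-programming table, and normalize it into a score between 0 and 1. The function names, the sample paths, and the 2 * LCS / (len(a) + len(b)) normalization are assumptions for illustration, not something settled in this post.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # prev[j] is the LCS length of the part of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(paths_a, paths_b):
    """Score in [0, 1]: 1.0 for identical leaf lists, 0.0 for nothing in common."""
    if not paths_a and not paths_b:
        return 1.0
    return 2 * lcs_length(paths_a, paths_b) / (len(paths_a) + len(paths_b))


# Hypothetical leaf paths from two pages that differ in one element.
page_a = ['/html/body/h1', '/html/body/p[1]', '/html/body/p[2]', '/html/body/a']
page_b = ['/html/body/h1', '/html/body/p[1]', '/html/body/ul/li', '/html/body/a']
print(similarity(page_a, page_b))  # 0.75
```

The table costs time proportional to len(a) * len(b), so two pages with roughly 300 leaves each need on the order of 90,000 cell updates, which should be acceptable for a test helper.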
if __name__ == '__main__':
    driver.get('http://localhost:9527/#/user/user')
    html1 = driver.page_source
    get_xpath(html1)  # write the XPath file that get_leaf_xpath() reads
    print(len(get_leaf_xpath()))
    driver.quit()  # end the browser session
In the end I got 301 leaf nodes, out of 649 nodes in total.