Inserting large graph data into Neo4j using py2neo WriteBatch

2024-01-06

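I have a graph represented by the following files:

  • VertexLabel.txt -> each line contains the properties of one vertex.
  • EdgeLabel.txt -> each line contains the properties of one edge.
  • EdgeID.txt -> each line contains 3 separated integers that are indices into the label files: source_index target_index edge_index.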
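
For illustration, the EdgeID.txt layout described above could be read along these lines. This is only a minimal sketch: the delimiter is not specified here, so whitespace is assumed, and parse_edge_ids is a hypothetical helper name rather than anything from py2neo.

```python
def parse_edge_ids(lines):
    """Parse lines of 'source target edge' integers into a list of tuples.

    Assumption: the three integers are separated by whitespace.
    """
    edge_list = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                "line %d: expected 3 integers, got %r" % (line_number, line)
            )
        source_index, target_index, edge_index = (int(f) for f in fields)
        edge_list.append((source_index, target_index, edge_index))
    return edge_list
```

Passing an open file object works as well as a list of strings, since both yield one line at a time.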

There are roughly 44K vertices and 240K edges. I am trying to bulk-insert the graph data using neo4j.WriteBatch (http://nigelsmall.com/py2neo/1.6/batches/).

from py2neo import Graph, neo4j, node, rel

graph_db = Graph()
nodes = {}
batchNodes = {}
edges = {}
edgeList = []

# Read vertex label file into nodes, where node[i] is indexed according to the order the nodes appear in the file.
# Each entry is of type node, e.g. node("FILM", title = "Star Trek"), node("CAST", name = "William Shatner")
...  

# Read edge label file into edges, where edges[i] is indexed according to the order the edges appear in the file.
# Each entry is a tuple (edge_type, edge_task), e.g. ("STAFF", "Director")
...  

# Read edge id file into edgeList
# Each entry is the tuple (source_index, target_index, edge_index), e.g. (1, 4, 8)
...  

# Iterate nodes, store in graph
# Note, store result of batch.create into batchNodes
batch = neo4j.WriteBatch(graph_db)
count = 0
for n in nodes:
    batchNodes[n] = batch.create(nodes[n])
    count += 1

    # Submit every 500 steps
    if count % 500 == 0:
        count = 0
        batch.submit()
        batch = neo4j.WriteBatch(graph_db)

# Submit remaining batch
batch.submit()

# Iterate edgeList, store in graph
batch = neo4j.WriteBatch(graph_db)
count = 0
for i, j, k in edgeList:
    # Lookup reference in batchNodes
    source = batchNodes[i]
    target = batchNodes[j]
    edge = edges[k]
    batch.create(rel(source, edge[0], target, {"task": edge[1]}))
    count += 1

    # Submit every 500 steps
    if count % 500 == 0:
        count = 0
        batch.submit()
        batch = neo4j.WriteBatch(graph_db)

# Submit remaining batch
batch.submit()

I get the following error:

Traceback (most recent call last):
  File "test4.py", line 87, in <module>
    batch.create(rel(source, edge[0], target, {"task": edge[1]}))
  File "C:\Python34\lib\site-packages\py2neo\batch\write.py", line 181, in create
    start_node = self.resolve(entity.start_node)
  File "C:\Python34\lib\site-packages\py2neo\batch\core.py", line 374, in resolve
    return NodePointer(self.find(node))
  File "C:\Python34\lib\site-packages\py2neo\batch\core.py", line 394, in find
    raise ValueError("Job not found in batch")
ValueError: Job not found in batch

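I think batchNodes does not actually hold valid references to the nodes I want to look up when adding relationships (perhaps re-initializing the batch object invalidates the references). In that case, how should I go about this task?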

I am using Neo4j 2.1.7 (Community Edition) and py2neo 2.0.4.
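
One possible way to keep the chunked-batch approach is to store what submit() returns instead of what create() returns, so that later batches refer to created nodes rather than to jobs from a batch that has already been submitted. The sketch below shows only that pattern. It assumes submit() returns one result per job, in job order, which should be checked against the py2neo 2.0 documentation. create_in_chunks and make_batch are hypothetical names.

```python
def create_in_chunks(entities, make_batch, chunk_size=500):
    """Create entities in chunks and return {key: created result}.

    entities:   {key: entity to pass to batch.create()}
    make_batch: zero-argument callable returning a fresh batch object
                that has create() and submit() methods
    """
    created = {}
    keys = list(entities)
    for start in range(0, len(keys), chunk_size):
        chunk_keys = keys[start:start + chunk_size]
        batch = make_batch()
        for key in chunk_keys:
            batch.create(entities[key])
        # Assumption: submit() returns one result per job, in job order.
        results = batch.submit()
        for key, result in zip(chunk_keys, results):
            created[key] = result
    return created
```

With the code from the question this might be called as batchNodes = create_in_chunks(nodes, lambda: neo4j.WriteBatch(graph_db)). The relationship loop could then build rel(batchNodes[i], edge[0], batchNodes[j], {"task": edge[1]}) from the stored nodes.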


For importing CSV-like data, I recommend using LOAD CSV, which is available from Neo4j 2.1 onward.

load csv with headers from "file://...VertexLabel.txt" as row
with row where has(row.name)
create (:Actor {name: row.name})

Similarly, you can load your relationships:

create index on :Actor(name);
create index on :Movie(title);

load csv with headers from "file://...EdgeID.txt" as row
match (a:Actor {name: row.name})
match (m:Movie {title: row.title})
create (a)-[:ACTED_IN]->(m)

From Neo4j 2.2 onward you can also use neo4j-import, a very fast tool for importing CSV data. It also supports ID groups, supplying labels and relationship types in the CSV, and more.
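
As a rough illustration of how the index-based files from the question might be turned into input for neo4j-import, the sketch below writes one node CSV and one relationship CSV. The header fields (:ID, :LABEL, :START_ID, :END_ID, :TYPE) follow the tool's conventions as I understand them. The shape of nodes and the name write_import_files are assumptions made for this example, so check the result against the import guide before relying on it.

```python
import csv
import io


def write_import_files(nodes, edges, edge_list, nodes_out, rels_out):
    """Write neo4j-import style CSVs from index-based graph data.

    nodes:     {index: (label, {property: value})}   (assumed shape)
    edges:     {index: (edge_type, edge_task)}
    edge_list: [(source_index, target_index, edge_index)]
    """
    # Union of property names, so every node row has the same columns.
    property_names = sorted(
        {name for _, props in nodes.values() for name in props}
    )

    node_writer = csv.writer(nodes_out, lineterminator="\n")
    node_writer.writerow(["index:ID", ":LABEL"] + property_names)
    for index in sorted(nodes):
        label, props = nodes[index]
        row = [index, label] + [props.get(name, "") for name in property_names]
        node_writer.writerow(row)

    rel_writer = csv.writer(rels_out, lineterminator="\n")
    rel_writer.writerow([":START_ID", ":END_ID", ":TYPE", "task"])
    for source_index, target_index, edge_index in edge_list:
        edge_type, edge_task = edges[edge_index]
        rel_writer.writerow([source_index, target_index, edge_type, edge_task])


# Tiny in-memory example using the sample values from the question.
nodes = {
    0: ("FILM", {"title": "Star Trek"}),
    1: ("CAST", {"name": "William Shatner"}),
}
edges = {0: ("STAFF", "Director")}
edge_list = [(1, 0, 0)]

nodes_csv, rels_csv = io.StringIO(), io.StringIO()
write_import_files(nodes, edges, edge_list, nodes_csv, rels_csv)
print(nodes_csv.getvalue())
print(rels_csv.getvalue())
```

Using the file index as the import ID means the edge triples can be written out unchanged, without looking up node properties first.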

See: http://neo4j.com/developer/guide-importing-data-and-etl/ and http://neo4j.com/developer/guide-import-csv/
