问题描述
python中用docx库读取word文件,若word文件中包含合并的表格表格
则通过docx读取显示:
file = docx.Document(path)
for table in file.tables:
for row in table.rows:
for cell in row.cells:
print(cell.text)
结果为:
1-1
1-1
1-3
1-4
2-1
2-2
2-3
2-4
3-1
3-1
3-3
2-4
3-1
3-1
4-3
4-3
合并的单元格会重复显示,如1-1会显示两次;
如果在循环中改变cell.text内容,则保存后会重复显示
for table in file.tables:
for row in table.rows:
for cell in row.cells:
cell.text = cell.text + 'test'
file.save(path2)
解决方案
打印cell发现合并的单元格虽然重复但公用内存地址:
<docx.table._Cell at 0xbbeb1d0>,
<docx.table._Cell at 0xbbeb1d0>,
<docx.table._Cell at 0xbbeb630>,
<docx.table._Cell at 0xbbeb400>,
<docx.table._Cell at 0xbbeb9b0>,
<docx.table._Cell at 0xbbeb9e8>,
<docx.table._Cell at 0xbbeb240>,
<docx.table._Cell at 0xbbeb710>,
<docx.table._Cell at 0xbbeb780>,
<docx.table._Cell at 0xbbeb780>,
<docx.table._Cell at 0xbbebac8>,
<docx.table._Cell at 0xbbeba90>,
<docx.table._Cell at 0xbbeba20>,
<docx.table._Cell at 0xbbeba20>,
<docx.table._Cell at 0xbbebba8>,
<docx.table._Cell at 0xbbebbe0>
所以,可先判断cell是否重复再修改cell.text,
代码:
cell_set = []
for table in file.tables:
for row in table.rows:
for cell in row.cells:
if cell not in cell_set:
cell_set.append(cell)
cell.text = cell.text + 'test'
执行结果:
补充:
按行打印时,按行合并的单元格地址相同,按列合并的单元格地址还是不相同,所以还需按列找出重复的单元格,再与按行找到的单元格合并去重,才能解决这个问题:
row_cells, column_cells = [], []
index = []
width, length = len(table.columns), len(table.rows)
k = 0
for row in table.rows:
for cell in row.cells:
if cell not in row_cells:
index.append([k//width, k%width])
row_cells.append(cell)
k += 1
k = 0
for column in table.columns:
for cell in column.cells:
if cell not in column_cells:
column_cells.append(cell)
elif [k%length, k//length] in index:
index.remove([k%length, k//length])
k += 1
# index即为找到的单元格索引
for i in index:
table.rows[i[0]].cells[i[1]].text += 'test'
执行结果:
之所以没有直接按索引搜索cell添加到row_cells和colums_cells中是因为直接按索引查找的cell合并的单元格地址也不相同
虽然方法繁琐,但总算可用,如有好的解决方案欢迎讨论~