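You mention one of my earlier answers, https://stackoverflow.com/a/35987635/1729265, as a PDFBox example that does not work for you. Indeed, as that answer already explains, it is surprising that the code matches anything beyond single words at all, because the callers of the routine overridden there appear to call it word by word. Consequently, a search term that spans more than one line will hardly ever be found.
That example can be improved quite naturally to allow searching across line boundaries, assuming lines are split at spaces. Replace the method findSubwords with this improved version: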
    // Needs org.apache.pdfbox.util.Matrix in addition to the imports of the original example.
    List<TextPositionSequence> findSubwordsImproved(PDDocument document, int page, String searchTerm) throws IOException
    {
        // Collects every text position of the page, in extraction order.
        final List<TextPosition> allTextPositions = new ArrayList<>();
        PDFTextStripper stripper = new PDFTextStripper()
        {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException
            {
                allTextPositions.addAll(textPositions);
                super.writeString(text, textPositions);
            }

            @Override
            protected void writeLineSeparator() throws IOException
            {
                if (!allTextPositions.isEmpty())
                {
                    TextPosition last = allTextPositions.get(allTextPositions.size() - 1);
                    // Add a virtual space at the line end unless the line already ends with one.
                    if (!" ".equals(last.getUnicode()))
                    {
                        // Place the zero-width space where the last glyph of the line ends.
                        Matrix textMatrix = last.getTextMatrix().clone();
                        textMatrix.setValue(2, 0, last.getEndX());
                        textMatrix.setValue(2, 1, last.getEndY());
                        TextPosition separatorSpace = new TextPosition(last.getRotation(), last.getPageWidth(), last.getPageHeight(),
                                textMatrix, last.getEndX(), last.getEndY(), last.getHeight(), 0, last.getWidthOfSpace(), " ",
                                new int[] {' '}, last.getFont(), last.getFontSize(), (int) last.getFontSizeInPt());
                        allTextPositions.add(separatorSpace);
                    }
                }
                super.writeLineSeparator();
            }
        };

        stripper.setSortByPosition(true);
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.getText(document);

        // Search the text of the whole page at once; overlapping matches are reported, too.
        final List<TextPositionSequence> hits = new ArrayList<>();
        TextPositionSequence word = new TextPositionSequence(allTextPositions);
        String string = word.toString();

        int fromIndex = 0;
        int index;
        while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
        {
            hits.add(word.subSequence(index, index + searchTerm.length()));
            fromIndex = index + 1;
        }

        return hits;
    }
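(SearchSubword https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/test/java/mkl/testarea/pdfbox2/extract/SearchSubword.java#L129 method findSubwordsImproved)
Here we collect all TextPosition entries and, in fact, even add virtual entries representing a space whenever PDFBox adds a line break. As soon as the whole page has been rendered, we search the collection of all these text positions.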
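To illustrate the idea without PDFBox, here is a minimal sketch of the same two steps on plain strings: joining lines with a space unless one is already there, then collecting all (possibly overlapping) match indices. The class JoinedLineSearch and its methods joinLines and findAll are hypothetical names made up for this illustration; they are not part of PDFBox or of the example class. The sketch also skips empty search terms, which the method above does not guard against.

```java
import java.util.ArrayList;
import java.util.List;

class JoinedLineSearch {
    // Appends each line, then a single space unless the text so far already ends with one.
    // This mirrors the check made in the writeLineSeparator override, on strings only.
    static String joinLines(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line);
            if (sb.length() > 0 && sb.charAt(sb.length() - 1) != ' ') {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    // Returns the start index of every match; advancing by one keeps overlapping matches.
    static List<Integer> findAll(String text, String term) {
        List<Integer> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits; // an empty term would otherwise never terminate the loop
        }
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(term, fromIndex)) > -1) {
            hits.add(index);
            fromIndex = index + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        // The second occurrence is split across the line break.
        List<String> lines = List.of("a ${var 2} b ${var", "2} c");
        System.out.println(findAll(joinLines(lines), "${var 2}")); // prints [2, 13]
    }
}
```

Because the line break becomes an ordinary space in the joined text, the second occurrence is found although it starts on one line and ends on the next.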
Applied to the sample document https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/test/resources/mkl/testarea/pdfbox2/extract/Variables.pdf from the original question, searching for "${var 2}" now returns all 8 occurrences, including those split across lines:
* Looking for '${var 2}' (improved)
Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
Page 1 at 164.39648, 357.28998 with width -46.081444 and last letter '}' at 112.46, 372.65
Page 1 at 174.97762, 388.72998 with width -56.662575 and last letter '}' at 112.46, 404.09
Page 1 at 153.74, 420.16998 with width -32.004005 and last letter '}' at 112.46, 435.65
Page 1 at 162.99922, 451.61 with width -43.692017 and last letter '}' at 112.46, 467.21
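The negative widths occur because, for the matches split across lines, the x coordinate of the end of the match is smaller than the x coordinate of its start.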
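If you need one rectangle per line (for example to highlight a hit), you can split a match wherever the y coordinate changes. The following is only a rough sketch with made-up coordinates: HitBoxes, Glyph, naiveWidth and perLineBoxes are hypothetical names, Glyph merely stands in for a text position, and the width is assumed to be computed as the end of the last glyph minus the start of the first. Real documents may need a tolerance when comparing y values.

```java
import java.util.ArrayList;
import java.util.List;

class HitBoxes {
    // Hypothetical stand-in for a text position: left x, baseline y, advance width.
    static final class Glyph {
        final float x, y, width;
        Glyph(float x, float y, float width) { this.x = x; this.y = y; this.width = width; }
    }

    // End of the last glyph minus start of the first; negative if the hit wraps to the left.
    static float naiveWidth(List<Glyph> hit) {
        Glyph first = hit.get(0);
        Glyph last = hit.get(hit.size() - 1);
        return last.x + last.width - first.x;
    }

    // Returns one {x, y, width} box per line, starting a new box whenever y changes.
    static List<float[]> perLineBoxes(List<Glyph> hit) {
        List<float[]> boxes = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= hit.size(); i++) {
            if (i == hit.size() || hit.get(i).y != hit.get(start).y) {
                Glyph first = hit.get(start);
                Glyph last = hit.get(i - 1);
                boxes.add(new float[] { first.x, first.y, last.x + last.width - first.x });
                start = i;
            }
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Two glyphs at the end of one line, two at the start of the next (invented values).
        List<Glyph> hit = List.of(new Glyph(160f, 357f, 6f), new Glyph(166f, 357f, 6f),
                new Glyph(100f, 372f, 6f), new Glyph(106f, 372f, 6f));
        System.out.println(naiveWidth(hit)); // prints -48.0
        for (float[] box : perLineBoxes(hit)) {
            System.out.println(box[0] + ", " + box[1] + " width " + box[2]);
        }
    }
}
```

Each per-line box has a positive width, while the naive width over the whole wrapped hit is negative, just as in the last four lines of the output above.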