How do I find all occurrences of specific text in a PDF and insert a page break above each one?

2023-12-10

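I have a tricky requirement for a PDF.

I need to search the PDF for a specific string: Property Number:

Each time it is found, I need to add a page break above it.

I have access to iText and Spire.PDF; I am looking at iText first.

From other posts here I have established that I need to use a PdfStamper.

The logic below adds a new page, which works.

However, in my case I only need a page break, not a blank page.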

var newFile = @"c:\temp\full.pdf";
var dest = @"c:\temp\dest.pdf";
var reader = new PdfReader(newFile);
if (File.Exists(dest))
{
  File.Delete(dest);
}

var stamper = new PdfStamper(reader, new FileStream(dest, FileMode.CreateNew));
var total = reader.NumberOfPages + 1;
for (var pageNumber = total; pageNumber > 0; pageNumber--)
{
  var pageContent = reader.GetPageContent(pageNumber);
  stamper.InsertPage(pageNumber, PageSize.A4);
}

stamper.Close();
reader.Close();

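The screenshot shows an example, so this would actually become 3 pages: the existing page, with a new page break inserted above the first occurrence of Property Number:

Another page break is needed above the second occurrence.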

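This answer shares a proof of concept for finding all occurrences of specific text in a PDF and inserting a page break above each one, using iText and Java. Porting it to iTextSharp and C# should not be too difficult.

For production use, some extra code is required, because the current code makes a number of assumptions; for example, it assumes non-rotated pages. It also does not handle annotations at all.

The task is actually a combination of two tasks, finding the text and inserting page breaks, so we need

a custom extraction strategy for text positions and
a tool for cutting pages.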

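SearchTextLocationExtractionStrategy

To extract the positions of custom text, we extend the iText LocationTextExtractionStrategy so that it can also report the positions of custom text strings, or more precisely of the matches of a regular expression: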

public class SearchTextLocationExtractionStrategy extends LocationTextExtractionStrategy {
    public SearchTextLocationExtractionStrategy(Pattern pattern) {
        super(new TextChunkLocationStrategy() {
            public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline) {
                // while baseLine has been changed to not neutralize
                // effects of rise, ascentLine and descentLine explicitly
                // have not: We want the actual positions.
                return new AscentDescentTextChunkLocation(baseline, renderInfo.getAscentLine(),
                        renderInfo.getDescentLine(), renderInfo.getSingleSpaceWidth());
            }
        });
        this.pattern = pattern;
    }

    static Field locationalResultField = null;
    static Method filterTextChunksMethod = null;
    static Method startsWithSpaceMethod = null;
    static Method endsWithSpaceMethod = null;
    static Field textChunkTextField = null;
    static Method textChunkSameLineMethod = null;
    static {
        try {
            locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            locationalResultField.setAccessible(true);
            filterTextChunksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("filterTextChunks",
                    List.class, TextChunkFilter.class);
            filterTextChunksMethod.setAccessible(true);
            startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace",
                    String.class);
            startsWithSpaceMethod.setAccessible(true);
            endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
            endsWithSpaceMethod.setAccessible(true);
            textChunkTextField = TextChunk.class.getDeclaredField("text");
            textChunkTextField.setAccessible(true);
            textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
            textChunkSameLineMethod.setAccessible(true);
        } catch (NoSuchFieldException | SecurityException | NoSuchMethodException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    public Collection<TextRectangle> getLocations(TextChunkFilter chunkFilter) {
        Collection<TextRectangle> result = new ArrayList<>();
        try {
            @SuppressWarnings("unchecked")
            List<TextChunk> filteredTextChunks = (List<TextChunk>) filterTextChunksMethod.invoke(this,
                    locationalResultField.get(this), chunkFilter);
            Collections.sort(filteredTextChunks);

            StringBuilder sb = new StringBuilder();
            List<AscentDescentTextChunkLocation> locations = new ArrayList<>();
            TextChunk lastChunk = null;
            for (TextChunk chunk : filteredTextChunks) {
                String chunkText = (String) textChunkTextField.get(chunk);
                if (lastChunk == null) {
                    // first chunk: nothing to compare with yet
                } else if ((boolean) textChunkSameLineMethod.invoke(chunk, lastChunk)) {
                    // insert a blank only if the previous chunk does not end with a space
                    // and the current chunk does not start with one
                    String lastChunkText = (String) textChunkTextField.get(lastChunk);
                    if (isChunkAtWordBoundary(chunk, lastChunk)
                            && !((boolean) startsWithSpaceMethod.invoke(this, chunkText))
                            && !((boolean) endsWithSpaceMethod.invoke(this, lastChunkText))) {
                        sb.append(' ');
                        LineSegment spaceBaseLine = new LineSegment(lastChunk.getEndLocation(),
                                chunk.getStartLocation());
                        locations.add(new AscentDescentTextChunkLocation(spaceBaseLine, spaceBaseLine, spaceBaseLine,
                                chunk.getCharSpaceWidth()));
                    }
                } else {
                    // the line is complete: search it, then start collecting the next line
                    addMatches(sb, locations, result);
                    sb.setLength(0);
                    locations.clear();
                }
                sb.append(chunkText);
                locations.add((AscentDescentTextChunkLocation) chunk.getLocation());
                lastChunk = chunk;
            }
            // the last line is not followed by a line change, so search it here
            if (sb.length() > 0) {
                addMatches(sb, locations, result);
            }
        } catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e) {
            // proof of concept: reflection failures are only logged
            e.printStackTrace();
        }
        return result;
    }

    // Adds one TextRectangle per pattern match in the collected line;
    // locations holds exactly one entry per character of line.
    void addMatches(CharSequence line, List<AscentDescentTextChunkLocation> locations,
            Collection<TextRectangle> result) {
        assert line.length() == locations.size();
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            int i = matcher.start();
            Vector baseStart = locations.get(i).getStartLocation();
            TextRectangle textRectangle = new TextRectangle(matcher.group(), baseStart.get(Vector.I1),
                    baseStart.get(Vector.I2));
            for (; i < matcher.end(); i++) {
                AscentDescentTextChunkLocation location = locations.get(i);
                textRectangle.add(location.getAscentLine().getBoundingRectange());
                textRectangle.add(location.getDescentLine().getBoundingRectange());
            }
            result.add(textRectangle);
        }
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
            super.renderText(info);
    }

    public static class AscentDescentTextChunkLocation extends TextChunkLocationDefaultImp {
        public AscentDescentTextChunkLocation(LineSegment baseLine, LineSegment ascentLine, LineSegment descentLine,
                float charSpaceWidth) {
            super(baseLine.getStartPoint(), baseLine.getEndPoint(), charSpaceWidth);
            this.ascentLine = ascentLine;
            this.descentLine = descentLine;
        }

        public LineSegment getAscentLine() {
            return ascentLine;
        }

        public LineSegment getDescentLine() {
            return descentLine;
        }

        final LineSegment ascentLine;
        final LineSegment descentLine;
    }

    public class TextRectangle extends Rectangle2D.Float {
        public TextRectangle(final String text, final float xStart, final float yStart) {
            super(xStart, yStart, 0, 0);
            this.text = text;
        }

        public String getText() {
            return text;
        }

        final String text;
    }

    final Pattern pattern;
}

(SearchTextLocationExtractionStrategy.java)
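The core idea of getLocations, keeping one location per character so that a regular expression match can be mapped back to a rectangle, can be illustrated without iText. The following is only a minimal sketch under assumptions: the class name MatchBoxSketch, the method matchBoxes, and the fixed-width character layout are hypothetical and not part of the answer's code.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchBoxSketch {
    /** Returns, for each match of pattern in line, the union of the boxes of the matched characters. */
    static List<Rectangle2D.Float> matchBoxes(Pattern pattern, String line, List<Rectangle2D.Float> charBoxes) {
        if (line.length() != charBoxes.size())
            throw new IllegalArgumentException("need exactly one box per character");
        List<Rectangle2D.Float> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            if (matcher.start() == matcher.end())
                continue; // an empty match covers no character
            Rectangle2D.Float box = (Rectangle2D.Float) charBoxes.get(matcher.start()).clone();
            for (int i = matcher.start() + 1; i < matcher.end(); i++)
                box.add(charBoxes.get(i)); // grow the box to include the next character
            result.add(box);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed layout: each character is 6 units wide and 10 units high, starting at x = 72, y = 700.
        String line = "Foo Property Number: 42";
        List<Rectangle2D.Float> charBoxes = new ArrayList<>();
        for (int i = 0; i < line.length(); i++)
            charBoxes.add(new Rectangle2D.Float(72 + 6 * i, 700, 6, 10));

        for (Rectangle2D.Float box : matchBoxes(Pattern.compile("Property Number"), line, charBoxes))
            System.out.println(box.getMinX() + " " + box.getMaxX() + " " + box.getMaxY());
    }
}
```

In the strategy above, the upper edge of such a rectangle (getMaxY) is what later serves as the split position.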

Because some of the required members of the base class are private or package private, we have to use reflection to access them.
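As an illustration of that reflection technique, here is a small stand-alone sketch. The classes Base and Sub and their members are hypothetical stand-ins for a library class with hidden state; only the java.lang.reflect calls mirror what the strategy does.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ReflectionSketch {
    /** Hypothetical stand-in for a library base class that hides its state. */
    static class Base {
        private final List<String> chunks = new ArrayList<>();

        void add(String chunk) {
            chunks.add(chunk);
        }

        private boolean startsWithSpace(String s) {
            return s.length() != 0 && s.charAt(0) == ' ';
        }
    }

    /** Subclass that reaches the private members of Base via reflection. */
    static class Sub extends Base {
        @SuppressWarnings("unchecked")
        List<String> chunks() {
            try {
                Field field = Base.class.getDeclaredField("chunks");
                field.setAccessible(true); // lift the private access restriction
                return (List<String>) field.get(this);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("base class layout changed", e);
            }
        }

        boolean callStartsWithSpace(String s) {
            try {
                Method method = Base.class.getDeclaredMethod("startsWithSpace", String.class);
                method.setAccessible(true);
                return (boolean) method.invoke(this, s);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("base class layout changed", e);
            }
        }
    }

    public static void main(String[] args) {
        Sub sub = new Sub();
        sub.add("Property");
        System.out.println(sub.chunks());
        System.out.println(sub.callStartsWithSpace(" Number"));
    }
}
```

The drawback of this approach is that it depends on the member names of one specific library version, so a sketch like this would fail at run time, not at compile time, if those names change.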

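AbstractPdfPageSplittingTool

The page splitting functionality of this tool has been extracted from the PdfVeryDenseMergeTool from this answer. Furthermore, the tool is abstract so that the split positions can be customized.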

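import java.io.IOException;
import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

public abstract class AbstractPdfPageSplittingTool {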
    public AbstractPdfPageSplittingTool(Rectangle size, float top) {
        this.pageSize = size;
        this.topMargin = top;
    }

    public void split(OutputStream outputStream, PdfReader... inputs) throws DocumentException, IOException {
        try {
            openDocument(outputStream);
            for (PdfReader reader : inputs) {
                split(reader);
            }
        } finally {
            closeDocument();
        }
    }

    void openDocument(OutputStream outputStream) throws DocumentException {
        final Document document = new Document(pageSize, 36, 36, topMargin, 36);
        final PdfWriter writer = PdfWriter.getInstance(document, outputStream);
        document.open();
        this.document = document;
        this.writer = writer;
        newPage();
    }

    void closeDocument() {
        try {
            document.close();
        } finally {
            this.document = null;
            this.writer = null;
            this.yPosition = 0;
        }
    }

    void newPage() {
        document.newPage();
        yPosition = pageSize.getTop(topMargin);
    }

    void split(PdfReader reader) throws IOException {
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            split(reader, page);
        }
    }

    void split(PdfReader reader, int page) throws IOException
    {
        PdfImportedPage importedPage = writer.getImportedPage(reader, page);
        PdfContentByte directContent = writer.getDirectContent();
        yPosition = pageSize.getTop();

        Rectangle pageSizeToImport = reader.getPageSize(page);
        float[] borderPositions = determineSplitPositions(reader, page);
        if (borderPositions == null || borderPositions.length < 2)
            return;

        for (int borderIndex = 0; borderIndex + 1 < borderPositions.length; borderIndex++) {
            float height = borderPositions[borderIndex] - borderPositions[borderIndex + 1];
            if (height <= 0)
                continue;

            directContent.saveState();
            directContent.rectangle(0, yPosition - height, pageSizeToImport.getWidth(), height);
            directContent.clip();
            directContent.newPath();

            writer.getDirectContent().addTemplate(importedPage, 0, yPosition - (borderPositions[borderIndex] - pageSizeToImport.getBottom()));

            directContent.restoreState();
            newPage();
        }
    }

    protected abstract float[] determineSplitPositions(PdfReader reader, int page);

    Document document = null;
    PdfWriter writer = null;
    float yPosition = 0;

    final Rectangle pageSize;
    final float topMargin;
}

(AbstractPdfPageSplittingTool.java)
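The arithmetic inside split(reader, page) may be easier to follow in isolation. The sketch below mirrors that arithmetic without iText; the names SplitGeometrySketch, Slice and slices are hypothetical, and it assumes every slice is placed at the same targetTop, whereas the tool resets yPosition through newPage().

```java
import java.util.ArrayList;
import java.util.List;

public class SplitGeometrySketch {
    /** Placement of one slice of the source page on a target page. */
    static final class Slice {
        final float clipBottom; // lower edge of the clip rectangle on the target page
        final float height;     // height of the clip rectangle
        final float shiftY;     // vertical translation applied to the imported source page

        Slice(float clipBottom, float height, float shiftY) {
            this.clipBottom = clipBottom;
            this.height = height;
            this.shiftY = shiftY;
        }
    }

    /** borders are y coordinates on the source page, sorted from top to bottom. */
    static List<Slice> slices(float[] borders, float sourceBottom, float targetTop) {
        List<Slice> result = new ArrayList<>();
        for (int i = 0; i + 1 < borders.length; i++) {
            float height = borders[i] - borders[i + 1];
            if (height <= 0)
                continue; // duplicate border: nothing to show
            float clipBottom = targetTop - height;
            float shiftY = targetTop - (borders[i] - sourceBottom);
            result.add(new Slice(clipBottom, height, shiftY));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example: an A4 page (height 842) with matches whose upper edges are at y = 700 and y = 400.
        float[] borders = { 842, 700, 400, 0 };
        for (Slice slice : slices(borders, 0, 842))
            System.out.println("clip from y=" + slice.clipBottom + " height=" + slice.height
                    + " shift source by " + slice.shiftY);
    }
}
```

Each slice becomes one output page: the source page is shifted up so that the slice's upper border sits at the top of the target page, and the clip rectangle hides everything outside the slice.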

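Using them in concert

To carry out the OP's task:

I need to search the PDF for a specific string: Property Number:

Each time it is found, I need to add a page break above it.

one can use the classes above like this: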

AbstractPdfPageSplittingTool tool = new AbstractPdfPageSplittingTool(PageSize.A4, 36) {
    @Override
    protected float[] determineSplitPositions(PdfReader reader, int page) {
        Collection<TextRectangle> locations = Collections.emptyList();
        try {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            SearchTextLocationExtractionStrategy strategy = new SearchTextLocationExtractionStrategy(
                    Pattern.compile("Property Number"));
            parser.processContent(page, strategy, Collections.emptyMap()).getResultantText();
            locations = strategy.getLocations(null);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        List<Float> borders = new ArrayList<>();
        for (TextRectangle rectangle : locations)
        {
            borders.add((float)rectangle.getMaxY());
        }

        Rectangle pageSize = reader.getPageSize(page);
        borders.add(pageSize.getTop());
        borders.add(pageSize.getBottom());
        Collections.sort(borders, Collections.reverseOrder());

        float[] result = new float[borders.size()];
        for (int i=0; i < result.length; i++)
            result[i] = borders.get(i);
        return result;
    }
};

tool.split(new FileOutputStream(RESULT), new PdfReader(SOURCE));

(SplitPages.java test method testSplitDocumentAboveAngestellter)

