How can I find all occurrences of specific text in a PDF and insert a page break above each one?


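I have a tricky requirement for a PDF.

I need to search the PDF for a specific string - Property Number:


I have access to iText and Spire.PDF; I am looking at iText first.

From other posts here I have determined that I need to use PdfStamper.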



// requires: using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
var newFile = @"c:\temp\full.pdf";
var dest = @"c:\temp\dest.pdf";
var reader = new PdfReader(newFile);
if (File.Exists(dest))
    File.Delete(dest);   // FileMode.CreateNew throws if the file already exists

var stamper = new PdfStamper(reader, new FileStream(dest, FileMode.CreateNew));
var total = reader.NumberOfPages + 1;
for (var pageNumber = total; pageNumber > 0; pageNumber--)
{
    var pageContent = reader.GetPageContent(pageNumber);
    stamper.InsertPage(pageNumber, PageSize.A4);
}
stamper.Close();
reader.Close();


In my sample document this would actually result in 3 pages, i.e. the existing page with a new page break inserted above the first occurrence of Property Number.

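This answer shares a proof of concept for finding all occurrences of specific text in a PDF and inserting a page break above each of them, using iText and Java. Porting it to iTextSharp and C# should not be too difficult.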



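The approach needs two building blocks:

  1. a custom extraction strategy that reports the positions of the text searched for, and
  2. a tool for cutting pages.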


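To extract the positions of the custom text, we extend the iText LocationTextExtractionStrategy so that it can also report the positions of a custom text string, or more precisely of the matches of a regular expression: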

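// imports assume iText 5.5.x (package com.itextpdf.text.pdf.parser)
import java.awt.geom.Rectangle2D;
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class SearchTextLocationExtractionStrategy extends LocationTextExtractionStrategy {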
    public SearchTextLocationExtractionStrategy(Pattern pattern) {
        super(new TextChunkLocationStrategy() {
            public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline) {
                // while baseLine has been changed to not neutralize
                // effects of rise, ascentLine and descentLine explicitly
                // have not: We want the actual positions.
                return new AscentDescentTextChunkLocation(baseline, renderInfo.getAscentLine(),
                        renderInfo.getDescentLine(), renderInfo.getSingleSpaceWidth());
            }
        });
        this.pattern = pattern;
    }

    static Field locationalResultField = null;
    static Method filterTextChunksMethod = null;
    static Method startsWithSpaceMethod = null;
    static Method endsWithSpaceMethod = null;
    static Field textChunkTextField = null;
    static Method textChunkSameLineMethod = null;
    static {
        try {
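            // the members below are private or package-private in iText, hence reflection
            locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            locationalResultField.setAccessible(true);
            filterTextChunksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("filterTextChunks",
                    List.class, TextChunkFilter.class);
            filterTextChunksMethod.setAccessible(true);
            startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace",
                    String.class);
            startsWithSpaceMethod.setAccessible(true);
            endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
            endsWithSpaceMethod.setAccessible(true);
            textChunkTextField = TextChunk.class.getDeclaredField("text");
            textChunkTextField.setAccessible(true);
            textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
            textChunkSameLineMethod.setAccessible(true);
        } catch (NoSuchFieldException | SecurityException | NoSuchMethodException e) {
            // the expected iText internals are missing, probably a different iText version
            e.printStackTrace();
        }
    }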

    public Collection<TextRectangle> getLocations(TextChunkFilter chunkFilter) {
        Collection<TextRectangle> result = new ArrayList<>();
        try {
            List<TextChunk> filteredTextChunks = (List<TextChunk>) filterTextChunksMethod.invoke(this,
                    locationalResultField.get(this), chunkFilter);
            // bring the chunks into reading order, as iText does before assembling the text
            Collections.sort(filteredTextChunks);

            // sb holds the text of the current line, locations the position of each of its characters
            StringBuilder sb = new StringBuilder();
            List<AscentDescentTextChunkLocation> locations = new ArrayList<>();
            TextChunk lastChunk = null;
            for (TextChunk chunk : filteredTextChunks) {
                String chunkText = (String) textChunkTextField.get(chunk);
                if (lastChunk == null) {
                    // Nothing to compare with at the start
                } else if ((boolean) textChunkSameLineMethod.invoke(chunk, lastChunk)) {
                    // we only insert a blank space if the trailing character of the previous string
                    // wasn't a space,
                    // and the leading character of the current string isn't a space
                    if (isChunkAtWordBoundary(chunk, lastChunk)
                            && !((boolean) startsWithSpaceMethod.invoke(this, chunkText))
                            && !((boolean) endsWithSpaceMethod.invoke(this, (String) textChunkTextField.get(lastChunk)))) {
                        sb.append(' ');
                        LineSegment spaceBaseLine = new LineSegment(lastChunk.getEndLocation(),
                                chunk.getStartLocation());
                        locations.add(new AscentDescentTextChunkLocation(spaceBaseLine, spaceBaseLine, spaceBaseLine,
                                chunk.getCharSpaceWidth()));
                    }
                } else {
                    // a new line starts: search the completed line for matches
                    assert sb.length() == locations.size();
                    Matcher matcher = pattern.matcher(sb);
                    while (matcher.find()) {
                        int i = matcher.start();
                        Vector baseStart = locations.get(i).getStartLocation();
                        TextRectangle textRectangle = new TextRectangle(matcher.group(), baseStart.get(Vector.I1),
                                baseStart.get(Vector.I2));
                        // grow the rectangle to cover every matched character
                        for (; i < matcher.end(); i++) {
                            AscentDescentTextChunkLocation location = locations.get(i);
                            textRectangle.add(location.getAscentLine().getBoundingRectange());
                            textRectangle.add(location.getDescentLine().getBoundingRectange());
                        }

                        result.add(textRectangle);
                    }

                    sb.setLength(0);
                    locations.clear();
                }
                sb.append(chunkText);
                locations.add((AscentDescentTextChunkLocation) chunk.getLocation());
                lastChunk = chunk;
            }
        } catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e) {
            // reflective access to the iText internals failed
            e.printStackTrace();
        }
        return result;
    }

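    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // process the text character by character so each character gets its own location
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
            super.renderText(info);
    }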

    public static class AscentDescentTextChunkLocation extends TextChunkLocationDefaultImp {
        public AscentDescentTextChunkLocation(LineSegment baseLine, LineSegment ascentLine, LineSegment descentLine,
                float charSpaceWidth) {
            super(baseLine.getStartPoint(), baseLine.getEndPoint(), charSpaceWidth);
            this.ascentLine = ascentLine;
            this.descentLine = descentLine;
        }

        public LineSegment getAscentLine() {
            return ascentLine;
        }

        public LineSegment getDescentLine() {
            return descentLine;
        }

        final LineSegment ascentLine;
        final LineSegment descentLine;
    }

    public class TextRectangle extends Rectangle2D.Float {
        public TextRectangle(final String text, final float xStart, final float yStart) {
            super(xStart, yStart, 0, 0);
            this.text = text;
        }

        public String getText() {
            return text;
        }

        final String text;
    }

    final Pattern pattern;
}
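
The core idea of the strategy is that the line text and the list of character locations are kept in step, so a regex match index can be mapped straight back to page coordinates. The following is only a minimal, iText-free sketch of that idea; the class `MatchToBoxDemo`, the method `locate` and the glyph sizes are hypothetical and use plain `java.awt.geom` rectangles in place of the iText location objects.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: maps regex matches in a line of text back to per-character boxes. */
class MatchToBoxDemo {
    /** Returns one bounding box per match, the union of the boxes of the matched characters. */
    static List<Rectangle2D.Float> locate(Pattern pattern, String line, List<Rectangle2D.Float> charBoxes) {
        if (line.length() != charBoxes.size())
            throw new IllegalArgumentException("need exactly one box per character");
        List<Rectangle2D.Float> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            if (matcher.end() == matcher.start())
                continue; // an empty match covers no character
            Rectangle2D.Float box = new Rectangle2D.Float();
            box.setRect(charBoxes.get(matcher.start()));
            for (int i = matcher.start() + 1; i < matcher.end(); i++)
                box.add(charBoxes.get(i)); // grow the box to include the next character
            result.add(box);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Property Number: 42";
        // Assumed layout: every glyph is 6 units wide and 10 units high, starting at x = 36, y = 700.
        List<Rectangle2D.Float> boxes = new ArrayList<>();
        for (int i = 0; i < line.length(); i++)
            boxes.add(new Rectangle2D.Float(36 + 6 * i, 700, 6, 10));
        for (Rectangle2D.Float box : locate(Pattern.compile("Property Number"), line, boxes))
            System.out.println(box.getMinX() + " .. " + box.getMaxX() + ", top " + box.getMaxY());
    }
}
```

The top of each returned box (`getMaxY()` in PDF coordinates, where y grows upwards) is the natural place to cut if the break has to go above the text.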




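The page-splitting functionality of this tool has been borrowed from the PdfVeryDenseMergeTool from this answer. Furthermore, the tool is abstract so that the routine determining the split positions can be customized.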

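// imports assume iText 5.5.x
import java.io.IOException;
import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

public abstract class AbstractPdfPageSplittingTool {
    public AbstractPdfPageSplittingTool(Rectangle size, float top) {
        this.pageSize = size;
        this.topMargin = top;
    }

    public void split(OutputStream outputStream, PdfReader... inputs) throws DocumentException, IOException {
        try {
            openDocument(outputStream);
            for (PdfReader reader : inputs) {
                split(reader);
            }
        } finally {
            closeDocument();
        }
    }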

    void openDocument(OutputStream outputStream) throws DocumentException {
        final Document document = new Document(pageSize, 36, 36, topMargin, 36);
        final PdfWriter writer = PdfWriter.getInstance(document, outputStream);
        document.open();
        this.document = document;
        this.writer = writer;
    }

    void closeDocument() {
        try {
            if (document != null)
                document.close();
        } finally {
            this.document = null;
            this.writer = null;
            this.yPosition = 0;
        }
    }

    void newPage() {
        document.newPage();
        yPosition = pageSize.getTop(topMargin);
    }

    void split(PdfReader reader) throws IOException {
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            split(reader, page);
        }
    }

    void split(PdfReader reader, int page) throws IOException {
        PdfImportedPage importedPage = writer.getImportedPage(reader, page);
        PdfContentByte directContent = writer.getDirectContent();
        yPosition = pageSize.getTop();

        Rectangle pageSizeToImport = reader.getPageSize(page);
        float[] borderPositions = determineSplitPositions(reader, page);
        if (borderPositions == null || borderPositions.length < 2)
            return;

        for (int borderIndex = 0; borderIndex + 1 < borderPositions.length; borderIndex++) {
            float height = borderPositions[borderIndex] - borderPositions[borderIndex + 1];
            if (height <= 0)
                continue;

            // clip to the strip between the two borders, then draw the whole page shifted into place
            directContent.saveState();
            directContent.rectangle(0, yPosition - height, pageSizeToImport.getWidth(), height);
            directContent.clip();
            directContent.newPath();

            writer.getDirectContent().addTemplate(importedPage, 0, yPosition - (borderPositions[borderIndex] - pageSizeToImport.getBottom()));

            directContent.restoreState();
            newPage();
        }
    }

    protected abstract float[] determineSplitPositions(PdfReader reader, int page);

    Document document = null;
    PdfWriter writer = null;
    float yPosition = 0;

    final Rectangle pageSize;
    final float topMargin;
}
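
The arithmetic in the loop of split(reader, page) may be easier to follow in isolation. The following is a rough sketch under stated assumptions rather than part of the tool: `StripDemo`, `Strip` and `strips` are hypothetical names, and the sketch only computes, for borders sorted from top to bottom, the height of each strip and the vertical offset at which the imported page would have to be drawn so that the strip starts at the top of the target page.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: turns descending split borders into the strips that each go on their own page. */
class StripDemo {
    /** One strip of the source page: its height and the y offset at which the whole page is drawn. */
    static final class Strip {
        final float height;
        final float offsetY;

        Strip(float height, float offsetY) {
            this.height = height;
            this.offsetY = offsetY;
        }
    }

    static List<Strip> strips(float[] borders, float sourceBottom, float targetTop) {
        List<Strip> result = new ArrayList<>();
        if (borders == null)
            return result;
        for (int i = 0; i + 1 < borders.length; i++) {
            float height = borders[i] - borders[i + 1];
            if (height <= 0)
                continue; // empty or inverted strip, nothing to draw
            // shift the page so that the upper border of the strip lands on the top of the target page
            result.add(new Strip(height, targetTop - (borders[i] - sourceBottom)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed A4-like page (0..842) with one split position at y = 600, listed twice on purpose.
        for (Strip strip : strips(new float[] { 842f, 600f, 600f, 0f }, 0f, 842f))
            System.out.println("height " + strip.height + ", offset " + strip.offsetY);
    }
}
```

Duplicate borders produce a zero-height strip, which is skipped, so two matches on the same line do not lead to an empty page in this sketch.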




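For the requirement from the question ("I need to search the PDF for a specific string - Property Number:"), the tool can be used like this: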



AbstractPdfPageSplittingTool tool = new AbstractPdfPageSplittingTool(PageSize.A4, 36) {
    @Override
    protected float[] determineSplitPositions(PdfReader reader, int page) {
        Collection<TextRectangle> locations = Collections.emptyList();
        try {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            SearchTextLocationExtractionStrategy strategy = new SearchTextLocationExtractionStrategy(
                    Pattern.compile("Property Number"));
            parser.processContent(page, strategy, Collections.emptyMap()).getResultantText();
            locations = strategy.getLocations(null);
        } catch (IOException e) {
            // the page content could not be parsed; fall back to not splitting this page
            e.printStackTrace();
        }

        // split above each match, i.e. at the top edge of its rectangle
        List<Float> borders = new ArrayList<>();
        for (TextRectangle rectangle : locations)
            borders.add((float) rectangle.getMaxY());

        // the page edges are the outermost borders
        Rectangle pageSize = reader.getPageSize(page);
        borders.add(pageSize.getTop());
        borders.add(pageSize.getBottom());
        Collections.sort(borders, Collections.reverseOrder());

        float[] result = new float[borders.size()];
        for (int i=0; i < result.length; i++)
            result[i] = borders.get(i);
        return result;
    }
};

tool.split(new FileOutputStream(RESULT), new PdfReader(SOURCE));
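
The split positions handed back to the tool are expected in descending order, from the top of the page down to its bottom. A small, iText-free sketch of how such an array could be assembled from the tops of the matches is shown here; `BorderDemo` and `borders` are hypothetical names and the coordinates are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: builds the descending border array from match tops and the page edges. */
class BorderDemo {
    static float[] borders(List<Float> matchTops, float pageTop, float pageBottom) {
        List<Float> borders = new ArrayList<>(matchTops);
        borders.add(pageTop);
        borders.add(pageBottom);
        borders.sort(Collections.reverseOrder()); // PDF y grows upwards, so top comes first
        float[] result = new float[borders.size()];
        for (int i = 0; i < result.length; i++)
            result[i] = borders.get(i);
        return result;
    }

    public static void main(String[] args) {
        // Assumed: two matches whose top edges are at y = 300 and y = 650 on a page spanning 0..842.
        float[] result = borders(Arrays.asList(300f, 650f), 842f, 0f);
        System.out.println(Arrays.toString(result)); // prints [842.0, 650.0, 300.0, 0.0]
    }
}
```

With these four borders the page would be cut into three strips, which fits the "3 pages" mentioned in the question for a page with two occurrences.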



