PDFBox - Line / Rectangle Extraction


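I am trying to extract text coordinates and line (or rectangle) coordinates from a PDF.

The TextPosition class has the methods getXDirAdj() and getYDirAdj(), which transform the coordinates according to the direction of the piece of text that the respective TextPosition object represents (corrected following a comment by @mkl). The final output is consistent, irrespective of the page rotation.

The coordinates needed in the output are relative to X0,Y0 (the top left corner of the page).

This is a slight modification of the solution by @Tilman Hausherr. The y coordinates are inverted (height - y) to keep them consistent with the coordinates from the text extraction process, and the output is also written to a csv file.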

    import java.awt.geom.GeneralPath;
    import java.awt.geom.Point2D;
    import java.awt.geom.Rectangle2D;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;

    import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

    // Written against the PDFBox 2.x API (PDDocument.load).
    public class LineCatcher extends PDFGraphicsStreamEngine
    {
        private static final GeneralPath linePath = new GeneralPath();
        private static ArrayList<Rectangle2D> rectList = new ArrayList<Rectangle2D>();
        private int clipWindingRule = -1;
        private static String headerRecord = "Text|Page|x|y|width|height|space|font";

        public LineCatcher(PDPage page)
        {
            super(page);
        }

        public static void main(String[] args) throws IOException
        {
            if( args.length != 4 )
            {
                usage();
            }
            else
            {
                PDDocument document = null;
                FileOutputStream fop = null;
                File file;
                Writer osw = null;
                int numPages;
                double page_height;
                try
                {
                    document = PDDocument.load( new File(args[0], args[1]) );
                    numPages = document.getNumberOfPages();
                    file = new File(args[2], args[3]);
                    fop = new FileOutputStream(file);

                    // if the file doesn't exist, then create it
                    if (!file.exists()) {
                        file.createNewFile();
                    }

                    osw = new OutputStreamWriter(fop, StandardCharsets.UTF_8);
                    osw.write(headerRecord + System.lineSeparator());
                    System.out.println("Line Processing numPages:" + numPages);
                    for (int n = 0; n < numPages; n++) {
                        System.out.println("Line Processing page:" + n);
                        rectList = new ArrayList<Rectangle2D>();
                        PDPage page = document.getPage(n);
                        page_height = page.getCropBox().getUpperRightY();
                        LineCatcher lineCatcher = new LineCatcher(page);
                        // run the content stream; strokePath() fills rectList
                        lineCatcher.processPage(page);

                        try {
                            for(Rectangle2D rect:rectList) {

                                String pageNum = Integer.toString(n + 1);
                                String x = Double.toString(rect.getX());
                                String y = Double.toString(page_height - rect.getY()) ;
                                String w = Double.toString(rect.getWidth());
                                String h = Double.toString(rect.getHeight());
                                writeToFile(pageNum, x, y, w, h, osw);
                            }

                            rectList = null;
                            page = null;
                            lineCatcher = null;
                        }
                        catch(IOException io){
                            throw new IOException("Failed to Parse document for line processing. Incorrect document format. Page:" + n, io);
                        }
                    }
                }
                catch(IOException io){
                    throw new IOException("Failed to Parse document for line processing. Incorrect document format.", io);
                }
                finally
                {
                    if ( osw != null ){
                        osw.close();
                    }
                    if( document != null )
                    {
                        document.close();
                    }
                }
            }
        }

        private static void writeToFile(String pageNum, String x, String y, String w, String h, Writer osw) throws IOException {
            // "^" marks a line record; "999" fills the space column.
            // The value of the font column is missing from the source and is left empty here.
            String c = "^" + "|" +
                    pageNum + "|" +
                    x + "|" +
                    y + "|" +
                    w + "|" +
                    h + "|" +
                    "999" + "|" +
                    "";
            osw.write(c + System.lineSeparator());
        }

        @Override
        public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
        {
            // to ensure that the path is created in the right direction, we have to create
            // it by combining single lines instead of creating a simple rectangle
            linePath.moveTo((float) p0.getX(), (float) p0.getY());
            linePath.lineTo((float) p1.getX(), (float) p1.getY());
            linePath.lineTo((float) p2.getX(), (float) p2.getY());
            linePath.lineTo((float) p3.getX(), (float) p3.getY());

            // close the subpath instead of adding the last line so that a possible set line
            // cap style isn't taken into account at the "beginning" of the rectangle
            linePath.closePath();
        }

        @Override
        public void drawImage(PDImage pdi) throws IOException
        {
            // images are not of interest here
        }

        @Override
        public void clip(int windingRule) throws IOException
        {
            // the clipping path will not be updated until the succeeding painting operator is called
            clipWindingRule = windingRule;
        }

        @Override
        public void moveTo(float x, float y) throws IOException
        {
            linePath.moveTo(x, y);
        }

        @Override
        public void lineTo(float x, float y) throws IOException
        {
            linePath.lineTo(x, y);
        }

        @Override
        public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
        {
            linePath.curveTo(x1, y1, x2, y2, x3, y3);
        }

        @Override
        public Point2D getCurrentPoint() throws IOException
        {
            return linePath.getCurrentPoint();
        }

        @Override
        public void closePath() throws IOException
        {
            linePath.closePath();
        }

        @Override
        public void endPath() throws IOException
        {
            if (clipWindingRule != -1)
            {
                linePath.setWindingRule(clipWindingRule);
                getGraphicsState().intersectClippingPath(linePath);
                clipWindingRule = -1;
            }
            linePath.reset();
        }

        @Override
        public void strokePath() throws IOException
        {
            // a stroked path is a line or rectangle outline: remember its bounding box
            rectList.add(linePath.getBounds2D());
            linePath.reset();
        }

        @Override
        public void fillPath(int windingRule) throws IOException
        {
            linePath.reset();
        }

        @Override
        public void fillAndStrokePath(int windingRule) throws IOException
        {
            linePath.reset();
        }

        @Override
        public void shadingFill(COSName cosn) throws IOException
        {
        }

        /**
         * This will print the usage for this document.
         */
        private static void usage()
        {
            // main() expects four arguments: directory and file name for both input and output
            System.err.println( "Usage: java " + LineCatcher.class.getName() + " <input-dir> <input-pdf> <output-dir> <output-file>");
        }
    }
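
The rectangles collected above are bounding boxes of java.awt.geom paths, so it may help to see what such a bounding box looks like for a plain line. The following is a minimal sketch and not part of the original code: the class name PathBoundsSketch, the helper boundsOfSegment and the page height of 800 are made up for illustration, and it assumes that the bounding box of each stroked path is what ends up in rectList.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

final class PathBoundsSketch {

    /** Bounding box of a straight segment, built with moveTo/lineTo like a collected path. */
    static Rectangle2D boundsOfSegment(float x0, float y0, float x1, float y1) {
        GeneralPath path = new GeneralPath();
        path.moveTo(x0, y0);
        path.lineTo(x1, y1);
        return path.getBounds2D();
    }

    public static void main(String[] args) {
        double pageHeight = 800; // hypothetical crop box height

        // Vertical line from y = 700 down to y = 500 in PDF user space (y grows upwards).
        Rectangle2D r = boundsOfSegment(100, 700, 100, 500);

        // getY() is the smallest y of the path, i.e. the LOWER end point in PDF space.
        double lowerEndFromTop = pageHeight - r.getY();
        double upperEndFromTop = pageHeight - (r.getY() + r.getHeight());

        System.out.println("bounds: " + r.getX() + ", " + r.getY()
                + ", " + r.getWidth() + ", " + r.getHeight());
        System.out.println("lower end from top: " + lowerEndFromTop);
        System.out.println("upper end from top: " + upperEndFromTop);
    }
}
```

A vertical line yields a rectangle of zero width and a horizontal line one of zero height. Because getY() is the lower end in PDF space, page_height - rect.getY() measures the lower end of a vertical line from the top of the page, not its upper end.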


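Plot of the extracted text and lines:

Green: text. Red: line coordinates as extracted. Black: expected coordinates (obtained after applying a transformation to the output).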


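What are the possible options for taking the rotation into account and getting a consistent output of line/rectangle coordinates using PDFBox?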

As far as I understand the requirements here, the OP works in a coordinate system with the origin in the upper left corner of the visible page (taking the page rotation into account), x coordinates increasing to the right, y coordinates increasing downwards, and the units being the PDF default user space units (usually 1/72 inch).


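In this coordinate system, (horizontal or vertical) lines are to be extracted in the form of

  • the coordinates of the left / top end point and
  • the width / height.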
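
A minimal sketch of the coordinate system described above, for a single point. It is an illustration under assumptions, not code from the original post: the class VisiblePageCoords and the method toVisible are hypothetical names, the crop box is passed as four plain numbers instead of a PDRectangle so that the example runs without PDFBox, and only page rotations of 0, 90, 180 and 270 degrees (clockwise) are handled.

```java
import java.awt.geom.Point2D;

final class VisiblePageCoords {

    /**
     * Maps a point from PDF user space (origin at the lower left, y upwards) to the
     * system with the origin in the upper left corner of the rotated, visible page
     * and y growing downwards. llx, lly, urx, ury are the crop box corners.
     */
    static Point2D toVisible(double px, double py, int rotation,
                             double llx, double lly, double urx, double ury) {
        switch (rotation) {
            case 0:
                return new Point2D.Double(px - llx, ury - py);
            case 90:
                // the lower left corner of the unrotated page becomes the visible upper left
                return new Point2D.Double(py - lly, px - llx);
            case 180:
                // the lower right corner becomes the visible upper left
                return new Point2D.Double(urx - px, py - lly);
            case 270:
                // the upper right corner becomes the visible upper left
                return new Point2D.Double(ury - py, urx - px);
            default:
                throw new IllegalArgumentException("Unsupported page rotation " + rotation);
        }
    }
}
```

For each rotation, the corner of the unrotated page that ends up in the upper left of the visible page maps to (0, 0), which is a quick way to sanity-check the four cases.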

Transforming LineCatcher results



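LineCatcher reports its rectangles in unrotated PDF user space, so a transformation has to be applied to its results. To do so, replace the loop

    for(Rectangle2D rect:rectList) {
        String pageNum = Integer.toString(n + 1);
        String x = Double.toString(rect.getX());
        String y = Double.toString(page_height - rect.getY()) ;
        String w = Double.toString(rect.getWidth());
        String h = Double.toString(rect.getHeight());
        writeToFile(pageNum, x, y, w, h, osw);
    }

with the following rotation-aware version: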


    int pageRotation = page.getRotation();
    PDRectangle pageCropBox = page.getCropBox();

    for(Rectangle2D rect:rectList) {
        String pageNum = Integer.toString(n + 1);
        String x, y, w, h;
        switch(pageRotation) {
        case 0:
            x = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
            // top edge of the rectangle, measured from the top of the crop box
            y = Double.toString(pageCropBox.getUpperRightY() - (rect.getY() + rect.getHeight()));
            w = Double.toString(rect.getWidth());
            h = Double.toString(rect.getHeight());
            break;
        case 90:
            x = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
            y = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
            w = Double.toString(rect.getHeight());
            h = Double.toString(rect.getWidth());
            break;
        case 180:
            x = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
            y = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
            w = Double.toString(rect.getWidth());
            h = Double.toString(rect.getHeight());
            break;
        case 270:
            // the top of the unrotated page becomes the left edge of the visible page
            x = Double.toString(pageCropBox.getUpperRightY() - (rect.getY() + rect.getHeight()));
            y = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
            w = Double.toString(rect.getHeight());
            h = Double.toString(rect.getWidth());
            break;
        default:
            throw new IOException(String.format("Unsupported page rotation %d on page %d.", pageRotation, n + 1));
        }
        writeToFile(pageNum, x, y, w, h, osw);
    }

(ExtractLinesWithDir test testExtractLineRotationTestWithDir)
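
The same transformation can also be written with java.awt.geom.AffineTransform, which may be handy as a cross-check of the case-by-case arithmetic. This is only a sketch under assumptions: RotatedRectSketch, toVisible and transform are illustrative names, the crop box is passed as plain numbers so that the example runs without PDFBox, and the rectangle is assumed to be in unrotated PDF user space as returned by LineCatcher. The upper edge of a rectangle is taken as upperRightY - (y + height).

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

final class RotatedRectSketch {

    /** Transform from PDF user space to the top-left-origin system of the rotated page. */
    static AffineTransform toVisible(int rotation, double llx, double lly, double urx, double ury) {
        // AffineTransform(m00, m10, m01, m11, m02, m12) means
        //   x' = m00 * x + m01 * y + m02
        //   y' = m10 * x + m11 * y + m12
        switch (rotation) {
            case 0:   // x' = x - llx, y' = ury - y
                return new AffineTransform(1, 0, 0, -1, -llx, ury);
            case 90:  // x' = y - lly, y' = x - llx
                return new AffineTransform(0, 1, 1, 0, -lly, -llx);
            case 180: // x' = urx - x, y' = y - lly
                return new AffineTransform(-1, 0, 0, 1, urx, -lly);
            case 270: // x' = ury - y, y' = urx - x
                return new AffineTransform(0, -1, -1, 0, ury, urx);
            default:
                throw new IllegalArgumentException("Unsupported page rotation " + rotation);
        }
    }

    /** Left/top corner plus width/height of rect as seen on the rotated page. */
    static Rectangle2D transform(Rectangle2D rect, int rotation,
                                 double llx, double lly, double urx, double ury) {
        return toVisible(rotation, llx, lly, urx, ury)
                .createTransformedShape(rect)
                .getBounds2D();
    }

    public static void main(String[] args) {
        // Hypothetical 600 x 800 crop box and a rectangle near the top of the unrotated page.
        Rectangle2D rect = new Rectangle2D.Double(100, 700, 50, 20);
        for (int rotation : new int[] {0, 90, 180, 270}) {
            System.out.println(rotation + ": " + transform(rect, rotation, 0, 0, 600, 800));
        }
    }
}
```

Taking the bounding box of the transformed shape yields the left/top corner directly, and width and height swap automatically for 90 and 270 degrees.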


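The OP describes the coordinates by referring to the TextPosition class methods getXDirAdj() and getYDirAdj(). Indeed, these methods return coordinates in a coordinate system with the origin in the upper left corner of the page and y coordinates increasing downwards, after rotating the page so that the text is drawn upright.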



