How to determine artificial bold, artificial italic, and artificial outline text styles using PDFBox

2024-05-03

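I am using PDFBox to validate PDF documents. One requirement is to check for the following types of text in a PDF:

Artificial bold style text
Artificial italic style text
Artificial outline style text

I searched the PDFBox API list but could not find any such API.

Can anyone help me and tell me how to determine the different types of artificial font/text styles present in a PDF using PDFBox?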

The general procedure and a PDFBox issue

In theory, one should start by deriving a class from PDFTextStripper and overriding its method:

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

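Your override should then work with the List<TextPosition> textPositions instead of the String text; each TextPosition essentially represents a single letter together with information on the graphics state that was active when that letter was drawn.

Unfortunately, the textPositions list does not contain the correct contents in the current version 1.8.3. For example, for the line "This is normal text" from your PDF, the method writeString is called four times, once each for the strings "This", "is", "normal", and "text". Each time, however, the textPositions list contains the TextPosition instances for the letters of the last string, "text".

This turned out to be already recognized as PDFBox issue PDFBOX-1804 (https://issues.apache.org/jira/browse/PDFBOX-1804). In the meantime, that issue has been resolved as fixed for versions 1.8.4 and 2.0.0.

That being said, as soon as you have a fixed PDFBox version, you can check for some of the artificial styles as follows: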

Artificial italic text

This text style is created in the page content like this:

BT
/F0 1 Tf
24 0 5.10137 24 66 695.5877 Tm
0 Tr
[<03>]TJ
...

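The relevant part is the setting of the text matrix by Tm. The 5.10137 is the factor by which the text is sheared.

When you inspect a TextPosition textPosition as indicated above, you can query this value using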

textPosition.getTextPos().getValue(1, 0)

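If this value is noticeably greater than 0.0, you have artificial italics. If it is noticeably less than 0.0, you have artificial backwards italics.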
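As a rough illustration of this check, here is a minimal sketch that classifies a shear value and also converts it into a slant angle. The class ShearCheck, its method names, and the EPSILON tolerance are assumptions made up for this example; none of them is part of PDFBox. The arguments stand for the third and fourth numbers of the Tm operands shown above.

```java
/** Hypothetical helper, not part of PDFBox: interprets the shear entry of a text matrix. */
final class ShearCheck {
    /** Assumed tolerance below which a shear value is treated as zero. */
    static final double EPSILON = 1e-4;

    /** shear is meant to be the value of textPosition.getTextPos().getValue(1, 0). */
    static String classify(double shear) {
        if (shear > EPSILON) {
            return "artificial italic";
        }
        if (shear < -EPSILON) {
            return "artificial backwards italic";
        }
        return "upright";
    }

    /**
     * Slant angle in degrees for operands "a b c d e f Tm",
     * given c (the shear) and d (the vertical scale).
     */
    static double slantDegrees(double shear, double verticalScale) {
        return Math.toDegrees(Math.atan2(shear, verticalScale));
    }

    public static void main(String[] args) {
        // Values taken from the sample: 24 0 5.10137 24 66 695.5877 Tm
        System.out.println(classify(5.10137));
        System.out.printf("%.1f degrees%n", slantDegrees(5.10137, 24));
    }
}
```

Comparing against a small tolerance rather than exactly 0.0 is a design choice of this sketch; it avoids flagging text whose matrix carries tiny rounding errors.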

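Artificial bold or outline text

These artificial styles print each letter twice using different rendering modes; for example, the capital 'T' in the case of bold: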

0 0 0 1 k
...
BT
/F0 1 Tf 
24 0 0 24 66.36 729.86 Tm 
<03>Tj 
4 M 0.72 w 
0 0 Td 
1 Tr 
0 0 0 1 K
<03>Tj
ET

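(That is, the letter is first drawn in regular mode, filling the letter area, and then in outline mode, drawing a line along the letter border, both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.)

And in the case of outline: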

BT
/F0 1 Tf
24 0 0 24 66 661.75 Tm
0 0 0 0 k
<03>Tj
/GS1 gs
4 M 0.288 w 
0 0 Td
1 Tr
0 0 0 1 K
<03>Tj
ET

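(That is, the letter is first drawn in regular mode in white, CMYK 0, 0, 0, 0, filling the letter area, and then in outline mode, drawing a line along the letter border in black, CMYK 0, 0, 0, 1; this leaves the impression of an outlined, black-on-white letter.)

Unfortunately, the PDFBox PDFTextStripper does not keep track of the text rendering mode. Furthermore, it explicitly drops duplicate character occurrences in approximately the same position. Thus, the task of recognizing these artificial styles is not trivial.

If you really need to do this, you would have to change TextPosition to also contain the rendering mode, PDFStreamEngine to add it to the generated TextPosition instances, and PDFTextStripper to not drop duplicate glyphs in processTextPosition.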

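Correction

I wrote:

"Unfortunately, the PDFBox PDFTextStripper does not keep track of the text rendering mode."

This is not entirely true: you can find the current rendering mode using getGraphicsState().getTextState().getRenderingMode(). This means that during processTextPosition you do have the rendering mode available and can try to store the rendering mode (and color!) information for the given TextPosition somewhere, e.g. in some Map<TextPosition, ...>, for later use.

"Furthermore, it explicitly drops duplicate character occurrences in approximately the same position."

You can disable this behavior by calling setSuppressDuplicateOverlappingText(false).

With these two changes you should be able to do the required tests for artificial bold and outline as well.

The latter change might not even be necessary if you store and check the styles early in processTextPosition.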

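How to retrieve rendering mode and color

As mentioned in the Correction, it is indeed possible to retrieve rendering mode and color information by collecting that information in a processTextPosition override.

To this the OP commented that

"the stroking and non-stroking colors are always black"

This was a bit surprising at first, but after a look at PDFTextStripper.properties (from which the operators supported during text extraction are initialized), the reason became clear: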

# The following operators are not relevant to text extraction,
# so we can silently ignore them.
...
K
k

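Thus, color setting operators (especially those for CMYK colors, as used in the present document) are ignored in this context! Fortunately, the implementations of these operators for the PageDrawer can be used in this context, too.

The following proof of concept therefore shows how all the required information can be retrieved.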

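// Imports below assume the PDFBox 1.8.x package layout.
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class TextWithStateStripperSimple extends PDFTextStripper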
{
    public TextWithStateStripperSimple() throws IOException {
        super();
        setSuppressDuplicateOverlappingText(false);
        registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
        registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
        strokingColor.put(text, getGraphicsState().getStrokingColor());
        nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());

        super.processTextPosition(text);
    }

    Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
    Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
    Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text + '\n');

        for (TextPosition textPosition: textPositions)
        {
            StringBuilder textBuilder = new StringBuilder();
            textBuilder.append(textPosition.getCharacter())
                       .append(" - shear by ")
                       .append(textPosition.getTextPos().getValue(1, 0))
                       .append(" - ")
                       .append(textPosition.getX())
                       .append(" ")
                       .append(textPosition.getY())
                       .append(" - ")
                       .append(renderingMode.get(textPosition))
                       .append(" - ")
                       .append(toString(strokingColor.get(textPosition)))
                       .append(" - ")
                       .append(toString(nonStrokingColor.get(textPosition)))
                       .append('\n');
            writeString(textBuilder.toString());
        }
    }

    String toString(PDColorState colorState)
    {
        if (colorState == null)
            return "null";
        StringBuilder builder = new StringBuilder();
        for (float f: colorState.getColorSpaceValue())
        {
            builder.append(' ')
                   .append(f);
        }

        return builder.toString();
    }
}

Using this, you get the period "." in normal text as:

. - shear by 0.0 - 256.5701 88.6875 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

In artificial bold text you get:

. - shear by 0.0 - 378.86 122.140015 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0
. - shear by 0.0 - 378.86002 122.140015 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

In artificial italics:

. - shear by 5.10137 - 327.121 156.4123 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

And in artificial outline:

. - shear by 0.0 - 357.25 190.25 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0
. - shear by 0.0 - 357.25 190.25 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0

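There you are: all the information required to recognize those artificial styles. Now you merely have to analyze the data.

By the way, have a look at the artificial bold case: the coordinates might not always be identical but merely very similar. Thus, some leniency is required when testing whether two text position objects describe the same position.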
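One possible way to analyze such data is sketched below in plain Java, independent of PDFBox. Everything in it is an assumption made for illustration: the ArtificialStyleAnalyzer and Glyph classes, the TOLERANCE value, and the heuristic that a fill followed by a stroke in the same color means bold while differing colors mean outline. The sample values are the ones printed by the proof of concept above.

```java
import java.util.Arrays;

/** Hypothetical analysis helper, not part of PDFBox. */
final class ArtificialStyleAnalyzer {
    /** Assumed tolerance, in the units of getX()/getY(), for treating two positions as equal. */
    static final float TOLERANCE = 0.01f;

    /** One captured glyph: the values the stripper above prints per TextPosition. */
    static final class Glyph {
        final String character;
        final float x;
        final float y;
        final int renderingMode;
        final float[] stroking;
        final float[] nonStroking;

        Glyph(String character, float x, float y, int renderingMode,
              float[] stroking, float[] nonStroking) {
            this.character = character;
            this.x = x;
            this.y = y;
            this.renderingMode = renderingMode;
            this.stroking = stroking;
            this.nonStroking = nonStroking;
        }
    }

    /** Lenient comparison: same character, coordinates equal within TOLERANCE. */
    static boolean samePosition(Glyph a, Glyph b) {
        return a.character.equals(b.character)
                && Math.abs(a.x - b.x) <= TOLERANCE
                && Math.abs(a.y - b.y) <= TOLERANCE;
    }

    /** Classifies two consecutive glyphs: a fill (mode 0) followed by a stroke (mode 1) at the same place. */
    static String classifyPair(Glyph first, Glyph second) {
        if (first.renderingMode != 0 || second.renderingMode != 1 || !samePosition(first, second)) {
            return "none";
        }
        // Heuristic: stroke in the fill color thickens the glyph; a different color outlines it.
        return Arrays.equals(first.nonStroking, second.stroking)
                ? "artificial bold"
                : "artificial outline";
    }

    public static void main(String[] args) {
        float[] black = {0f, 0f, 0f, 1f};
        float[] white = {0f, 0f, 0f, 0f};
        Glyph boldFill = new Glyph(".", 378.86f, 122.140015f, 0, black, black);
        Glyph boldStroke = new Glyph(".", 378.86002f, 122.140015f, 1, black, black);
        Glyph outlineFill = new Glyph(".", 357.25f, 190.25f, 0, black, white);
        Glyph outlineStroke = new Glyph(".", 357.25f, 190.25f, 1, black, white);
        System.out.println(classifyPair(boldFill, boldStroke));
        System.out.println(classifyPair(outlineFill, outlineStroke));
    }
}
```

The sketch only compares directly paired glyphs. In the sample content streams each letter is drawn twice in immediate succession, but other producers may order their drawing operations differently, so a real check might need to search nearby glyphs instead.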
