"extract_text" doesn't output the same transformation matrix in version 3.17 as in 3.16. #2353

Open
@ghbm-itk

Description

I'm trying to extract text from a PDF together with the position of each piece of text.
pypdf 3.16.0 gives the expected result, but pypdf 3.17.3 does not.

Environment

Windows-10-10.0.19045-SP0
pypdf==3.16.0, crypt_provider=('cryptography', '41.0.3'), PIL=9.5.0
AND
pypdf==3.17.3, crypt_provider=('cryptography', '41.0.7'), PIL=9.5.0

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf
file_path = "list.pdf"
reader = pypdf.PdfReader(file_path)

text_parts = []

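# visitor_text callback: text, current transformation matrix (cm),
# text matrix (tm), font dictionary (fd) and font size (fs).
def visitor(text, cm, tm, fd, fs):
    # Record the matrices only for the one label under investigation.
    if text.strip() == "Flyttesagsnr.:":
        text_parts.append((cm, tm, text))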

reader.pages[0].extract_text(visitor_text=visitor)

print(text_parts)

Unfortunately I can't share the PDF because it's confidential, and I haven't been able to strip the confidential content from the document while keeping the bug.
I know this may make the bug hard to reproduce.

Results

In version 3.17 I get:

[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], ' Flyttesagsnr.:')]

In version 3.16 I get:

[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, -1.0, 448.313, 352.05], ' Flyttesagsnr.:')]

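As you can see, in version 3.17 tm[4] and tm[5] are both 0 and tm[3] is 1.0 instead of -1.0, so the text matrix is the identity matrix and no longer carries any position. That is definitely wrong.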
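To make the practical impact concrete, here is a minimal sketch of how the two matrices would typically be combined into a position. It assumes the standard PDF convention that a matrix `[a, b, c, d, e, f]` maps a point `(x, y)` to `(a*x + c*y + e, b*x + d*y + f)`, and that the text origin is obtained by applying `tm` first and `cm` second. The helper name `text_origin_on_page` is hypothetical and not part of pypdf.

```python
def text_origin_on_page(cm, tm):
    """Map the text-space origin (0, 0) through tm, then through cm.

    Both arguments are 6-element lists [a, b, c, d, e, f] as passed
    to the visitor. Hypothetical helper, not a pypdf function.
    """
    # Applying tm to (0, 0) leaves only its translation part.
    x, y = tm[4], tm[5]
    # Apply cm to that point (assumed PDF matrix convention).
    a, b, c, d, e, f = cm
    return (a * x + c * y + e, b * x + d * y + f)


# Values copied from the results above.
cm = [0.75, 0.0, 0.0, -0.75, 0.0, 841.68]
tm_316 = [1.0, 0.0, 0.0, -1.0, 448.313, 352.05]
tm_317 = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

print("3.16:", text_origin_on_page(cm, tm_316))
print("3.17:", text_origin_on_page(cm, tm_317))
```

Under these assumptions the 3.16 matrices place the text at roughly (336.2, 577.6), while the 3.17 matrices collapse it to the translation part of `cm` alone, which would be the same for every piece of text that shares this `cm`.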

Labels

needs-pdf: The issue needs a PDF file to show the problem
workflow-advanced-text-extraction: Getting coordinates, font weight, font type, ...