Description
I'm trying to extract text from a PDF together with the position of the text.
With pypdf 3.16.0 I get the expected result, but with 3.17.3 I don't.
Environment
Windows-10-10.0.19045-SP0
pypdf==3.16.0, crypt_provider=('cryptography', '41.0.3'), PIL=9.5.0
AND
pypdf==3.17.3, crypt_provider=('cryptography', '41.0.7'), PIL=9.5.0
Code + PDF
This is a minimal, complete example that shows the issue:
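import pypdf

file_path = "list.pdf"
reader = pypdf.PdfReader(file_path)
text_parts = []


def visitor(text, cm, tm, fd, fs):
    # Record the current transformation matrix (cm) and the text matrix (tm)
    # for the one label whose position differs between versions.
    if text.strip() == "Flyttesagsnr.:":
        text_parts.append((cm, tm, text))


reader.pages[0].extract_text(visitor_text=visitor)
print(text_parts)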
Unfortunately I can't share the PDF since it's confidential. I haven't been able to produce a redacted version of the document that still shows the bug.
I know this might make the bug hard to reproduce.
Results
In version 3.17 I get:
[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], ' Flyttesagsnr.:')]
In version 3.16 I get:
[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, -1.0, 448.313, 352.05], ' Flyttesagsnr.:')]
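As you can see, in version 3.17 tm is the identity matrix: tm[4] and tm[5] (the translation part) are both 0, and tm[3] is 1.0 instead of -1.0. That is definitely wrong, because the position of the text is lost.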
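To make the impact concrete, here is a minimal sketch of how I turn the two matrices into a position on the page. It assumes that cm and tm use the standard PDF [a, b, c, d, e, f] layout and that the text origin is mapped through tm first and then through cm. The helper name `text_origin_in_user_space` is my own and not part of pypdf.

```python
def text_origin_in_user_space(cm, tm):
    """Map the text-space origin (0, 0) through tm, then through cm.

    Both matrices are assumed to be in PDF order [a, b, c, d, e, f],
    where a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f).
    """
    # The origin (0, 0) mapped through tm is just tm's translation part.
    tx, ty = tm[4], tm[5]
    a, b, c, d, e, f = cm
    return (a * tx + c * ty + e, b * tx + d * ty + f)


cm = [0.75, 0.0, 0.0, -0.75, 0.0, 841.68]

# Matrices reported above by 3.16 and 3.17 for the same piece of text.
tm_316 = [1.0, 0.0, 0.0, -1.0, 448.313, 352.05]
tm_317 = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

print(text_origin_in_user_space(cm, tm_316))  # roughly (336.23, 577.64)
print(text_origin_in_user_space(cm, tm_317))  # x is 0 and y is 841.68
```

Under these assumptions, the 3.16 values place the label inside the page, while the 3.17 values collapse it to x = 0 at the offset given by cm alone, so every matching text fragment would end up at the same point.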