[Bug 59870] FILEOPEN PDF: Incorrect text encoding

bugzilla-daemon Fri, 11 Apr 2025 08:44:04 -0700

https://bugs.documentfoundation.org/show_bug.cgi?id=59870


--- Comment #27 from Khaled Hosny <[email protected]> ---
(In reply to Eyal Rozenberg from comment #26)
> (In reply to Khaled Hosny from comment #25)
> > The PDF metadata shows that it was produced by Ghostscript. The PDF font
> > dictionaries contain no ToUnicode CMaps, nor do they use any standard PDF
> > font encoding. As such there is not much textual data that can be extracted
> > from the PDF.
> 
> But the text is _there_... I'm no PDF expert (nor even have a decent tool
> for exploring PDF files' raw structure), but - if the encoding is
> iso-8859-1, or something similar - should we not be able to figure this out?
> Especially given the hint of lack-of-CMaps, rather than garbled CMaps?

A PDF text stream often contains glyph indices (from a subset font) and
positions. The glyph indices are arbitrary and differ from font subset to font
subset. A PDF font subset contains at most 256 glyphs. When there is no
ToUnicode CMap for a given font, PDF tools will assume the glyph indices are
codepoints and will try to use them for text extraction, and that is the garbled
text you are seeing. For example, what appears (when looking at the PDF) to be
the string “Nr.” (at the top left corner) is encoded in the PDF as:

<002200CF>41.1893<00BD>

The hex numbers are glyph IDs and the decimal number is kerning. The hex
numbers mean glyph index 34 (0x0022), glyph index 207 (0x00CF), and glyph index
189 (0x00BD).
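
As a rough illustration, the string above can be split into glyph indices and
kerning values with a few lines of Python. This is only a sketch: it assumes
two-byte glyph codes (four hex digits each), as in this particular string, and
the function name parse_text_array is made up for the example. It is not how
any real PDF parser is structured.

```python
import re

def parse_text_array(operand):
    """Split a simplified text-showing operand into glyph IDs and kerning.

    Assumption: every hex string uses two bytes (four hex digits) per glyph.
    """
    items = []
    pattern = r"<([0-9A-Fa-f]+)>|(-?\d+(?:\.\d+)?)"
    for hex_run, number in re.findall(pattern, operand):
        if hex_run:
            # Cut the hex run into four-digit chunks, one per glyph.
            for i in range(0, len(hex_run), 4):
                items.append(("glyph", int(hex_run[i:i + 4], 16)))
        else:
            items.append(("kern", float(number)))
    return items

print(parse_text_array("<002200CF>41.1893<00BD>"))
# [('glyph', 34), ('glyph', 207), ('kern', 41.1893), ('glyph', 189)]
```

Nothing in this output says which characters the glyphs stand for; the numbers
are only positions in the embedded font subset.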

If the font had a ToUnicode CMap, it would have mapped 0x0022 to “N”, 0x00CF to
“r”, and 0x00BD to “.”, but there isn’t one, and when interpreting these numbers
as codepoints we get:

"Ï½

This makes no sense. There is no text encoding where “"Ï½” is “Nr.”, and
even if there were one it would be pure coincidence, and the next string or the
next font would still be broken.
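
To make the difference concrete, here is a small Python sketch of the two
interpretations of the same three glyph IDs. The to_unicode dictionary is
hypothetical: it stands in for what a ToUnicode CMap would have said for this
font subset, had one existed. It is not present in the PDF.

```python
glyph_ids = [0x0022, 0x00CF, 0x00BD]

# Fallback when there is no ToUnicode CMap: treat glyph IDs as codepoints.
as_codepoints = "".join(chr(g) for g in glyph_ids)

# Hypothetical mapping for this one font subset (not present in the PDF).
to_unicode = {0x0022: "N", 0x00CF: "r", 0x00BD: "."}
with_cmap = "".join(to_unicode[g] for g in glyph_ids)

print(as_codepoints)  # "Ï½
print(with_cmap)      # Nr.
```

Since the glyph IDs are assigned per font subset, a table like to_unicode
recovered by hand for one font would not carry over to the next one.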

