https://bugs.documentfoundation.org/show_bug.cgi?id=59870
--- Comment #27 from Khaled Hosny <[email protected]> ---
(In reply to Eyal Rozenberg from comment #26)
> (In reply to Khaled Hosny from comment #25)
> > The PDF metadata shows that it was produce by Ghostscript. The PDF font
> > dictionaries contain no ToUnicode CMaps, nor do they use any standard PDF
> > font encoding. As such there are no much textual data that can be extracted
> > from the PDF.
>
> But the text is _there_... I'm no PDF expert (nor even have a decent tool
> for exploring PDF files' raw structure), but - if the encoding is
> iso-8859-1, or something similar - should we not be able to figure this out?
> Especially given the hint of lack-of-CMaps, rather than jarbled CMaps?

A PDF text stream often contains glyph indices (from a subset font) and
positions. The glyph indices are arbitrary and differ from font subset to
font subset. A PDF font subset contains at most 256 glyphs.

When there is no ToUnicode CMap for a given font, PDF tools will assume the
glyph indices are codepoints and will try to use them for text extraction,
and that is the garbled text you are seeing.

For example, the string that appears to be “Nr.” when looking at the PDF (at
the top left corner) is encoded in the PDF as:

<002200CF>41.1893<00BD>

The hex numbers are glyph IDs and the decimal number is kerning. The hex
numbers mean glyph index 34 (0x0022), glyph index 207 (0x00CF), and glyph
index 189 (0x00BD). If the font had a ToUnicode CMap, it would have mapped
0x0022 to “N”, 0x00CF to “r”, and 0x00BD to “.”, but there isn't one, and
when interpreting these numbers as codepoints we get:

"Ͻ

which just makes no sense. There is no text encoding where “"Ͻ” is “Nr.”,
and even if there were one, it would be pure coincidence, and the next string
or the next font would be broken.
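To make the example concrete, here is a minimal Python sketch of the two extraction paths. It is only an illustration under stated assumptions: `parse_codes`, `extract_text`, and `SUBSET_TO_UNICODE` are hypothetical names, not part of any PDF library, and the mapping table is the one a ToUnicode CMap for this subset would have carried had the file included one. Read directly as codepoints, the three codes give `"Ï½`; the `"Ͻ` quoted above is what the same three byte values look like when decoded as UTF-8, which may be an artifact of how the message was transmitted.

```python
import re

def parse_codes(operand):
    """Collect the 2-byte codes from the <...> hex strings of a text-showing
    operand, ignoring the decimal kerning adjustments between them."""
    codes = []
    for hex_run in re.findall(r"<([0-9A-Fa-f]+)>", operand):
        # Each glyph code is four hex digits (two bytes) in this example.
        for i in range(0, len(hex_run), 4):
            codes.append(int(hex_run[i:i + 4], 16))
    return codes

def extract_text(codes, to_unicode=None):
    """Turn glyph codes into text.

    Without a mapping, fall back to treating each code as a Unicode
    codepoint, which is the guess that produces the garbled output.
    """
    if to_unicode is None:
        return "".join(chr(code) for code in codes)
    # Unknown codes become U+FFFD rather than a misleading character.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

# Hypothetical table: what a ToUnicode CMap for this subset would contain.
SUBSET_TO_UNICODE = {0x0022: "N", 0x00CF: "r", 0x00BD: "."}

codes = parse_codes("<002200CF>41.1893<00BD>")
print(codes)                                   # glyph indices 34, 207, 189
print(extract_text(codes))                     # codepoint guess: garbled
print(extract_text(codes, SUBSET_TO_UNICODE))  # with the mapping: Nr.
```

The table is valid for this one font subset only; another subset would assign the same glyphs different indices, which is why no fixed text encoding can stand in for the missing CMap.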
