[ https://issues.apache.org/jira/browse/PDFBOX-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-5875:
------------------------------------
Fix Version/s: (was: 3.0.4 PDFBox)
> using font data to process ligatures
> ------------------------------------
>
> Key: PDFBOX-5875
> URL: https://issues.apache.org/jira/browse/PDFBOX-5875
> Project: PDFBox
> Issue Type: New Feature
> Components: Parsing, PDModel, Text extraction
> Affects Versions: 3.0.3 PDFBox
> Reporter: Manish S N
> Priority: Major
> Labels: Asian, CIDFont, font, ligatures, unicodemapping
> Attachments: page.pdf
>
>
> Process ligatures in Asian-language text (where one glyph represents a
> combination of two or more Unicode characters) using the data in embedded
> fonts.
>
> *The problem:*
> Modern PDF creators currently put these ligatures in the /ActualText field,
> which we only recently considered supporting in this issue. But this is not
> the case in old PDFs with embedded CID fonts like [^page.pdf], where the
> ligature glyphs lack a /ToUnicode character mapping because there is no
> single Unicode code point for them: each is a combination of more than one
> Unicode character.
>
> *The Potential Solution (if not perfect):*
> I managed to extract the font files using PDFBox
> ([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java])
> and when I viewed the font files in FontForge, I found the ligature data
> intact in them. So we can use this data to map each ligature glyph to the
> Unicode characters of its constituent glyphs.
>
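The mapping proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not PDFBox code: the class `LigatureSketch`, the method `resolve`, the two maps, and the glyph IDs are all hypothetical. It assumes the font's cmap has already been inverted into a glyph-ID-to-Unicode map and that the ligature data (for example, from an OpenType GSUB ligature substitution) has been read into a map from each ligature glyph ID to its component glyph IDs.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in PDFBox.
class LigatureSketch {

    /**
     * Returns the Unicode text for a glyph, or null if it cannot be resolved.
     * glyphToUnicode: assumed reverse cmap (glyph ID -> Unicode string).
     * ligatureComponents: assumed ligature data (ligature glyph ID -> component glyph IDs).
     */
    static String resolve(int glyphId,
                          Map<Integer, String> glyphToUnicode,
                          Map<Integer, List<Integer>> ligatureComponents) {
        // A glyph with a direct mapping needs no ligature lookup.
        String direct = glyphToUnicode.get(glyphId);
        if (direct != null) {
            return direct;
        }
        List<Integer> components = ligatureComponents.get(glyphId);
        if (components == null) {
            return null; // not a known ligature
        }
        // Concatenate the Unicode text of each component, in order.
        StringBuilder text = new StringBuilder();
        for (Integer component : components) {
            String part = glyphToUnicode.get(component);
            if (part == null) {
                return null; // a component has no Unicode mapping
            }
            text.append(part);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up glyph IDs for the Tamil letters U+0B95, U+0BCD and U+0BB7.
        Map<Integer, String> glyphToUnicode = Map.of(
                10, "\u0B95",
                11, "\u0BCD",
                12, "\u0BB7");
        // Made-up ligature glyph 200 built from glyphs 10, 11 and 12.
        Map<Integer, List<Integer>> ligatureComponents =
                Map.of(200, List.of(10, 11, 12));
        System.out.println(resolve(200, glyphToUnicode, ligatureComponents));
    }
}
```

Returning null when any component is unmapped keeps this version strict; whether a real implementation should be strict or lenient is an open design question.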
> *Problems:*
> In some cases the constituent glyphs may not be present in the cmap at all,
> having been removed by a PDF optimiser because they are never used directly
> in the PDF, only within ligatures. Such glyphs are empty, with only a glyph
> ID and no /ToUnicode mapping, even if the glyph has a corresponding Unicode
> character.
>
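One possible way to degrade gracefully when a component glyph has no mapping is to emit a placeholder and record the gap, so that a later pass (a spell checker, for instance) can attempt a repair. The sketch below is an assumption-laden illustration, not PDFBox code: `PartialLigatureSketch`, `resolveLeniently`, and the choice of U+FFFD as the placeholder are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in PDFBox.
class PartialLigatureSketch {

    // U+FFFD REPLACEMENT CHARACTER marks a component that could not be mapped.
    static final String REPLACEMENT = "\uFFFD";

    /**
     * Builds the text for a ligature from its component glyph IDs.
     * Components missing from glyphToUnicode (an assumed reverse cmap) are
     * replaced by U+FFFD and their glyph IDs are appended to unmapped.
     */
    static String resolveLeniently(List<Integer> components,
                                   Map<Integer, String> glyphToUnicode,
                                   List<Integer> unmapped) {
        StringBuilder text = new StringBuilder();
        for (Integer component : components) {
            String part = glyphToUnicode.get(component);
            if (part == null) {
                unmapped.add(component);
                text.append(REPLACEMENT);
            } else {
                text.append(part);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up glyph IDs; glyph 99 stands for a component stripped from the cmap.
        Map<Integer, String> glyphToUnicode = Map.of(10, "\u0B95", 12, "\u0BB7");
        List<Integer> unmapped = new ArrayList<>();
        String text = resolveLeniently(List.of(10, 99, 12), glyphToUnicode, unmapped);
        System.out.println(text + " unmapped glyph IDs: " + unmapped);
    }
}
```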
> *The Hope:*
> This is not a common problem in large PDFs, and basic spell checkers could
> easily rectify it. Some comprehension is better than no comprehension when
> it comes to dealing with data. This will greatly enhance the parsing of
> non-Latin Asian languages.
>
> (The PDF sample I attached is in Tamil.)