[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874491#comment-17874491 ]
Tilman Hausherr commented on TIKA-3858:
---------------------------------------

Fixed in PDFBOX-5868.

> Ligatures convert on text extraction
> -------------------------------------
>
>                 Key: TIKA-3858
>                 URL: https://issues.apache.org/jira/browse/TIKA-3858
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.1
>         Environment: win 8, jre 1.5
>            Reporter: tom hill
>            Priority: Major
>              Labels: ActualText
>         Attachments: TikaChromeInboxLigature.pdf
>
>
> It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark.
> As a particular example, the ft ligature is getting replaced by utf-8: ef bf bd
> Is there any new resolution on this issue? Just returning the fl ligature would be great, or normalizing it to f, t.
> This particular example comes from saving my gmail inbox page as a pdf, in chrome. It uses the ft ligature in the word "Drafts".
> There are many similar examples, it's not specific to one pdf generator.
> I'm using tika-app-2.4.1.jar

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
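
For readers following the report: the bytes ef bf bd are the UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER), which is why the ligature shows up as a question mark. The sketch below is only an illustration using the JDK, not the PDFBOX-5868 fix and not Tika or PDFBox API. The class name LigatureSketch and both helper methods are hypothetical. It shows that NFKC normalization expands a ligature that has its own Unicode code point (U+FB02, fl), and that it cannot recover text once the extractor has already emitted U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class for illustration; not part of Tika or PDFBox.
class LigatureSketch {

    // NFKC compatibility normalization expands ligature code points
    // such as U+FB02 (fl) into their component letters.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // True if the extracted text contains U+FFFD, i.e. a glyph that
    // could not be mapped to a character during extraction.
    static boolean containsReplacementChar(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // The three bytes quoted in the report.
        byte[] reported = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};
        String decoded = new String(reported, StandardCharsets.UTF_8);

        System.out.println(containsReplacementChar(decoded)); // true
        System.out.println(expandLigatures("\uFB02ow"));      // flow

        // Normalization leaves U+FFFD untouched: the original letters
        // are already lost at this point.
        System.out.println(expandLigatures(decoded).equals(decoded)); // true
    }
}
```

A post-processing step like this would only help for ligatures that arrive as real Unicode code points. A glyph with no usable Unicode mapping in the PDF has to be resolved during extraction itself.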