Alex Andrushchak created TIKA-1289:
--------------------------------------

Summary: Ligatures convert on text extraction
Key: TIKA-1289
URL: https://issues.apache.org/jira/browse/TIKA-1289
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Environment: win 8, jre 1.5
Reporter: Alex Andrushchak

From reviewing the Tika sources, I see that Tika uses PDFBox to parse PDF files. I found that PDFBox itself uses ICU4J to handle ligatures. Unfortunately, when I added the icu4j jar to my classpath, nothing changed: ligatures are still not converted in the extracted text.

A sample PDF file is attached.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
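
For anyone reproducing this, the expected result can be illustrated with a minimal sketch that expands ligature code points after extraction. It relies only on the JDK's `java.text.Normalizer` and is not what Tika or PDFBox do internally; the class name `LigatureSketch`, the method `expandLigatures`, and the sample string are hypothetical and used only for illustration. Note that NFKC normalization is an assumption made here as a possible workaround, and it also folds other compatibility characters (for example full-width forms), so it may change more than just ligatures.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Tika or PDFBox.
final class LigatureSketch {
    private LigatureSketch() {}

    /**
     * Applies Unicode NFKC normalization, which replaces compatibility
     * characters such as U+FB01 (fi ligature) with their plain-letter
     * equivalents.
     */
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumed sample: text as it might come out of extraction,
        // containing U+FB01 (fi) and U+FB03 (ffi) ligature code points.
        String extracted = "\uFB01le of\uFB03ce";
        System.out.println(expandLigatures(extracted));
    }
}
```

If the extracted text still contains the single ligature code points, a call like this would turn them into ordinary letters, which is the output I expected from the extraction itself.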