[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833745#comment-17833745 ]
Tim Allison edited comment on TIKA-4231 at 4/3/24 9:18 PM:
-----------------------------------------------------------

On some PDFs, there can be problems with Unicode mappings and other glyph/font issues. For some of these files, they render well but the underlying electronic text is junk. In those cases, OCR is the best option. I haven’t looked at this pdf and don’t know if the above is the case for this one.


was (Author: talli...@mitre.org):
On some PDFs, there can be problems with Unicode mappings and other glyph issues. For some of these files, they render well but the underlying electronic text is junk. In those cases, OCR is the best option. I haven’t looked at this pdf and don’t know if the above is the case for this one.

> Parsing Arabic PDF is returning bad data
> ----------------------------------------
>
>                 Key: TIKA-4231
>                 URL: https://issues.apache.org/jira/browse/TIKA-4231
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 2.9.1
>         Environment: I am using Java 18. And using maven dependency
> tika-parsers-standard-package
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
>
>            Reporter: Aamir
>            Priority: Major
>         Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it.
> When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish
> characters.
> The generated text doc is also attached which contains the parsed text.
> Most of the other Arabic PDFs parse fine, but this one is giving this output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
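
Regarding the comment above that some PDFs render well while their underlying electronic text is junk: one rough way to decide whether a given file should go to OCR is to check how much of the extracted text is actually in the Arabic script. The following is only a minimal sketch under stated assumptions. It is not part of Tika; the class name `ArabicTextCheck`, the method names, and the 0.5 threshold are all hypothetical and would need tuning against real documents. It also assumes the document is expected to be mostly Arabic.

```java
public class ArabicTextCheck {

    /**
     * Returns the fraction of letters in the text that belong to the
     * Arabic script, or 0.0 if the text contains no letters at all.
     * Digits, punctuation, whitespace and combining marks are ignored.
     */
    public static double arabicLetterRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    /**
     * Hypothetical heuristic: treat the extracted text as junk (and so a
     * candidate for OCR) when fewer than minArabicRatio of its letters
     * are Arabic. Empty text also counts as a candidate for OCR.
     */
    public static boolean needsOcr(String extractedText, double minArabicRatio) {
        return arabicLetterRatio(extractedText) < minArabicRatio;
    }

    public static void main(String[] args) {
        // "\u0645\u0631\u062D\u0628\u0627" is the Arabic word "marhaba",
        // written with escapes so the source file stays ASCII-only.
        String good = "\u0645\u0631\u062D\u0628\u0627";
        // Latin-1 letters standing in for text with broken Unicode mappings.
        String junk = "\u00DB\u00E6\u00E4\u00C7 \u00CA\u00CC\u00D1";

        System.out.println("good text needs OCR: " + needsOcr(good, 0.5));
        System.out.println("junk text needs OCR: " + needsOcr(junk, 0.5));
    }
}
```

The check is deliberately crude: it only looks at script membership, so text that is in the Arabic script but in the wrong order or with wrong glyph mappings would still pass. How OCR is then enabled for the flagged files is outside this sketch.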