ragebear created TIKA-4277:
------------------------------

             Summary: PDF parse issue for text rotated
                 Key: TIKA-4277
                 URL: https://issues.apache.org/jira/browse/TIKA-4277
             Project: Tika
          Issue Type: Bug
          Components: tika-app, tika-server
    Affects Versions: 2.9.2, 3.0.0-BETA
            Reporter: ragebear
         Attachments: sample2.pdf

Incorrect result parsed by Tika and Tika Server 2.9.2 and 3.0beta

The attached PDF cannot be parsed correctly by Tika 2.9.2 or 3.0beta, in either the server or the standalone version, when the text is rotated by 90 degrees. The parsed result has a line break after each letter of a word. This happens with symbols, English letters, and CJK characters.

In the server version:

curl -g -T "sample2.pdf" http://localhost:889/tika --header "Accept: text/plain"

In the standalone version:

java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" --text

Both of the above produce the incorrect result for the attached PDF. The output is below:

i n s e r t t e x t p r o b l e m
insert text problem

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
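
For anyone who wants to check extracted text for the symptom reported above (a line break after each letter), here is a minimal Python sketch. It is not part of Tika: the function names and the 0.5 threshold are hypothetical, chosen only for illustration. It measures what fraction of the non-empty lines consist of a single character.

```python
def single_char_line_ratio(text: str) -> float:
    """Return the fraction of non-empty lines that hold exactly one character."""
    # Strip surrounding whitespace so "i " and " i" both count as one character.
    non_empty = [line.strip() for line in text.splitlines() if line.strip()]
    if not non_empty:
        return 0.0
    single = sum(1 for line in non_empty if len(line) == 1)
    return single / len(non_empty)


def looks_letter_per_line(text: str, threshold: float = 0.5) -> bool:
    """Heuristic (hypothetical threshold): True when at least `threshold`
    of the non-empty lines are a single character."""
    return single_char_line_ratio(text) >= threshold
```

The check counts code points, so a single CJK character on a line is treated the same way as a single English letter or symbol. It is only a heuristic: very short inputs or genuine one-character lines can trigger it.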