[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865331#comment-17865331 ]
ragebear commented on TIKA-4277:
--------------------------------

Thanks, very helpful.

It would be great to add the above to [PDFParser (Apache PDFBox) - TIKA - Apache Software Foundation|https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)].

The following titles were there, but the solution is NOT listed.
h3. No Text
h3. Mildly Garbled Text
h3. Completely Garbled Text
h3. No spaces/Extra spaces

See 1c. above. Depending on how the PDF was generated, it is possible that it doesn't store actual space characters. Rather, software has to use coordinates on the page plus matrix algebra plus font information about the width of characters to "impute" where spaces would be. The math is the simple part; sometimes there can be missing or wrong font information that can lead to no spaces or extra spaces.
h3. {color:#FF0000}Word/Line breaks in the middle of my text ?!{color}
h3. Character Encoding/Unicode Mappings


> PDF parse issue for text rotated
> --------------------------------
>
>                 Key: TIKA-4277
>                 URL: https://issues.apache.org/jira/browse/TIKA-4277
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-server
>    Affects Versions: 3.0.0-BETA, 2.9.2
>            Reporter: ragebear
>            Priority: Major
>              Labels: config.xml
>         Attachments: OtherPDFReader.png, sample2.pdf
>
>
> The incorrect result parsed by Tika and Tika Server 2.9.2 and 3.0beta
> The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in
> either the server version or the standalone.
> If the text is rotated 90 degrees, the parsed result has a line break after
> each letter of a word. It happens to symbols, English letters, and CJK
> characters.
> In the server version, curl -g -T "sample2.pdf" http://localhost:889/tika
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar"
> --text
> Both of the above deliver the incorrect result in the attached PDF.
> The output result is below
> i
> n
> s
> e
> r
> t
>
> t
> e
> x
> t
>
> p
> r
> o
> b
> l
> e
> m
> insert text problem



--
This message was sent by Atlassian Jira
(v8.20.10#820010)