[jira] [Commented] (TIKA-4277) PDF parse issue for text rotated

ragebear (Jira) Fri, 12 Jul 2024 00:00:10 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865331#comment-17865331
 ]


ragebear commented on TIKA-4277:
--------------------------------

Thanks, very helpful. It would be great to add the above to the wiki page 
[PDFParser (Apache PDFBox) - TIKA - Apache Software Foundation
|https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)].

The following headings are already on that page, but the solution is NOT listed under any of them.
h3. No Text
h3. Mildly Garbled Text
h3. Completely Garbled Text
h3. No spaces/Extra spaces

See 1c. above. Depending on how the PDF was generated, it is possible that it 
doesn't store actual space characters. Rather, software has to use coordinates 
on the page plus matrix algebra plus font information about the width of 
characters to "impute" where spaces would be. The math is the simple part; 
missing or wrong font information is what can lead to no spaces or extra 
spaces.
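
To illustrate the idea, here is a minimal sketch of gap-based space imputation. It is not PDFBox's or Tika's actual algorithm; the names Glyph, impute_spaces and gap_ratio are hypothetical, and the threshold is an arbitrary assumption. It only shows why a wrong character width from the font turns directly into missing or extra spaces.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page (hypothetical units)
    width: float  # advance width as reported by the font


def impute_spaces(glyphs, gap_ratio=0.3):
    """Join the glyphs of one line, adding a space where the gap is large.

    gap_ratio is an assumed threshold: a gap wider than this fraction of
    the previous glyph's width is treated as a word boundary.
    """
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


def make_line(width):
    # The same five glyph positions, with a configurable reported width.
    xs = [10.0, 15.0, 24.0, 29.0, 34.0]
    return [Glyph(c, x, width) for c, x in zip("nogap", xs)]


print(impute_spaces(make_line(5.0)))  # correct widths
print(impute_spaces(make_line(0.0)))  # widths missing (reported as 0)
print(impute_spaces(make_line(9.0)))  # widths overstated
```

With correct widths the sketch yields "no gap"; with widths reported as 0 every gap looks like a word boundary ("n o g a p"); with overstated widths the real boundary disappears ("nogap").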
h3. {color:#FF0000}Word/Line breaks in the middle of my text ?!{color}
h3. Character Encoding/Unicode Mappings

> PDF parse issue for text rotated
> --------------------------------
>
>                 Key: TIKA-4277
>                 URL: https://issues.apache.org/jira/browse/TIKA-4277
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-server
>    Affects Versions: 3.0.0-BETA, 2.9.2
>            Reporter: ragebear
>            Priority: Major
>              Labels: config.xml
>         Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta parse the attached PDF incorrectly.
> The problem occurs in both the server version and the standalone app.
> If the text is rotated 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version, curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem



