[jira] [Updated] (TIKA-4277) PDF parse issue for text rotated

ragebear (Jira) Thu, 11 Jul 2024 04:52:05 -0700


     [ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ragebear updated TIKA-4277:
---------------------------
    Attachment: OtherPDFReader.png

> PDF parse issue for text rotated
> --------------------------------
>
>                 Key: TIKA-4277
>                 URL: https://issues.apache.org/jira/browse/TIKA-4277
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-server
>    Affects Versions: 3.0.0-BETA, 2.9.2
>            Reporter: ragebear
>            Priority: Major
>         Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0.0-BETA produce incorrect output for 
> rotated text.
> The attached PDF (sample2.pdf) cannot be parsed correctly by Tika 2.9.2 or 
> 3.0.0-BETA, in either the server or the standalone version.
> When text is rotated 90 degrees, the parsed result has a line break after 
> each character of a word. This happens with symbols, English letters, and 
> CJK characters.
> In the server version:
> curl -g -T "sample2.pdf" http://localhost:889/tika --header "Accept: text/plain"
> In the standalone version:
> java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" --text
> Both of the above produce the incorrect result for the attached PDF.
> The output is shown below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem
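
For anyone triaging similar reports, the symptom above (one character per line) can be checked for mechanically. The following is a minimal sketch, not part of Tika: the function names `looks_fragmented` and `rejoin_fragmented` and the 0.8 threshold are hypothetical, and the rejoin step is a lossy workaround that assumes whitespace-only lines mark word boundaries, as they appear to in the output above.

```python
def looks_fragmented(text: str, threshold: float = 0.8) -> bool:
    """Return True when most non-empty lines hold a single character.

    The threshold is an arbitrary assumption, not a value taken from Tika.
    """
    non_empty = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(non_empty) < 2:
        return False
    single = sum(1 for ln in non_empty if len(ln) == 1)
    return single / len(non_empty) >= threshold


def rejoin_fragmented(text: str) -> str:
    """Join runs of one-character lines back into a single line.

    A blank or whitespace-only line inside a run is treated as a space.
    Lines longer than one character are passed through unchanged.
    """
    out, run = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) <= 1:
            # Single character, or a whitespace-only line standing in for a space.
            run.append(stripped or " ")
        else:
            if run:
                out.append("".join(run).strip())
                run = []
            out.append(line)
    if run:
        out.append("".join(run).strip())
    return "\n".join(out)


if __name__ == "__main__":
    # Rebuild the fragmented output shown in this report.
    sample = "\n".join("insert text problem")
    print(looks_fragmented(sample))
    print(rejoin_fragmented(sample))
```

This only post-processes extracted text; it does not address why the rotated text is split during parsing.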



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
