ragebear created TIKA-4277:
------------------------------

             Summary: PDF parse issue for text rotated
                 Key: TIKA-4277
                 URL: https://issues.apache.org/jira/browse/TIKA-4277
             Project: Tika
          Issue Type: Bug
          Components: tika-app, tika-server
    Affects Versions: 2.9.2, 3.0.0-BETA
            Reporter: ragebear
         Attachments: sample2.pdf

Tika and Tika Server 2.9.2 and 3.0beta return an incorrect parse result.

The attached PDF cannot be parsed correctly by Tika 2.9.2 or 3.0beta, in either 
the server version or the standalone version.

If the text is rotated 90 degrees, the parsed result has a line break after each 
letter of a word. This happens with symbols, English letters, and CJK characters.

In the server version:
curl -g -T "sample2.pdf" http://localhost:889/tika --header "Accept: text/plain"

In the standalone version:
java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" --text

Both commands produce the incorrect result for the attached PDF.

The output is shown below:

i
n
s
e
r
t
 
t
e
x
t
 
p
r
o
b
l
e
m

insert text problem
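
The symptom above (one letter per line) can be checked automatically when testing other files or versions. The following is a minimal sketch in Python, not part of Tika: the function names and the 0.8 threshold are assumptions chosen for illustration, and fetch_text simply mirrors the curl command above by sending the file with an HTTP PUT and an "Accept: text/plain" header.

```python
import urllib.request


def fetch_text(pdf_path, url):
    """PUT the file to a tika-server endpoint (as `curl -T` does) and return the text."""
    with open(pdf_path, "rb") as fh:
        data = fh.read()
    req = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def single_char_line_ratio(text):
    """Fraction of non-blank lines that contain exactly one character."""
    non_blank = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not non_blank:
        return 0.0
    return sum(1 for ln in non_blank if len(ln) == 1) / len(non_blank)


def looks_split_per_letter(text, threshold=0.8):
    """Heuristic (threshold is an assumption): most lines hold a single character."""
    return single_char_line_ratio(text) >= threshold
```

For example, looks_split_per_letter(fetch_text("sample2.pdf", "http://localhost:889/tika")) would flag the output shown in this report, using the URL from the curl command above. The heuristic only detects the symptom; it does not explain or fix the cause.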



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
