[ 
https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486646#comment-16486646
 ] 

Luis Filipe Nassif commented on TIKA-2646:
------------------------------------------

Tika does not preserve table structure, but have you tried enabling the 
sortByPosition parameter, either in the Tika config or in PDFParserConfig?
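
To illustrate what sorting by position can and cannot fix, here is a toy
sketch in plain Python. It is not Tika's or PDFBox's actual algorithm: the
Fragment type, the sort_by_position helper, the line_tolerance value and all
coordinates are hypothetical, made up to mirror the row quoted in the issue.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A piece of text with an assumed page position (hypothetical model)."""
    text: str
    x: float  # left edge of the fragment
    y: float  # distance from the top of the page


def sort_by_position(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into visual lines by y, then order each line by x."""
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # A fragment joins the current line if it sits at roughly the same height.
        if lines and abs(frag.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


# Assumed content-stream order: not the visual order of the table row.
stream = [
    Fragment("17.00", 120, 100),
    Fragment("00:00 - 17:00", 20, 100),
    Fragment("FLOWBK", 200, 100),
    Fragment("E - RIG OUT", 400, 100),
    Fragment("TESTERS", 400, 112),
    Fragment("33 P - FLOWBACK /", 300, 100),
    Fragment("TESTING", 300, 112),
]

for line in sort_by_position(stream):
    print(line)
```

In this sketch the columns come out left to right, but the wrapped second
lines of the last two cells still land together on a line of their own,
because the text is ordered by visual line rather than by table cell.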

> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2646
>                 URL: https://issues.apache.org/jira/browse/TIKA-2646
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.18
>         Environment: MacOS Sierra 10.12.6
>            Reporter: Annie Didier
>            Priority: Trivial
>              Labels: performance
>
> When text from a table is extracted, sometimes the order of the cells becomes 
> mixed and the words get concatenated together. For example:
>  
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>  
> In other, more serious cases, the text within a cell becomes scrambled with 
> text from another cell, such as:
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / 
> TESTING|E - RIG OUT
> TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>  
> TESTERS
>  
> 33 P -
>  
> FLOWBACK /
>  
> TESTING
> Note that the value of the second column has moved to the first column, and 
> the "-" within the first column is misordered. The last two columns have 
> switched places.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
