[ https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484465#comment-16484465 ]

Tim Allison commented on TIKA-2646:
-----------------------------------

Yes, it might be painful to coordinate tabula and our PDFBox parser to get
both tables and content, not only technically (not too bad) but also with
release versions. I'm not sure they upgrade as quickly as we do, although they
might.

So, please do reopen if you think we should try this...

> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2646
>                 URL: https://issues.apache.org/jira/browse/TIKA-2646
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.18
>         Environment: MacOS Sierra 10.12.6
>            Reporter: Annie Didier
>            Priority: Trivial
>              Labels: performance
>
> When text from a table is extracted, sometimes the order of the cells becomes 
> mixed and the words get concatenated together. For example:
>  
> ||HOURS||DUR (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>  
> In other, more serious cases, the text within a cell becomes scrambled with
> text from another cell, such as:
> ||HOURS||DUR (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / TESTING|E - RIG OUT TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>
> TESTERS
>
> 33 P -
>
> FLOWBACK /
>
> TESTING
> Note that the value of the second column has moved to the first column, and 
> the "-" within the first column is misordered. The last two columns have 
> switched places.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
