[ https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486646#comment-16486646 ]
Luis Filipe Nassif commented on TIKA-2646:
------------------------------------------

It does not maintain table structures, but have you tried enabling the sortByPosition param in the Tika config or in PDFParserConfig?

> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2646
>                 URL: https://issues.apache.org/jira/browse/TIKA-2646
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.18
>         Environment: MacOS Sierra 10.12.6
>            Reporter: Annie Didier
>            Priority: Trivial
>              Labels: performance
>
> When text from a table is extracted, sometimes the order of the cells becomes
> mixed and the words get concatenated together. For example:
>
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>
> In other more serious cases, the text within a cell becomes scrambled with
> text from another cell. Such as:
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK /
> TESTING|E - RIG OUT
> TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>
> TESTERS
>
> 33 P -
>
> FLOWBACK /
>
> TESTING
> Note that the value of the second column has moved to the first column, and
> the "-" within the first column is misordered. The last two columns have
> switched places.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
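
For reference, the sortByPosition suggestion in the comment could be tried with a tika-config.xml along the following lines. This is a minimal sketch that assumes the parser-parameter syntax described in the Tika PDFParser configuration documentation; the comment itself does not show a config, so check the exact layout against the installed Tika version. As the comment notes, this only affects the order in which text is emitted and does not reconstruct table structure.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes the <params> syntax supported by the PDFParser
     in Tika 1.x; verify against the version in use. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but exclude the stock PDFParser so the
         explicitly configured one below handles PDFs. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Emit text ordered by its position on the page rather than by
             its order in the PDF content stream (default is false). -->
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

When calling Tika from Java instead, the equivalent setting is `PDFParserConfig.setSortByPosition(true)`, with the config object passed to the parser through the `ParseContext`.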