[ https://issues.apache.org/jira/browse/TIKA-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-392:
------------------------------------
Attachment: TIKA-392-tests.patch
Attached a patch with 3 new test cases (all passing); I think it should be
committable now.
First case (testHexEscapingInsideWord) is from Thiago's example above.
Second case (testRTFTableCellSeparation2) just adds the original RTF from the
opening description of this issue.
Third case (testWindowsCodePage1250) is from TIKA-422; I think we never
committed the original example from that issue.
> RTF parser smashes words together in subsequent table cells
> -----------------------------------------------------------
>
> Key: TIKA-392
> URL: https://issues.apache.org/jira/browse/TIKA-392
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.7
>
> Attachments: TIKA-392-tests.patch
>
>
> I have an RTF document with the following snippet of content (it's an export
> of a private phone book so I can't share the full document):
> {\rtlch\fcs1 \af0\afs24 \ltrch\fcs0
> \f0\fs24\lang2055\langfe2055\langfenp2055\insrsid9461491\charrsid9461491 Fax
> / Phone Station\cell Fax / Phone #\cell }
> The extracted text is:
> Fax / Phone StationFax / Phone
> Note how the cell boundary between "Station" and "Fax" is lost.
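For illustration, the behavior described in the issue can be reproduced with a
small standalone sketch. This is not Tika's RTF parser and not the committed
fix; the class name CellBoundarySketch, the method extractText, and the choice
of a single space as the separator are all assumptions made for this example.
It strips control words with a regular expression and emits a separator
whenever it sees \cell, \row or \par, so text from adjacent cells is not
joined. Hex escapes, Unicode escapes and nested destinations are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy extractor, only meant to illustrate the \cell boundary issue.
final class CellBoundarySketch {
    // One RTF control word: backslash, lowercase letters, optional numeric
    // parameter, and the single optional space that terminates the word.
    private static final Pattern CONTROL_WORD =
            Pattern.compile("\\\\([a-z]+)(-?\\d+)? ?");

    static String extractText(String rtf) {
        StringBuilder out = new StringBuilder();
        Matcher m = CONTROL_WORD.matcher(rtf);
        int pos = 0;
        while (m.find()) {
            appendText(out, rtf, pos, m.start());
            String word = m.group(1);
            // Assumption: a space is an acceptable separator for these boundaries.
            if (word.equals("cell") || word.equals("row") || word.equals("par")) {
                out.append(' ');
            }
            pos = m.end();
        }
        appendText(out, rtf, pos, rtf.length());
        return out.toString().trim();
    }

    // Copies plain text, skipping group braces and raw line breaks,
    // which carry no text content in RTF.
    private static void appendText(StringBuilder out, String rtf, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = rtf.charAt(i);
            if (c != '{' && c != '}' && c != '\r' && c != '\n') {
                out.append(c);
            }
        }
    }

    public static void main(String[] args) {
        // The snippet from the issue description, joined onto one line.
        String rtf = "{\\rtlch\\fcs1 \\af0\\afs24 \\ltrch\\fcs0 "
                + "\\f0\\fs24\\lang2055\\langfe2055\\langfenp2055"
                + "\\insrsid9461491\\charrsid9461491 "
                + "Fax / Phone Station\\cell Fax / Phone #\\cell }";
        System.out.println(extractText(rtf));
    }
}
```

If the branch that appends the separator is removed, the sketch joins the two
cells into "StationFax", which matches the symptom reported above.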
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira