[ https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424866#comment-13424866 ]
Markus Jelsma edited comment on TIKA-961 at 7/30/12 2:03 PM:
-------------------------------------------------------------

Patch for 1.3 that adds ignorableWhitespace if the last character is not whitespace. This is a problem for languages that do not separate words with whitespace and that have, for example, an anchor embedded in a word. This problem also exists in the current non-Boilerpipe HTML content handler. The problem with an element embedded in a string does not occur if markup is not included, but I couldn't figure out how to replicate Boilerpipe's proper handling of these edge cases.

> No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)
> ----------------------------------------------------------------------
>
>                 Key: TIKA-961
>                 URL: https://issues.apache.org/jira/browse/TIKA-961
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Markus Jelsma
>             Fix For: 1.3
>
>         Attachments: TIKA-961-1.3-1.patch
>
>
> ignorableWhitespace is not properly added when using the
> BoilerpipeContentHandler and if markup is included.
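
For readers without the attachment at hand, the idea described in the comment can be sketched roughly as follows. This is not the contents of TIKA-961-1.3-1.patch: the class names WhitespaceGuardHandler, TextCollector and WhitespaceGuardDemo are hypothetical, and the sketch uses only the plain org.xml.sax API from the JDK rather than Tika's own handler classes. It assumes the padding happens at element boundaries, which is one reading of "adding ignorableWhitespace if the last character is no whitespace".

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decorator (not taken from the TIKA-961 patch): before each
// element boundary it emits one ignorable space if the last character
// passed downstream was not whitespace.
class WhitespaceGuardHandler extends DefaultHandler {
    private final ContentHandler delegate;
    // Nothing has been emitted yet, so no padding is needed at the start.
    private boolean lastWasWhitespace = true;

    WhitespaceGuardHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (length > 0) {
            lastWasWhitespace = Character.isWhitespace(ch[start + length - 1]);
        }
        delegate.characters(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        if (length > 0) {
            lastWasWhitespace = true;
        }
        delegate.ignorableWhitespace(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        padIfNeeded();
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        padIfNeeded();
        delegate.endElement(uri, localName, qName);
    }

    private void padIfNeeded() throws SAXException {
        if (!lastWasWhitespace) {
            delegate.ignorableWhitespace(new char[] {' '}, 0, 1);
            lastWasWhitespace = true;
        }
    }
}

// Hypothetical sink that records only text events so the effect is visible.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class WhitespaceGuardDemo {
    // Replays the SAX events for the fragment foo<a>bar</a>baz.
    static String run() {
        TextCollector out = new TextCollector();
        WhitespaceGuardHandler handler = new WhitespaceGuardHandler(out);
        try {
            handler.characters("foo".toCharArray(), 0, 3);
            handler.startElement("", "a", "a", new AttributesImpl());
            handler.characters("bar".toCharArray(), 0, 3);
            handler.endElement("", "a", "a");
            handler.characters("baz".toCharArray(), 0, 3);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The demo also shows the edge case mentioned in the comment: the text "foobarbaz" with an anchor embedded in it comes out as "foo bar baz", which is undesirable for languages that do not put whitespace between words.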