Hoss Man created TIKA-1134:
------------------------------

             Summary: ContentHandler gets ignorable whitespace for <br> tags 
when parsing HTML
                 Key: TIKA-1134
                 URL: https://issues.apache.org/jira/browse/TIKA-1134
             Project: Tika
          Issue Type: Bug
            Reporter: Hoss Man


I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding 
something here, but it appears that the way Tika parses HTML to produce XHTML 
SAX events misinterprets "<br>" tags as equivalent to ignorable 
whitespace containing a newline.  This means that clients who ask Tika to parse 
files and specify their own ContentHandler to capture the character data can 
get sequences of run-on text without knowing that the "<br>" tag was present -- 
_unless_ they explicitly handle ignorableWhitespace and treat it as "real" 
whitespace -- but this creates a catch-22 if you really do want to ignore the 
ignorable whitespace in the HTML markup.

The crux of the problem seems to be:
 * instead of generating a startElement event for "br", the HtmlParser treats it 
as an xhtml.newline() call.
 * xhtml.newline() generates an ignorableWhitespace SAX event instead of a 
characters SAX event

...either one of these by itself might be fine, but in combination they 
don't really make any sense.  If, for example, an actual newline exists in the 
HTML, it comes across as part of a characters SAX event, not as ignorable 
whitespace.
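
To make the effect concrete, here is a minimal sketch of what a client-side 
handler sees.  It does not call Tika at all: the replay() method is a 
hand-written stand-in for the event sequence described above for markup like 
"<p>one<br>two</p>", and the class and method names are made up for 
illustration.

```java
import org.xml.sax.helpers.DefaultHandler;

class BrWhitespaceDemo {

    /** Collects character data; optionally treats ignorable whitespace as real text. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean keepIgnorable;

        TextCollector(boolean keepIgnorable) {
            this.keepIgnorable = keepIgnorable;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            if (keepIgnorable) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    private static void chars(TextCollector handler, String s) {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    private static void ignorable(TextCollector handler, String s) {
        handler.ignorableWhitespace(s.toCharArray(), 0, s.length());
    }

    /** Assumed event sequence (simulated, not captured from Tika). */
    static String replay(boolean keepIgnorable) {
        TextCollector handler = new TextCollector(keepIgnorable);
        chars(handler, "one");
        ignorable(handler, "\n"); // where the <br> was in the markup
        chars(handler, "two");
        ignorable(handler, "\n"); // synthetic newline added after an element
        return handler.text();
    }

    public static void main(String[] args) {
        // Ignoring ignorable whitespace loses the <br>: the text runs together.
        System.out.println("[" + replay(false) + "]");
        // Keeping it restores the break, but also keeps the synthetic newline.
        System.out.println("[" + replay(true) + "]");
    }
}
```

With keepIgnorable set to false the two words run together; with it set to 
true the line break comes back, but so does the purely cosmetic newline, which 
is the catch-22 mentioned earlier.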


Changing the newline() function to delegate to characters(...) seems to solve 
the problem for <br> tags in HTML, but breaks several tests -- probably because 
the newline() function is also used to intentionally add (synthetic) 
ignorableWhitespace events after elements.
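
One possible direction, sketched very roughly, would be to keep the two uses 
apart.  This is not Tika's code and not a tested patch: the class 
NewlineSplitSketch and its methods lineBreak() and formattingNewline() are 
hypothetical names, and only the standard org.xml.sax interfaces are used.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NewlineSplitSketch {
    private static final char[] NL = {'\n'};

    private final ContentHandler out;

    NewlineSplitSketch(ContentHandler out) {
        this.out = out;
    }

    /** Hypothetical: a line break that came from <br>, reported as real text. */
    void lineBreak() throws SAXException {
        out.characters(NL, 0, NL.length);
    }

    /** Hypothetical: a cosmetic newline after an element, reported as ignorable. */
    void formattingNewline() throws SAXException {
        out.ignorableWhitespace(NL, 0, NL.length);
    }

    /** Feeds a small event sequence to a handler that only collects characters. */
    static String demo() {
        StringBuilder text = new StringBuilder();
        ContentHandler textOnly = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        NewlineSplitSketch sketch = new NewlineSplitSketch(textOnly);
        try {
            textOnly.characters("one".toCharArray(), 0, 3);
            sketch.lineBreak();         // stands in for <br>
            textOnly.characters("two".toCharArray(), 0, 3);
            sketch.formattingNewline(); // stands in for the synthetic newline
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + demo() + "]");
    }
}
```

In this sketch a client that ignores ignorable whitespace still sees the break 
between the two words, while the synthetic newline after the element stays 
ignorable.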



