[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16515477#comment-16515477 ]

Gerard Bouchar commented on TIKA-2671:
--------------------------------------

bq. The best solution would probably be to try each (when they differ) and 
create an out-of-vocabulary score based on tika-eval's word lists and pick the 
encoding with the lowest OOV%
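
(As an aside, the OOV idea above could look roughly like the sketch below. This is only an illustration; the word list and all names here are made up, not tika-eval's actual API.)

{code:java}
import java.nio.charset.Charset;
import java.util.Set;

public class OovChooser {
    /** Decodes the raw bytes with each candidate charset and returns the
     *  one whose tokens are most often found in the word list. */
    public static Charset pickByOov(byte[] raw, Set<String> wordList,
                                    Charset... candidates) {
        Charset best = candidates[0];
        double bestRatio = Double.MAX_VALUE;
        for (Charset candidate : candidates) {
            String text = new String(raw, candidate);
            long total = 0, oov = 0;
            // Split on anything that is not a letter, so accented words
            // like "français" stay whole.
            for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                total++;
                if (!wordList.contains(token)) {
                    oov++; // out-of-vocabulary token
                }
            }
            double ratio = total == 0 ? 1.0 : (double) oov / total;
            if (ratio < bestRatio) {
                bestRatio = ratio;
                best = candidate;
            }
        }
        return best;
    }
}
{code}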

bq. The HTML Standard approach fails when web servers lie (return incorrect 
HTTP response headers)

During my tests, I couldn't find a case of a web browser not respecting the 
standard. So implementing the standard cannot "fail", in the sense that it 
will return the same content a web browser would. I don't think trying to 
outsmart the browsers makes much sense: first, because it is very difficult 
and there is a high risk of creating new cases of pages that are displayed 
correctly in browsers but parsed incorrectly in Tika; and second, because 
pages should be treated as they are, not as they should be. For instance, in 
the case of Nutch, a page that, when opened in a browser, displays 
"français" should be indexed as "français" and not as the mojibake 
"franÃ§ais", shouldn't it?

> HtmlEncodingDetector doesn't take provided metadata into account
> ----------------------------------------------------------------
>
>                 Key: TIKA-2671
>                 URL: https://issues.apache.org/jira/browse/TIKA-2671
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> org.apache.tika.parser.html.HtmlEncodingDetector ignores the document's 
> metadata. So when it is used to detect the charset of an HTML document that 
> came with a conflicting charset specified at the transport layer (e.g. in 
> the Content-Type HTTP header), the encoding declared inside the file is 
> used instead.
> This behavior does not conform to what is [specified by the WHATWG HTML 
> Standard for determining the character encoding of HTML 
> pages|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
>  This causes bugs similar to NUTCH-2599.
>  
> If HtmlEncodingDetector is not meant to take into account meta-information 
> about the document, then maybe another detector should be provided, that 
> would be a CompositeDetector including, in that order:
>  * a new, simple MetadataEncodingDetector that would simply return the 
> encoding specified in the document's metadata (e.g. the charset from the 
> Content-Type HTTP header), if there is one (see the sketch after this list)
>  * the existing HtmlEncodingDetector
>  * a generic detector, like UniversalEncodingDetector
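
A rough sketch of the composite proposed above follows. MetadataEncodingDetector is a hypothetical name, while CompositeEncodingDetector, HtmlEncodingDetector and UniversalEncodingDetector are existing Tika classes:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Arrays;

import org.apache.tika.detect.CompositeEncodingDetector;
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.html.HtmlEncodingDetector;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

/** Hypothetical first stage: trust the charset that came with the
 *  document's metadata (e.g. the HTTP Content-Type header), if any. */
class MetadataEncodingDetector implements EncodingDetector {
    @Override
    public Charset detect(InputStream input, Metadata metadata) throws IOException {
        String contentType = metadata.get(Metadata.CONTENT_TYPE);
        if (contentType == null) {
            return null;
        }
        MediaType type = MediaType.parse(contentType);
        if (type == null) {
            return null;
        }
        String charset = type.getParameters().get("charset");
        if (charset == null) {
            return null;
        }
        try {
            return Charset.forName(charset);
        } catch (IllegalArgumentException e) {
            return null; // unknown charset name: fall through to next detector
        }
    }
}

class CompositeExample {
    static EncodingDetector standardsOrderDetector() {
        return new CompositeEncodingDetector(Arrays.asList(
                new MetadataEncodingDetector(),    // transport-layer charset first
                new HtmlEncodingDetector(),        // then the in-document <meta>
                new UniversalEncodingDetector())); // then generic heuristics
    }
}
{code}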



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
