[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

Hans Brende (JIRA) Tue, 06 Nov 2018 08:01:23 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676926#comment-16676926
 ]


Hans Brende commented on TIKA-2771:
-----------------------------------

One thing I am sure of, however, is that if your chance of getting a false 
positive for a given charset is *greater* than your chance of actually finding 
that charset "in the wild", then it is counterproductive to try to detect it in 
the first place.

That goes not just for IBM500, but for any charset other than UTF-8. Given that 
more than 90% of the web is UTF-8 (and the web, correct me if I'm wrong, seems 
to be the primary use-case for charset detection), a charset detector whose 
strategy is simply: {code:java}return "UTF-8";{code} is going to be at least 
90% accurate on web content. 
Source: https://w3techs.com/technologies/overview/character_encoding/all

So detection of any charset *other than* UTF-8 only pays off if it raises the 
overall accuracy to something *greater than* that 90% baseline; otherwise the 
false positives on documents that really are UTF-8 will actually *decrease* 
the overall accuracy! 
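
To make that arithmetic concrete, here is a rough sketch of the trade-off. It 
is only a toy model, not anything Tika computes: the class and method names 
are made up for illustration, the 90% UTF-8 share is the w3techs figure above, 
and the error rates in main() are invented numbers.

```java
/** Toy accuracy model for a charset detector (illustrative only, not Tika code). */
final class BaselineAccuracy {

    /**
     * Overall accuracy on a corpus in which a fraction utf8Share of documents is UTF-8.
     *
     * @param utf8Share         fraction of documents that really are UTF-8
     * @param falsePositiveRate fraction of UTF-8 documents wrongly labelled as another charset
     * @param otherRecall       fraction of non-UTF-8 documents labelled correctly
     */
    static double accuracy(double utf8Share, double falsePositiveRate, double otherRecall) {
        return utf8Share * (1.0 - falsePositiveRate) + (1.0 - utf8Share) * otherRecall;
    }

    /** True if the detector is more accurate than always answering "UTF-8". */
    static boolean beatsAlwaysUtf8(double utf8Share, double falsePositiveRate, double otherRecall) {
        // Always answering "UTF-8" is right exactly utf8Share of the time.
        return accuracy(utf8Share, falsePositiveRate, otherRecall) > utf8Share;
    }

    public static void main(String[] args) {
        double utf8Share = 0.90; // assumption: share of UTF-8 pages on the web

        // Hypothetical detector A: mislabels 10% of UTF-8 pages, gets 80% of the rest right.
        System.out.println("A beats baseline: " + beatsAlwaysUtf8(utf8Share, 0.10, 0.80));

        // Hypothetical detector B: mislabels 5% of UTF-8 pages, gets 80% of the rest right.
        System.out.println("B beats baseline: " + beatsAlwaysUtf8(utf8Share, 0.05, 0.80));
    }
}
```

In this model the detector beats the baseline only when 
(1 - utf8Share) * otherRecall exceeds utf8Share * falsePositiveRate. With a 90% 
UTF-8 share, that means every false positive on a UTF-8 page has to be paid 
for by more than nine correct detections of some other charset.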

(I bring this up because a comment in a different issue thread, TIKA-2038, 
mentioned that Tika [is only 72% 
accurate|https://issues.apache.org/jira/browse/TIKA-2038?focusedCommentId=15830525&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15830525].
 Am I missing something here? Would we really get a more accurate charset for 
webpages by simply guessing *everything* to be UTF-8?)
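
A less drastic variant of that baseline would be to answer UTF-8 only when the 
bytes actually decode as strict UTF-8, and to fall back to the statistical 
detector otherwise. The sketch below uses only the JDK; the class and method 
names are hypothetical, it is not how Tika currently behaves, and the fallback 
argument merely stands in for whatever the detector would have returned.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical "UTF-8 first" guess (illustrative only, not Tika code). */
final class Utf8FirstGuess {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Answers "UTF-8" for well-formed UTF-8 input; otherwise returns the
     * fallback, which stands in for a statistical detector's best match.
     */
    static String guess(byte[] bytes, String fallback) {
        return isValidUtf8(bytes) ? "UTF-8" : fallback;
    }

    public static void main(String[] args) {
        byte[] ascii = "<p>Name: Amanda</p>".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(ascii, "IBM500"));
        System.out.println(guess(latin1, "ISO-8859-1"));
    }
}
```

Pure-ASCII input, such as the test page in this issue, is also well-formed 
UTF-8, so this check would label it UTF-8. It says nothing about which 
single-byte charset to pick when the check fails.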

> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>
>                 Key: TIKA-2771
>                 URL: https://issues.apache.org/jira/browse/TIKA-2771
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.19.1
>            Reporter: Hans Brende
>            Priority: Critical
>
> When I try to run the CharsetDetector on 
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange 
> most confident result of "IBM500" with a confidence of 60 when I enable the 
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> // Needs: org.apache.tika.parser.txt.CharsetDetector,
> //        java.nio.charset.StandardCharsets, java.util.Arrays
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);        // strip markup before detection
> detect.setDeclaredEncoding("UTF-8");   // hint that should favor UTF-8
> detect.setText(("<!DOCTYPE html>\n" +
>         "<div>\n" +
>         "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" " +
>         "itemref=\"a b\"></div>\n" +
>         "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
>         "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
>         "</div>").getBytes(StandardCharsets.UTF_8));
> // Print every candidate match, most confident first.
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even 
> worse, with UTF-8 falling from a confidence of 57 to 15. 
> This is screwing up 1 out of 84 of my online microdata extraction tests over 
> in Any23 (as that particular page is being rendered into complete gibberish), 
> so I had to implement some hacky workarounds which I'd like to remove if 
> possible.
> EDIT: This issue may be related to TIKA-2737 and [this 
> comment|https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=13213524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13213524].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
