[
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675828#comment-16675828
]
Hans Brende edited comment on TIKA-2771 at 11/5/18 10:44 PM:
-------------------------------------------------------------
[[email protected]] Ah, you're correct as regards the byteMap. The TODO
comment threw me.
However, on closer inspection of the IBM500 byteMap, I see an even more
alarming issue: Only 0x40 should map to 0x20, but the byteMap actually maps 118
out of the 256 bytes to 0x20!!! (Including 0x20 itself, which is *not* a
space, but rather a control character in IBM500!)
This could explain why so many false positives for IBM500 are occurring: *all
special characters are mapped to spaces, which are then simply ignored by the
n-gram detector*. But in order to have accurate n-gram measurements, those
special characters need to be included in the calculations, I believe. Perhaps
they should be mapped to 0x00 instead of 0x20?
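One way to sanity-check the claim about 0x40 and 0x20 is to ask the JDK's own IBM500 decoder, independently of Tika. The following is only a rough sketch: the class name and both helper methods are made up for illustration, it assumes the running JRE ships the IBM500 charset (hence the isSupported guard), and countEntriesEqualTo is meant for whatever 256-entry table you copy out of the recognizer by hand, since the byteMap is not assumed to be reachable through any public API.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Tika or ICU.
public class Ibm500SpaceCheck {

    /** Returns every single-byte value (0x00-0xFF) that the charset decodes to target. */
    public static List<Integer> bytesDecodingTo(Charset charset, char target) {
        List<Integer> hits = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            String decoded = new String(new byte[] {(byte) b}, charset);
            if (decoded.length() == 1 && decoded.charAt(0) == target) {
                hits.add(b);
            }
        }
        return hits;
    }

    /** Counts the entries of a lookup table (e.g. a hand-copied byteMap) equal to value. */
    public static int countEntriesEqualTo(byte[] byteMap, int value) {
        int count = 0;
        for (byte entry : byteMap) {
            if ((entry & 0xFF) == value) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // IBM500 is an extended charset, so it may be missing from a trimmed-down JRE.
        if (!Charset.isSupported("IBM500")) {
            System.out.println("IBM500 is not available in this JRE");
            return;
        }
        Charset ibm500 = Charset.forName("IBM500");

        // List the byte values that really are a space (U+0020) in IBM500.
        for (int b : bytesDecodingTo(ibm500, ' ')) {
            System.out.printf("0x%02X decodes to U+0020%n", b);
        }

        // Show what byte 0x20 decodes to, and whether it is a control character.
        String at0x20 = new String(new byte[] {0x20}, ibm500);
        char c = at0x20.charAt(0);
        System.out.printf("0x20 decodes to U+%04X (control: %b)%n", (int) c, Character.isISOControl(c));
    }
}
```

Passing a copy of the recognizer's table to countEntriesEqualTo(table, 0x20) would give the figure to compare against the 118 quoted above.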
was (Author: hansbrende):
[[email protected]] Ah, you're correct as regards the byteMap. The TODO
comment threw me.
However, on closer inspection of the IBM500 byteMap, I see an even more
alarming issue: 118 out of the 256 bytes map to 0x20!!!
But only 0x40 should map to 0x20.
This could explain why so many false positives for IBM500 are occurring: *all
special characters are mapped to spaces, and then simply ignored*. But in order
to have accurate n-gram measurements, those special characters need to be
included in the calculations, I believe. I'm not sure, but perhaps they should
be mapped to 0x00 instead of 0x20?
> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>
> Key: TIKA-2771
> URL: https://issues.apache.org/jira/browse/TIKA-2771
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.19.1
> Reporter: Hans Brende
> Priority: Critical
>
> When I try to run the CharsetDetector on
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange
> most confident result of "IBM500" with a confidence of 60 when I enable the
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<!DOCTYPE html>\n" +
> "<div>\n" +
> " <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
> " <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
> " <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
> "</div>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even
> worse, with UTF-8 falling from a confidence of 57 to 15.
> This is screwing up 1 out of 84 of my online microdata extraction tests over
> in Any23 (as that particular page is being rendered into complete gibberish),
> so I had to implement some hacky workarounds which I'd like to remove if
> possible.
> EDIT: This issue may be related to TIKA-2737 and [this
> comment|https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=13213524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13213524].