[jira] [Commented] (ANY23-554) Avoid using carriage return to detect windows-1252 charset if content type has been identified from metadata

Hans Brende (Jira) Wed, 05 Jan 2022 01:02:18 -0800


    [ 
https://issues.apache.org/jira/browse/ANY23-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469135#comment-17469135
 ]


Hans Brende commented on ANY23-554:
-----------------------------------

A couple of thoughts here:

1. ISO-8859-1 and windows-1252 are not actually incompatible. The WHATWG 
Encoding Standard treats "iso-8859-1" as a label for the windows-1252 
encoding, so for web content the two are effectively synonyms: 
https://encoding.spec.whatwg.org/#ref-for-windows-1252%E2%91%A0

2. It is very common to mislabel Windows-1252 text as ISO-8859-1 (see 
https://en.wikipedia.org/wiki/Windows-1252 )

3. As mentioned in the comment in the linked code, the \r heuristic was 
copied from Tika's implementation, so it has solid precedent

4. Labels are also heuristics... the question is, which heuristic should rank 
higher? The charset label heuristic should win sometimes, but not always, due 
to the prevalence of mislabeled content on the web. For example, we'd 
definitely want to assign a byte-order mark higher priority than a label, 
*especially* in HTML markup, since it is actually illegal to declare any meta 
encoding *except* UTF-8 in an HTML document! So one could say that a document 
whose meta tag declares anything other than UTF-8 is *already* malformed. (See 
WHATWG: https://html.spec.whatwg.org/#charset).
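
To make the ranking in point 4 concrete, here is a minimal Java sketch of one possible priority order: a byte-order mark first, then the declared label, with the \r heuristic only promoting an ISO-8859-1 label to windows-1252. The class and method names (CharsetRanking, fromBom, choose) are hypothetical and exist only for illustration; this is not the code in EncodingUtils or Tika, and the exact ordering is an assumption rather than a settled design.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Any23's or Tika's actual implementation.
class CharsetRanking {

    // Returns the charset implied by a leading byte-order mark, or null if there is none.
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // True if the raw bytes contain a carriage return (0x0D).
    static boolean containsCarriageReturn(byte[] b) {
        for (byte x : b) {
            if (x == '\r') {
                return true;
            }
        }
        return false;
    }

    // Assumed ranking: BOM > label, with the \r heuristic refining an
    // ISO-8859-1 label to windows-1252. "declared" may be null (no label).
    static Charset choose(byte[] content, Charset declared) {
        Charset bom = fromBom(content);
        if (bom != null) {
            return bom; // a BOM outranks any label
        }
        if (StandardCharsets.ISO_8859_1.equals(declared) && containsCarriageReturn(content)) {
            return Charset.forName("windows-1252");
        }
        return declared;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withCr = "line one\r\nline two".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(choose(withBom, StandardCharsets.ISO_8859_1));
        System.out.println(choose(withCr, StandardCharsets.ISO_8859_1));
    }
}
```

Under this ordering, the behaviour the issue objects to (a \r overriding an explicit ISO-8859-1 label) is confined to a single branch, so it could be made conditional on where the label came from without touching the BOM check.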

> Avoid using carriage return to detect windows-1252 charset if content type 
> has been identified from metadata
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ANY23-554
>                 URL: https://issues.apache.org/jira/browse/ANY23-554
>             Project: Apache Any23
>          Issue Type: Task
>            Reporter: Peter Ansell
>            Priority: Major
>
> Two encoding detection tests are failing on Windows and Windows Subsystem for 
> Linux due to a condition that overrides a meta tag with a heuristic, which is 
> not likely correct in its current form as carriage returns are present in 
> many different Windows produced documents, which may legitimately follow 
> ISO-8859-1.
> If someone has put a meta tag in with ISO-8859-1, we shouldn't be using the 
> presence of carriage return characters to override that with an incompatible 
> Windows-specific code page, windows-1252.
> The relevant code is:
> https://github.com/apache/any23/blob/any23-2.6/encoding/src/main/java/org/apache/any23/encoding/EncodingUtils.java#L62-L69
> The tests that are failing on Windows and WSL2 are:
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   TikaEncodingDetectorTest.testISO8859HTML:58->assertEncoding:128 Unexpected encoding expected:<[ISO-8859-1]> but was:<[windows-1252]>
> [ERROR]   TikaEncodingDetectorTest.testISO8859XHTML:63->assertEncoding:128 Unexpected encoding expected:<[ISO-8859-1]> but was:<[windows-1252]>
> [INFO]
> [ERROR] Tests run: 12, Failures: 2, Errors: 0, Skipped: 0
> [INFO]
> [INFO] ------------------------------------------------------------------------
> [INFO] Reactor Summary for Apache Any23 2.6:
> [INFO]
> [INFO] Apache Any23 ....................................... SUCCESS [01:57 min]
> [INFO] Apache Any23 :: Base API ........................... SUCCESS [ 56.016 s]
> [INFO] Apache Any23 :: Test Resources ..................... SUCCESS [  1.068 s]
> [INFO] Apache Any23 :: CSV Utilities ...................... SUCCESS [  2.759 s]
> [INFO] Apache Any23 :: Mime Type Detection ................ SUCCESS [01:10 min]
> [INFO] Apache Any23 :: Encoding Detection ................. FAILURE [  4.160 s]
> [INFO] Apache Any23 :: Core ............................... SKIPPED
> [INFO] Apache Any23 :: CLI ................................ SKIPPED
> [INFO] ------------------------------------------------------------------------



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
