[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119038#comment-13119038 ]
Robert Muir commented on TIKA-721:
----------------------------------
{quote}
Finally, for the valid code points, I count how many times each
Unicode block had a character; usually a doc will be in a single
language and have a high percentage of its chars from a single block (I
think!?).
{quote}
I don't think this is a good idea: languages like Japanese use multiple blocks,
and many writing systems (e.g. Cyrillic, Arabic, etc.) tend to use ASCII digits
and punctuation, so a single block often won't dominate.
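To illustrate the point, here is a rough sketch (not taken from the patch) that counts code points per Unicode block with Character.UnicodeBlock.of. The class name BlockHistogram and the two sample strings are made up for this example; the strings are written with \u escapes so the source file encoding does not matter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Tika or TIKA-721.patch.
class BlockHistogram {

    // Counts how many code points of the text fall into each Unicode block.
    static Map<Character.UnicodeBlock, Integer> count(String text) {
        Map<Character.UnicodeBlock, Integer> counts =
            new HashMap<Character.UnicodeBlock, Integer>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) { // null: code point is in no block known to this JVM
                Integer old = counts.get(block);
                counts.put(block, old == null ? 1 : old + 1);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return counts;
    }

    public static void main(String[] args) {
        // Japanese: three kanji, one hiragana, four katakana, then "2011".
        System.out.println(count("\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C82011"));
        // Russian: the word "Moskva" in Cyrillic, then ", 2011." in ASCII.
        System.out.println(count("\u041C\u043E\u0441\u043A\u0432\u0430, 2011."));
    }
}
```

The short Japanese sample is spread over four blocks (CJK Unified Ideographs, Hiragana, Katakana, Basic Latin), none with more than a third of the characters. In the Russian sample Basic Latin (7) actually outnumbers Cyrillic (6) once the space, digits and punctuation are counted.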
{quote}
If I decode to a Unicode code point, I then call Java's
Character.isDefined to see if it's really valid
{quote}
I don't think this is that great either: e.g. Java 6 supports a very old
version of the Unicode standard (4.x), so that method will return false for
completely valid characters that were added in newer versions.
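A small sketch of the difference, under my own assumptions: DefinedCheck and isScalarValue are hypothetical names, and the range-plus-surrogate test is just one possible version-independent check, not something proposed in the patch. Character.isDefined answers "is this code point assigned in the Unicode version this JVM ships with", while the check below only asks whether the value can be a legal code point at all.

```java
// Hypothetical helper, for illustration only; not part of Tika or TIKA-721.patch.
class DefinedCheck {

    // True if cp is a Unicode scalar value: within U+0000..U+10FFFF and not a
    // surrogate. This does not depend on the Unicode version of the JVM.
    static boolean isScalarValue(int cp) {
        return Character.isValidCodePoint(cp) && (cp < 0xD800 || cp > 0xDFFF);
    }

    public static void main(String[] args) {
        // 'A', a character added to Unicode long after 4.0, a lone surrogate,
        // and a value beyond the Unicode range.
        int[] samples = { 0x0041, 0x1F600, 0xD800, 0x110000 };
        for (int cp : samples) {
            // The isDefined column for U+1F600 depends on which JVM runs this.
            System.out.printf("U+%04X isDefined=%b isScalarValue=%b%n",
                cp, Character.isDefined(cp), isScalarValue(cp));
        }
    }
}
```

The trade-off is that a purely structural check like this accepts unassigned code points too, so it is weaker evidence that a decode is plausible.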
> UTF16-LE not detected
> ---------------------
>
> Key: TIKA-721
> URL: https://issues.apache.org/jira/browse/TIKA-721
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: Chinese_Simplified_utf16.txt, TIKA-721.patch
>
>
> I have a test file encoded in UTF16-LE, but Tika fails to detect it.
> Note that it is missing the BOM, which is not allowed (for UTF16-BE
> the BOM is optional).
> Not sure we can realistically fix this; I have no idea how...
> Here's what Tika detects:
> {noformat}
> windows-1250: confidence=9
> windows-1250: confidence=7
> windows-1252: confidence=7
> windows-1252: confidence=6
> windows-1252: confidence=5
> IBM420_ltr: confidence=4
> windows-1252: confidence=3
> windows-1254: confidence=2
> windows-1250: confidence=2
> windows-1252: confidence=2
> IBM420_rtl: confidence=1
> windows-1253: confidence=1
> windows-1250: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> windows-1252: confidence=1
> {noformat}
> The test file decodes fine as UTF16-LE; eg in Python just run this:
> {noformat}
> import codecs
> codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt', 'rb').read())
> {noformat}