[jira] [Commented] (TIKA-1437) encoding issue in AutoDetectReader

Tyler Palsulich (JIRA) Fri, 20 Feb 2015 22:04:04 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330017#comment-14330017
 ]


Tyler Palsulich commented on TIKA-1437:
---------------------------------------

[~Lukeliush], can you make a couple of updates to make this easier to test? 
First, come up with a small (few-line) file that reproduces the problem. That 
way, we can be sure we can legally include the file within Tika. Also, can you 
reformat your testing script as a Tika JUnit TestCase? You can see an example 
[here|https://github.com/apache/tika/blob/trunk/tika-parsers/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java].

The file you have might just be corrupted, which could explain the different 
results. And, as Tim mentioned, no detector is perfect, so different detectors 
will sometimes disagree. Still, the above changes will help us narrow it down. 
Thanks!
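
For reference, here is a rough sketch of how a few-line reproducer could be 
checked using only the JDK, with no Tika calls. It encodes a made-up TSV row 
as ISO-8859-1, decodes the bytes both ways, and uses a strict decoder to test 
whether the bytes are well-formed UTF-8 at all. The class name 
EncodingMismatchDemo, the helper isValidUtf8, and the sample row are 
hypothetical. The choice of U+00E9 and U+00EF is only a guess based on the 
attachment names e9.jpg and ef.jpg.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Tika.
class EncodingMismatchDemo {

    // Returns true only if the bytes are well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Made-up sample row (assumption): contains U+00E9 and U+00EF.
        String row = "caf\u00e9\tna\u00efve\n";
        byte[] latin1 = row.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the encoding used to write the bytes round-trips.
        String asLatin1 = new String(latin1, StandardCharsets.ISO_8859_1);
        // Decoding the same bytes as UTF-8 replaces the lone 0xE9 and 0xEF
        // bytes with U+FFFD, which shows up as garbled characters.
        String asUtf8 = new String(latin1, StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 round trip ok: " + asLatin1.equals(row));
        System.out.println("UTF-8 decode has U+FFFD:  " + asUtf8.contains("\uFFFD"));
        System.out.println("bytes are valid UTF-8:    " + isValidUtf8(latin1));
    }
}
```

If a small file fails a strict check like this, a UTF-8 answer from a detector 
cannot be right for it, which makes the file a reasonable candidate for a unit 
test. If the full file passes the check, the detectors may simply be 
disagreeing on ambiguous input.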

> encoding issue in AutoDetectReader
> ----------------------------------
>
>                 Key: TIKA-1437
>                 URL: https://issues.apache.org/jira/browse/TIKA-1437
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.6
>         Environment: Windows 8
>            Reporter: Luke sh
>            Priority: Critical
>         Attachments: EncodingProblem.java, computrabajo-ar-20121108.tsv, 
> e9.jpg, ef.jpg
>
>
> We are having an encoding problem with Tika's AutoDetectReader.
> We use AutoDetectReader to read a stream and extract string values by 
> calling AutoDetectReader.readLine(). The encoding problem occurs in 
> UniversalEncodingDetector, which AutoDetectReader calls when reading the 
> input stream passed as one of the arguments to our TSVParser’s parse 
> method. 
> We use AutoDetectReader in our parser because we believed it could 
> auto-detect the correct encoding of the input stream passed to it, but we 
> are seeing several garbled characters in the converted output files from 
> our parser. We found that the problem occurs in UniversalEncodingDetector, 
> which returns UTF-8, so AutoDetectReader reads the stream as UTF-8. That is 
> the wrong encoding; the correct encoding is ISO-8859-1.
> I am attaching screenshots (e9.jpg and ef.jpg) of the character differences 
> we see between the input TSV file and the converted output file; please 
> read the description for details.
> The problem is that AutoDetectReader decodes and reads the characters with 
> an incorrect encoding. 
> BTW, we were able to work around this problem with CharsetDetector, which 
> for the moment seems to produce a valid encoding that we can use to read 
> the TSV file properly.
> However, this means we cannot use AutoDetectReader; we had to create our 
> own TSVAutoDetectReader that uses CharsetDetector in its detect method. 
> AutoDetectReader is not flexible enough for us to extend: many of its 
> methods are private, so we cannot manually set the encoding or override 
> the existing encoding detection.
> In addition, I am not confident about CharsetDetector either, as I am 
> seeing different encodings produced by CharsetDetector and AutoDetectReader 
> for different TSV files. But for now, we can live with CharsetDetector, as 
> it solves the current encoding problem.
> Finally, I am attaching my test program (EncodingProblem.java), which 
> reads a directory of TSV files and, for each file, displays the encodings 
> produced by AutoDetectReader, UniversalEncodingDetector (which is called by 
> AutoDetectReader), and CharsetDetector. This should show the difference: 
> they produce different encodings for some TSV files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
