[ https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330017#comment-14330017 ]
Tyler Palsulich commented on TIKA-1437:
---------------------------------------

[~Lukeliush], can you make a couple of updates to make this easier to test? First, come up with a small (few-line) file that shows this problem. That way, we can be sure we can legally include the file within Tika. Also, can you reformat your testing script as a Tika JUnit TestCase? You can see an example [here|https://github.com/apache/tika/blob/trunk/tika-parsers/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java].

The file you have might just be corrupted, which would give different results. And, as Tim mentioned, no detector will be perfect, so different detectors will give different results. But the above changes will help us narrow it down.

Thanks!

> encoding issue in AutoDetectReader
> ----------------------------------
>
>                 Key: TIKA-1437
>                 URL: https://issues.apache.org/jira/browse/TIKA-1437
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.6
>         Environment: Windows 8
>            Reporter: Luke sh
>            Priority: Critical
>         Attachments: EncodingProblem.java, computrabajo-ar-20121108.tsv, e9.jpg, ef.jpg
>
>
> We are having an encoding problem with Tika's AutoDetectReader. We use AutoDetectReader to read a stream and extract string values by calling readLine() on the AutoDetectReader. The problem appears to occur in UniversalEncodingDetector, which AutoDetectReader calls when reading the input stream passed as one of the arguments to our TSVParser's parse method.
> We use AutoDetectReader in our parser and believed it would auto-detect the correct encoding of the input stream passed to it, but we are seeing several garbled characters in the converted files that our parser outputs. We found that the problem occurs in UniversalEncodingDetector, which returns UTF-8, so AutoDetectReader reads the stream as UTF-8. That is incorrect; the correct encoding is ISO-8859-1.
> I am attaching screenshots of the character differences between the input TSV file and the converted output file. They are e9.jpg and ef.jpg; please read the description for details.
> The problem is that AutoDetectReader decodes the characters with the wrong encoding.
> BTW, we were able to work around this problem with CharsetDetector, which for the moment seems to produce a valid encoding that we can use to read the TSV file properly.
> However, this means we cannot use AutoDetectReader; we have to create our own TSVAutoDetectReader that uses CharsetDetector in its detect method. The AutoDetectReader class is hard for us to extend: many of its methods are private, so we cannot manually set the encoding or override the existing encoding detection.
> In addition, I am not confident about CharsetDetector either, as I am seeing different encodings produced by CharsetDetector and AutoDetectReader for different TSV files. But for now we can live with CharsetDetector, as it solves the current encoding problem.
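
To make the reported symptom concrete, here is a minimal sketch using only the Java standard library (no Tika classes). It shows what happens when ISO-8859-1 bytes are decoded as UTF-8, and one possible strict-decoding fallback. The sample text, the class name EncodingMismatchDemo, and the methods decodeStrict and decodeWithFallback are all hypothetical; the fallback policy is an illustration only, not how AutoDetectReader, UniversalEncodingDetector, or CharsetDetector decide.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika or of the attached EncodingProblem.java.
class EncodingMismatchDemo {

    // Decode strictly: malformed input raises an exception instead of
    // being silently replaced with U+FFFD.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Assumed fallback policy (illustration only): accept UTF-8 when the bytes
    // are valid UTF-8, otherwise treat them as ISO-8859-1.
    static String decodeWithFallback(byte[] bytes) {
        try {
            return decodeStrict(bytes, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // Made-up TSV row containing the ISO-8859-1 bytes 0xE9 and 0xEF.
        byte[] latin1 = "caf\u00e9\tna\u00efve".getBytes(StandardCharsets.ISO_8859_1);

        // Lenient decoding with the wrong charset replaces the bad bytes with U+FFFD.
        String wrong = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("Decoded as UTF-8 contains U+FFFD: " + wrong.contains("\uFFFD"));

        // Strict UTF-8 decoding fails, so the fallback decodes as ISO-8859-1.
        System.out.println("Fallback result: " + decodeWithFallback(latin1));
    }
}
```

A few bytes like these could also serve as the kind of small, freely redistributable test input requested in the comment above, although whether they reproduce the detector's UTF-8 guess would have to be checked against Tika itself.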
> Finally, I am also attaching my test program (PFA: EncodingProblem.java). It reads an input directory of TSV files and displays, for each file, the encodings produced by AutoDetectReader, UniversalEncodingDetector (which is called by AutoDetectReader), and CharsetDetector, so you can see the difference: they produce different encodings for some TSV files.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)