> On 3 May 2017, at 12:18, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>
> Hi Cyril,
>
> I want to try to write such a detector. I'll get back to you.
I added the following (Zn #bleedingEdge):

===
Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.49
Author: SvenVanCaekenberghe
Time: 3 May 2017, 4:30:44.081888 pm
UUID: fe8b083d-010b-0d00-9df5-fde304bccfdc
Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.48

Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes
Add ZnCharacterEncoderTests>>#testDetectEncoding
Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
===
Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.31
Author: SvenVanCaekenberghe
Time: 3 May 2017, 4:31:09.469852 pm
UUID: 30ef8b3e-010b-0d00-9df6-4a9304bccfdc
Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.30

Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes
Add ZnCharacterEncoderTests>>#testDetectEncoding
Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
===

Now you can do the following:

ZnCharacterEncoder detectEncoding:
  ((FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in | in upToEnd ]).

(FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
  | bytes encoder |
  bytes := in upToEnd.
  encoder := ZnCharacterEncoder detectEncoding: bytes.
  encoder decodeBytes: bytes ].

It works on the test file you gave me, but this process is just a guess, a heuristic that is unreliable and often wrong (especially for very similar byte encodings); see https://en.wikipedia.org/wiki/Charset_detection.

You can give the detector the whole contents, or just a header.

I was a bit too optimistic, though: this is basically an unsolvable problem. It is MUCH better to somehow know up front what encoding was used, or to know something usable about the contents (like the header of HTML or XML). A sketch of decoding with an explicitly chosen encoder follows after the quoted message below.

Sven

> Any chance you could give me (part of) a file that causes you trouble (one
> that is legal Latin-1, yet does not fail as UTF-8 while being decoded
> wrongly as UTF-8)?
>
> Sven
>
>> On 3 May 2017, at 11:40, Cyril Ferlicot D. <cyril.ferli...@gmail.com> wrote:
>>
>> Hello,
>>
>> We have a problem using Moose because we have files whose encoding we
>> don't know. Currently we have this implementation to get the content
>> of a file:
>>
>> completeText
>>   self fileReference exists ifFalse: [ ^ '' ].
>>   ^ self fileReference readStreamDo: [ :s |
>>     [ s contents ]
>>       on: Error
>>       do: [
>>         [ s converter: Latin1TextConverter new; contents ]
>>           on: Error
>>           do: [ '' ] ] ]
>>
>> But we have a problem because we currently have some files at
>> Synectique in ISO-8859-1. The problem is that #contents is able to read
>> some of these files without throwing an error, yet the content is wrong
>> because it is not decoded with the right encoding.
>>
>> Thus I wonder if it is possible to get the encoding of a FileReference
>> in Pharo, to be able to read the file with the right encoding? Something
>> like the bash command `file -I myFile.txt`.
>>
>> --
>> Cyril Ferlicot
>> https://ferlicot.fr
>>
>> http://www.synectique.eu
>> 2 rue Jacques Prévert 01,
>> 59650 Villeneuve d'ascq France
>>
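For reference, here is a minimal sketch of what "knowing the encoding up front" could look like for Cyril's completeText, using the Zn encoders directly: read the bytes once, decode them strictly as UTF-8, and fall back to ISO-8859-1 (the encoding the problematic Synectique files are known to use) only when the UTF-8 decode actually fails. The selector completeText, the fileReference accessor and the Latin-1 fallback are taken from Cyril's quoted method; catching ZnCharacterEncodingError is an assumption, adjust it to whatever error your Zinc version signals on invalid byte sequences.

completeText
  "Decode with explicit Zn encoders instead of guessing the encoding.
  Try strict UTF-8 first; fall back to ISO-8859-1 only when UTF-8
  decoding fails. ZnCharacterEncodingError is assumed to be the error
  signalled on invalid input."
  | bytes |
  self fileReference exists ifFalse: [ ^ '' ].
  bytes := self fileReference binaryReadStreamDo: [ :in | in upToEnd ].
  ^ [ (ZnCharacterEncoder newForEncoding: 'utf-8') decodeBytes: bytes ]
      on: ZnCharacterEncodingError
      do: [ (ZnCharacterEncoder newForEncoding: 'iso-8859-1') decodeBytes: bytes ]

Since every byte sequence is valid ISO-8859-1, the fallback always yields a String; it can still be the wrong text when a file really is Latin-1 but also happens to be valid UTF-8, which is exactly the ambiguity that detectEncoding: can only guess at.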