On 03/05/2017 16:41, Sven Van Caekenberghe wrote:
> I added the following (Zn #bleedingEdge):
> 
> ===
> Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.49
> Author: SvenVanCaekenberghe
> Time: 3 May 2017, 4:30:44.081888 pm
> UUID: fe8b083d-010b-0d00-9df5-fde304bccfdc
> Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.48
> 
> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and 
> unreliably guess the encoding used by a collection of bytes
> 
> Add ZnCharacterEncoderTests>>#testDetectEncoding
> 
> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
> 
> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
> ===
> Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.31
> Author: SvenVanCaekenberghe
> Time: 3 May 2017, 4:31:09.469852 pm
> UUID: 30ef8b3e-010b-0d00-9df6-4a9304bccfdc
> Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.30
> 
> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and 
> unreliably guess the encoding used by a collection of bytes
> 
> Add ZnCharacterEncoderTests>>#testDetectEncoding
> 
> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
> 
> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
> ===
> 
> 
> Now you can do the following:
> 
> ZnCharacterEncoder detectEncoding: ((FileLocator desktop / 'some.data') 
> binaryReadStreamDo: [ :in | in upToEnd ]).
> 
> (FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
>       | bytes encoder |
>       bytes := in upToEnd.
>       encoder := ZnCharacterEncoder detectEncoding: bytes.
>       encoder decodeBytes: bytes ].
> 
> It works on the test file you gave me, but this process is just a guess, a 
> heuristic that is unreliable and often wrong (especially for very similar 
> byte encodings), see https://en.wikipedia.org/wiki/Charset_detection.
> 
> You can give the whole contents to the detector, or just a header.
> 
> I was a bit too optimistic though, this is basically an unsolvable problem. 
> It is MUCH better to somehow know up front what the encoding used is, or to 
> know something useable about the contents (like the header of HTML or XML).
> 
> Sven
> 

Hi,

It seems to guess right in our case, and it corrects the problems we saw.

Thank you for this! We will integrate it into our tools once the
configuration is updated.

-- 
Cyril Ferlicot
https://ferlicot.fr

http://www.synectique.eu
2 rue Jacques Prévert 01,
59650 Villeneuve d'ascq France
