On 03/05/2017 at 16:41, Sven Van Caekenberghe wrote:
> 
>> On 3 May 2017, at 12:18, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>
>> Hi Cyril,
>>
>> I want to try to write such a detector. I'll get back to you.
> 
> I added the following (Zn #bleedingEdge):
> 
> ===
> Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.49
> Author: SvenVanCaekenberghe
> Time: 3 May 2017, 4:30:44.081888 pm
> UUID: fe8b083d-010b-0d00-9df5-fde304bccfdc
> Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.48
> 
> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and 
> unreliably guess the encoding used by a collection of bytes
> 
> Add ZnCharacterEncoderTests>>#testDetectEncoding
> 
> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
> 
> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
> ===
> Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.31
> Author: SvenVanCaekenberghe
> Time: 3 May 2017, 4:31:09.469852 pm
> UUID: 30ef8b3e-010b-0d00-9df6-4a9304bccfdc
> Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.30
> 
> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and 
> unreliably guess the encoding used by a collection of bytes
> 
> Add ZnCharacterEncoderTests>>#testDetectEncoding
> 
> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
> 
> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
> ===
> 
> 
> Now you can do the following:
> 
> ZnCharacterEncoder detectEncoding: ((FileLocator desktop / 'some.data') 
> binaryReadStreamDo: [ :in | in upToEnd ]).
> 
> (FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
>       | bytes encoder |
>       bytes := in upToEnd.
>       encoder := ZnCharacterEncoder detectEncoding: bytes.
>       encoder decodeBytes: bytes ].
> 
> It works on the test file you gave me, but this process is just a guess, a 
> heuristic that is unreliable and often wrong (especially for very similar 
> byte encodings), see https://en.wikipedia.org/wiki/Charset_detection.
> 
> You can give the whole contents to the detector, or just a header.
> 
> I was a bit too optimistic though, this is basically an unsolvable problem. 
> It is MUCH better to somehow know up front what the encoding used is, or to 
> know something useable about the contents (like the header of HTML or XML).
> 
> Sven
> 

Thank you! I'll try this tomorrow. If it works well, I wonder whether we can
still include it in Pharo 6. Since it's only a small feature that is unused in
Pharo, it should not break anything, and it would be a cool addition for Moose.

But since we are in feature freeze, if people do not want it I won't push it
for Pharo 6 :)
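
For reference, a minimal sketch of the "just a header" variant Sven mentions:
detect on a prefix of the bytes, then decode the whole contents. The 1024-byte
prefix size and the variable name `header` are assumptions made for
illustration, not part of Zinc. Cutting the prefix at a fixed size may split a
multi-byte sequence, which could mislead the detector.

```smalltalk
(FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
      | bytes header encoder |
      bytes := in upToEnd.
      "Assumption: the first 1024 bytes are representative; the size is arbitrary."
      header := bytes first: (1024 min: bytes size).
      "The result is still only a guess, as explained above."
      encoder := ZnCharacterEncoder detectEncoding: header.
      encoder decodeBytes: bytes ].
```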

-- 
Cyril Ferlicot
https://ferlicot.fr

http://www.synectique.eu
2 rue Jacques Prévert 01,
59650 Villeneuve d'ascq France
