> On 03.05.2017 at 18:10, Cyril Ferlicot D. <cyril.ferli...@gmail.com> wrote:
> 
>> On 03/05/2017 at 16:41, Sven Van Caekenberghe wrote:
>> 
>>> On 3 May 2017, at 12:18, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>> 
>>> Hi Cyril,
>>> 
>>> I want to try to write such a detector. I'll get back to you.
>> 
>> I added the following (Zn #bleedingEdge):
>> 
>> ===
>> Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.49
>> Author: SvenVanCaekenberghe
>> Time: 3 May 2017, 4:30:44.081888 pm
>> UUID: fe8b083d-010b-0d00-9df5-fde304bccfdc
>> Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.48
>> 
>> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and 
>> unreliably guess the encoding used by a collection of bytes
>> 
>> Add ZnCharacterEncoderTests>>#testDetectEncoding
>> 
>> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
>> 
>> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
>> ===
>> Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.31
>> Author: SvenVanCaekenberghe
>> Time: 3 May 2017, 4:31:09.469852 pm
>> UUID: 30ef8b3e-010b-0d00-9df6-4a9304bccfdc
>> Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.30
>> 
>> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and 
>> unreliably guess the encoding used by a collection of bytes
>> 
>> Add ZnCharacterEncoderTests>>#testDetectEncoding
>> 
>> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
>> 
>> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
>> ===
>> 
>> 
>> Now you can do the following:
>> 
>> ZnCharacterEncoder detectEncoding: ((FileLocator desktop / 'some.data') 
>> binaryReadStreamDo: [ :in | in upToEnd ]).
>> 
>> (FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
>>    | bytes encoder |
>>    bytes := in upToEnd.
>>    encoder := ZnCharacterEncoder detectEncoding: bytes.
>>    encoder decodeBytes: bytes ].
>> 
>> It works on the test file you gave me, but this process is just a guess, a 
>> heuristic that is unreliable and often wrong (especially for very similar 
>> byte encodings), see https://en.wikipedia.org/wiki/Charset_detection.
>> 
>> You can give the whole contents to the detector, or just a header.
>> 
>> I was a bit too optimistic though: this is basically an unsolvable problem. 
>> It is MUCH better to somehow know up front which encoding is used, or to 
>> know something usable about the contents (like the header of HTML or XML).
>> 
>> Sven
>> 
> 
> Thank you! I'll try this tomorrow. If it works well, I wonder if we can
> still include it in Pharo 6. Since it's only a small feature unused in
> Pharo, it should not break anything, but it would be a cool addition for Moose.
> 
> But since it is feature freeze, if people do not want it I won't push it
> for Pharo 6 :)
> 
It shouldn't be included. There is no such thing as a side-effect-free change. Moose 
can load a newer version of Zinc. That is how it is supposed to be.

Norbert
> -- 
> Cyril Ferlicot
> https://ferlicot.fr
> 
> http://www.synectique.eu
> 2 rue Jacques Prévert 01,
> 59650 Villeneuve d'ascq France
> 

