> On 03.05.2017 at 18:10, Cyril Ferlicot D. <cyril.ferli...@gmail.com> wrote:
>
>> On 03/05/2017 at 16:41, Sven Van Caekenberghe wrote:
>>
>>> On 3 May 2017, at 12:18, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>>
>>> Hi Cyril,
>>>
>>> I want to try to write such a detector. I'll get back to you.
>>
>> I added the following (Zn #bleedingEdge):
>>
>> ===
>> Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.49
>> Author: SvenVanCaekenberghe
>> Time: 3 May 2017, 4:30:44.081888 pm
>> UUID: fe8b083d-010b-0d00-9df5-fde304bccfdc
>> Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.48
>>
>> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and
>> unreliably guess the encoding used by a collection of bytes
>>
>> Add ZnCharacterEncoderTests>>#testDetectEncoding
>>
>> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
>>
>> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
>> ===
>> Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.31
>> Author: SvenVanCaekenberghe
>> Time: 3 May 2017, 4:31:09.469852 pm
>> UUID: 30ef8b3e-010b-0d00-9df6-4a9304bccfdc
>> Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.30
>>
>> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and
>> unreliably guess the encoding used by a collection of bytes
>>
>> Add ZnCharacterEncoderTests>>#testDetectEncoding
>>
>> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
>>
>> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
>> ===
>>
>>
>> Now you can do the following:
>>
>> ZnCharacterEncoder detectEncoding: ((FileLocator desktop / 'some.data')
>>    binaryReadStreamDo: [ :in | in upToEnd ]).
>>
>> (FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
>>    | bytes encoder |
>>    bytes := in upToEnd.
>>    encoder := ZnCharacterEncoder detectEncoding: bytes.
>>    encoder decodeBytes: bytes ].
>>
>> It works on the test file you gave me, but this process is just a guess, a
>> heuristic that is unreliable and often wrong (especially for very similar
>> byte encodings), see https://en.wikipedia.org/wiki/Charset_detection.
>>
>> You can give the whole contents to the detector, or just a header.
>>
>> I was a bit too optimistic though, this is basically an unsolvable problem.
>> It is MUCH better to somehow know up front what the encoding used is, or to
>> know something usable about the contents (like the header of HTML or XML).
>>
>> Sven
>>
>
> Thank you! I'll try this tomorrow. If it works well, I wonder if we can
> still include it in Pharo 6. Since it's only a small feature unused in
> Pharo, it should not break anything, and it would be a cool addition for Moose.
>
> But since it is feature freeze, if people do not want it, I won't push it
> for Pharo 6 :)
>

It shouldn't be included. There is no such thing as a side-effect-free change. Moose can load a newer version of Zinc. That is how it is supposed to be.

Norbert

> --
> Cyril Ferlicot
> https://ferlicot.fr
>
> http://www.synectique.eu
> 2 rue Jacques Prévert 01,
> 59650 Villeneuve d'ascq France