Hi Sven, this is cool. We are always losing time on CRLF/LF/CR issues. I lost most of my time in SRT2VTT on that part. Will you add a little paragraph to the Zinc chapter?

Stef

On Wed, May 3, 2017 at 8:18 PM, Norbert Hartl <norb...@hartl.name> wrote:

>
> > On 03.05.2017 at 18:10, Cyril Ferlicot D. <cyril.ferli...@gmail.com> wrote:
> >
> >> On 03/05/2017 at 16:41, Sven Van Caekenberghe wrote:
> >>
> >>> On 3 May 2017, at 12:18, Sven Van Caekenberghe <s...@stfx.eu> wrote:
> >>>
> >>> Hi Cyril,
> >>>
> >>> I want to try to write such a detector. I'll get back to you.
> >>
> >> I added the following (Zn #bleedingEdge):
> >>
> >> ===
> >> Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.49
> >> Author: SvenVanCaekenberghe
> >> Time: 3 May 2017, 4:30:44.081888 pm
> >> UUID: fe8b083d-010b-0d00-9df5-fde304bccfdc
> >> Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.48
> >>
> >> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes
> >>
> >> Add ZnCharacterEncoderTests>>#testDetectEncoding
> >>
> >> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
> >>
> >> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
> >> ===
> >> Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.31
> >> Author: SvenVanCaekenberghe
> >> Time: 3 May 2017, 4:31:09.469852 pm
> >> UUID: 30ef8b3e-010b-0d00-9df6-4a9304bccfdc
> >> Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.30
> >>
> >> Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes
> >>
> >> Add ZnCharacterEncoderTests>>#testDetectEncoding
> >>
> >> Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
> >>
> >> Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
> >> ===
> >>
> >> Now you can do the following:
> >>
> >> ZnCharacterEncoder detectEncoding: ((FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in | in upToEnd ]).
> >>
> >> (FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
> >>   | bytes encoder |
> >>   bytes := in upToEnd.
> >>   encoder := ZnCharacterEncoder detectEncoding: bytes.
> >>   encoder decodeBytes: bytes ].
> >>
> >> It works on the test file you gave me, but this process is just a guess, a heuristic that is unreliable and often wrong (especially for very similar byte encodings), see https://en.wikipedia.org/wiki/Charset_detection.
> >>
> >> You can give the whole contents to the detector, or just a header.
> >>
> >> I was a bit too optimistic though, this is basically an unsolvable problem. It is MUCH better to somehow know up front what the encoding used is, or to know something usable about the contents (like the header of HTML or XML).
> >>
> >> Sven
> >>
> >
> > Thank you! I'll try this tomorrow. If it works well, I wonder if we can still include it in Pharo 6. Since it's only a little feature unused in Pharo, it should not break anything, but it would be a cool addition for Moose.
> >
> > But since it is feature freeze, if people do not want it I'll not push it for Pharo 6 :)
> >
> It shouldn't be included. There is no such thing as a side-effect-free change. Moose can load a newer version of Zinc. That is how it is supposed to be.
>
> Norbert
> > --
> > Cyril Ferlicot
> > https://ferlicot.fr
> >
> > http://www.synectique.eu
> > 2 rue Jacques Prévert 01,
> > 59650 Villeneuve d'ascq France