> On 3 May 2017, at 12:18, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>
> Hi Cyril,
>
> I want to try to write such a detector. I'll get back to you.
I added the following (Zn #bleedingEdge):

===
Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.49
Author: SvenVanCaekenberghe
Time: 3 May 2017, 4:30:44.081888 pm
UUID: fe8b083d-010b-0d00-9df5-fde304bccfdc
Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.48

Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes
Add ZnCharacterEncoderTests>>#testDetectEncoding
Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
===
Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.31
Author: SvenVanCaekenberghe
Time: 3 May 2017, 4:31:09.469852 pm
UUID: 30ef8b3e-010b-0d00-9df6-4a9304bccfdc
Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.30

Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes
Add ZnCharacterEncoderTests>>#testDetectEncoding
Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder
Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
===

Now you can do the following:

ZnCharacterEncoder detectEncoding:
  ((FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in | in upToEnd ]).

(FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
  | bytes encoder |
  bytes := in upToEnd.
  encoder := ZnCharacterEncoder detectEncoding: bytes.
  encoder decodeBytes: bytes ].

It works on the test file you gave me, but this process is just a guess, a heuristic that is unreliable and often wrong (especially for very similar byte encodings); see https://en.wikipedia.org/wiki/Charset_detection.

You can give the detector the whole contents, or just a header.

I was a bit too optimistic, though: this is basically an unsolvable problem. It is MUCH better to somehow know up front what encoding was used, or to know something usable about the contents (like the header of HTML or XML). A sketch of decoding with an explicitly chosen encoder follows after the quoted message below.

Sven

> Any chance you could give me (part of) a file that causes you trouble (one
> that is legal Latin-1, yet does not fail as UTF-8 while being decoded
> wrongly as UTF-8)?
>
> Sven
>
>> On 3 May 2017, at 11:40, Cyril Ferlicot D. <cyril.ferli...@gmail.com> wrote:
>>
>> Hello,
>>
>> We have a problem using Moose because we have files whose encoding we
>> don't know. Currently we have this implementation to get the content
>> of a file:
>>
>> completeText
>>   self fileReference exists ifFalse: [ ^ '' ].
>>   ^ self fileReference readStreamDo: [ :s |
>>     [ s contents ]
>>       on: Error
>>       do: [
>>         [ s converter: Latin1TextConverter new; contents ]
>>           on: Error
>>           do: [ '' ] ] ]
>>
>> But we have a problem because we currently have some files at
>> Synectique in ISO-8859-1. The problem is that #contents is able to read
>> some of these files without throwing an error, yet the content is wrong
>> because it is not decoded with the right encoding.
>>
>> Thus I wonder if it is possible to get the encoding of a FileReference
>> in Pharo, to be able to read the file with the right encoding? Something
>> like the bash command `file -I myFile.txt`.
>>
>> --
>> Cyril Ferlicot
>> https://ferlicot.fr
>>
>> http://www.synectique.eu
>> 2 rue Jacques Prévert 01,
>> 59650 Villeneuve d'ascq France
>>
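For reference, here is a minimal sketch of what "knowing the encoding up front" could look like for Cyril's completeText, using the Zn encoders directly: read the bytes once, decode them strictly as UTF-8, and fall back to ISO-8859-1 (the encoding the problematic Synectique files are known to use) only when the UTF-8 decode actually fails. The selector completeText, the fileReference accessor and the Latin-1 fallback are taken from Cyril's quoted method; catching ZnCharacterEncodingError is an assumption, adjust it to whatever error your Zinc version signals on invalid byte sequences.

completeText
  "Decode with explicit Zn encoders instead of guessing the encoding.
  Try strict UTF-8 first; fall back to ISO-8859-1 only when UTF-8
  decoding fails. ZnCharacterEncodingError is assumed to be the error
  signalled on invalid input."
  | bytes |
  self fileReference exists ifFalse: [ ^ '' ].
  bytes := self fileReference binaryReadStreamDo: [ :in | in upToEnd ].
  ^ [ (ZnCharacterEncoder newForEncoding: 'utf-8') decodeBytes: bytes ]
      on: ZnCharacterEncodingError
      do: [ (ZnCharacterEncoder newForEncoding: 'iso-8859-1') decodeBytes: bytes ]

Since every byte sequence is valid ISO-8859-1, the fallback always yields a String; it can still be the wrong text when a file really is Latin-1 but also happens to be valid UTF-8, which is exactly the ambiguity that detectEncoding: can only guess at.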