On 25 Sep 2014, at 5:00, Hilaire Fernandes <hila...@drgeo.eu> wrote:
> On 24/09/2014 18:48, Benjamin Pollack wrote:
>> On Tue, 23 Sep 2014 08:51:54 -0400, Hilaire <hila...@drgeo.eu> wrote:
>>
>>> On 23/09/2014 14:09, Damien Cassou wrote:
>>>> I recently read documents about utf-8 encoding. In all of them, the
>>>> author says that pathnames should be kept as is because you never know
>>>> which encoding the filesystem uses. So, a filename should probably be
>>>> a bytearray.
>>>
>>> yes, but a #é should be encoded in two bytes.
>>
>> As noted in my previous message, "é" could be represented as either
>> one or two Unicode code points, and these in turn could validly be
>> either two or three bytes in UTF-8. My gut says that $é should be
>> U+00E9, because otherwise you should have to use two Characters ($e
>> and $´), but you could legitimately argue otherwise as well, and at
>> any rate, #é could definitely be either. This is likely the core of
>> the issue you're hitting.
> As I understand it, #é should be encoded on two bytes and only two bytes.
> Only ASCII is coded as 1 byte with UTF-8.
> See ref. on Wikipedia

Hilaire:
Benjamin is talking about which Unicode normalization form é should be represented in, which is orthogonal to the encoding; see
http://en.wikipedia.org/wiki/Unicode_equivalence#Combining_and_precomposed_characters .

So é can indeed be encoded in two different ways in UTF-8 (as in any other Unicode encoding), either as
#[c3 a9] (encoding U+00E9, "Latin small letter e with acute"), or as
#[65 cc 81] (encoding U+0065, "Latin small letter e", followed by U+0301, "Combining acute accent").

Benjamin:
Since the base path that contains the problematic character originates from a filesystem primitive, we can safely assume it's already in a canonical form*; Pharo does no automatic normalization. (That is, if the path had been e + ´, the internal string would have two separate characters as well.)

Cheers,
Henry

* Only Mac OS X defines a canonical form for its paths anyway; the others don't care.
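
For anyone who wants to check the two byte sequences above outside of Pharo, here is a minimal sketch in Python (not Pharo code, and not what the filesystem primitive does). It assumes only the standard unicodedata module; the variable names are made up for illustration.

```python
import unicodedata

# The two canonically equivalent spellings of "é" discussed above.
composed = "\u00e9"     # U+00E9, precomposed "Latin small letter e with acute"
decomposed = "e\u0301"  # U+0065 followed by U+0301, "Combining acute accent"

# UTF-8 encodes code points one by one; it never normalizes.
composed_bytes = composed.encode("utf-8")      # 2 bytes: c3 a9
decomposed_bytes = decomposed.encode("utf-8")  # 3 bytes: 65 cc 81

print(composed_bytes.hex(" "))
print(decomposed_bytes.hex(" "))

# Without normalization the two strings compare as different.
same_raw = composed == decomposed

# After normalizing both to the same form (NFC or NFD) they compare equal.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed

print(same_raw, same_nfc, same_nfd)
```

The point is that the byte count depends on which normalization form the string is already in, not on UTF-8 itself, so a path compared byte for byte only matches if both sides use the same form.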