On 25 Sep 2014, at 5:00, Hilaire Fernandes <hila...@drgeo.eu> wrote:

> On 24/09/2014 18:48, Benjamin Pollack wrote:
>> On Tue, 23 Sep 2014 08:51:54 -0400, Hilaire <hila...@drgeo.eu> wrote:
>> 
>>> On 23/09/2014 14:09, Damien Cassou wrote:
>>>> I recently read documents about utf-8 encoding. In all of them, the
>>>> author says that pathnames should be kept as is because you never know
>>>> which encoding the filesystem uses. So, a filename should probably be
>>>> a bytearray.
>>> 
>>> 
>>> yes, but a #é should be encoded in two bytes.
>> 
>> As noted in my previous message, "é" could be represented as either
>> one or two Unicode code points, and these in turn could validly be
>> either two or three bytes in UTF-8.  My gut says that $é should be
>> U+00E9, because otherwise you would have to use two Characters ($e
>> and $´), but you could legitimately argue otherwise as well, and at
>> any rate, #é could definitely be either.  This is likely the core of
>> the issue you're hitting.
> As I understand it, #é should be encoded in two bytes and only two bytes.
> Only ASCII is encoded as 1 byte in UTF-8.
> See the reference on Wikipedia.

Hilaire: Benjamin is talking about which Unicode normalization form é should be
represented in, which is orthogonal to the encoding; see
http://en.wikipedia.org/wiki/Unicode_equivalence#Combining_and_precomposed_characters
So é can indeed be encoded in two different ways in UTF-8 (as in any other
encoding): as #[c3 a9] (encoding U+00E9, "Latin small letter e with acute"),
or as #[65 cc 81] (encoding U+0065, "Latin small letter e", followed by U+0301,
"Combining acute accent").

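To make the two byte sequences concrete, here is a minimal sketch. It uses Python rather than Pharo purely for illustration, and the variable names are made up for the example. It shows that the precomposed and decomposed forms of é are different strings with different UTF-8 encodings, and that they only compare equal after explicit normalization.

```python
import unicodedata

# Precomposed form: a single code point, U+00E9.
precomposed = "\u00e9"
# Decomposed form: U+0065 (e) followed by U+0301 (combining acute accent).
decomposed = "e\u0301"

# The same visible character, two different UTF-8 byte sequences.
print(precomposed.encode("utf-8").hex(" "))  # c3 a9
print(decomposed.encode("utf-8").hex(" "))   # 65 cc 81

# Without normalization the two strings are simply not equal.
print(precomposed == decomposed)  # False

# Normalizing to NFC (composed) or NFD (decomposed) makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```
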
Benjamin: Since the base path that contains the problematic character
originates from a filesystem primitive, we can safely assume it's already in a
canonical form*. Pharo does no automatic normalization (that is, if the path
had been e + ´, the internal string would have two separate characters
as well).
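
As a rough illustration of what "no automatic normalization" means in practice, the sketch below (again Python, not Pharo; `same_name` and the sample names are hypothetical) compares a name as it might come back from the filesystem in decomposed form with the same name typed in precomposed form. The comparison only succeeds if the caller normalizes both sides explicitly.

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Assumed example: the filesystem returned the decomposed form (e + U+0301),
# while the user typed the precomposed form (U+00E9).
from_filesystem = "cafe\u0301"
typed_by_user = "caf\u00e9"

print(from_filesystem == typed_by_user)           # False: no normalization applied
print(same_name(from_filesystem, typed_by_user))  # True: both normalized to NFC
```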

Cheers,
Henry

* Only Mac OS X defines a canonical form for its paths anyway; the others
don't care.
