Adi Stav wrote:
>On Tue, Feb 12, 2002 at 09:58:58AM +0200, Shachar Shemesh wrote:
>
>>Sorry for retreating in the thread, but an important note struck me from
>>my past.
>>
>>Nadav Har'El wrote:
>>
>>>No, UNIX traditionally operates on strings of "chars" (bytes/octets). No
>>>special treatment is ever given by system calls to any byte except null
>>>(and "/" in pathnames).
>>>
>>Ok, what if the locale allows "/" as a valid byte?
>>
>>Think that is outrageous? Then either think again, or try to port your
>>app to Japanese. I am not 100% sure about "/", but "\" is a legitimate
>>second byte in some Japanese MBCS-encoded characters. If your locale is
>>Japanese, these characters are read in as a whole, and simply form a
>>path. If not, well, you can't use Japanese characters (Unicode
>>notwithstanding).
>>
>
>It's not outrageous... ASCII allows "/" as a valid byte, and therefore
>we can't use it in filenames. Whatever "/" stands for in Japanese, it
>can't be used as part of a file name. Tough. I'd prefer open() to take
>a list of names, but no one asked me :)
>
I am sorry, but you have managed to completely and utterly miss what I said.
What I said was this:
There are characters in the Japanese MBCS encoding that take two bytes.
These characters are not the "/" sign; they are just ordinary characters
in the language. However, when the byte sequence that encodes such a
character is viewed one byte at a time, a byte with the ASCII value of
"/" may appear in it, even though the "/" character never appeared in
the original text. As a result, whoever is parsing the string must be
aware that it is a Japanese-encoded string, in order to avoid such
collisions.
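To make the collision concrete, here is a minimal Python sketch (Python and the specific characters are my addition, not part of the original discussion). It uses Shift-JIS, a common Japanese MBCS encoding, where the katakana character "ソ" carries 0x5C (ASCII "\") as its second byte:

```python
# The katakana "ソ" (U+30BD) encodes in Shift-JIS as the two bytes
# 0x83 0x5C -- and 0x5C is the ASCII code for "\".
text = "ソ"
encoded = text.encode("shift_jis")
assert encoded == b"\x83\x5c"

# A byte-at-a-time parser that knows nothing about the encoding would
# see an ASCII backslash here, even though none was ever typed.
assert encoded[1] == ord("\\")
```

An encoding-unaware parser scanning those bytes for "\" (or, on a hypothetical encoding where it occurred, "/") would split the character in half, which is exactly the collision described above.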
>>Now, I don't know the UTF-8 encoding, so I don't know how likely it is
>>to happen there. Some attempt was made to avoid problematic characters.
>>MBCS made sure null cannot be a second byte, for example. I do know that
>>trying to use non-UTF to encode Japanese will require some OS support in
>>the parsing of the string.
>>
>
>I love how well thought out and compatible UTF-8 is. If such trouble
>pushes people away from partial and incompatible encodings, so much
>the better.
>
>UTF-8 is designed to be 100% backwards compatible with ASCII -- the
>encoding of an ASCII string in UTF-8 is exactly the same. Sequences of
>two or more non-ASCII bytes (over 127) stand for higher Unicode code
>points. So if you take a UTF-8 string of a Latin-language text and try
>to display it as if it were Latin-1, or alternatively pass it through
>a 7-bit-only system and try to display it as ASCII, most of it will
>come out intact (including the "/") and only the special characters
>will have noise in their place.
>
Again, totally irrelevant. If the "aleph" character happened to have a
"/" as one of the bytes of its encoding, encoding-unaware parsers would
not allow you to have a filename with an aleph in it. I am well aware
that this doesn't happen; I am only bringing it up as a clarifying
example for my previous claim that encoding-free parsing is not possible.
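For what it's worth, a short Python sketch (my addition) shows why the aleph case cannot arise in UTF-8: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as "/" (0x2F) can never appear inside one:

```python
# HEBREW LETTER ALEF (U+05D0) encodes in UTF-8 as two bytes,
# both of which are >= 0x80.
aleph = "\u05d0"
encoded = aleph.encode("utf-8")
assert encoded == b"\xd7\x90"

# No byte of the sequence is in the ASCII range, so a byte-at-a-time
# parser can never mistake part of an aleph for "/".
assert all(b >= 0x80 for b in encoded)
```

This is the design property that lets UNIX path parsing stay byte-oriented under UTF-8, which the MBCS encodings discussed above do not share.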
Shachar