Adi Stav wrote:
>On Tue, Feb 12, 2002 at 09:58:58AM +0200, Shachar Shemesh wrote:
>
>>Sorry for retreating in the thread, but an important note struck me from
>>my past.
>>
>>Nadav Har'El wrote:
>>
>>>No, UNIX traditionally operates on strings of "chars" (bytes/octets). No
>>>special treatment is ever given by system calls to any byte except null
>>>(and "/" in pathnames).
>>>
>>Ok, what if the locale allows "/" as a valid byte?
>>
>>Think that is outrageous? Then either think again, or try to port your
>>app to Japanese. I am not 100% sure about "/", but "\" is a legitimate
>>second byte in some Japanese MBCS-encoded characters. If your locale is
>>Japanese, these characters are read in as a whole, and simply form a
>>path. If not, well, you can't use Japanese characters (Unicode
>>notwithstanding).
>>
>
>It's not outrageous... ASCII allows "/" as a valid byte, and therefore
>we can't use it in filenames. Whatever "/" stands for in Japanese, it
>can't be used as part of a file name. Tough. I'd prefer open() to take
>a list of names, but no one asked me :)
>
I am sorry, but you have managed to completely and utterly miss what I said.
What I said was this:
There are characters in the Japanese MBCS encoding that take two bytes.
These characters are not the "/" sign; they are just ordinary characters
in the language. However, when the byte sequence that encodes such a
character is viewed one byte at a time, a byte with the ASCII value of
"/" may appear in it, even though the "/" character never appeared in
the original text. As a result, whoever is parsing the string must be
aware that it is a Japanese-encoded string, in order to avoid such
collisions.
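To make the collision concrete, here is a minimal Python sketch (Python and the specific characters are my addition, not part of the original discussion). It uses Shift-JIS, a common Japanese MBCS encoding, where the katakana character "ソ" carries 0x5C (ASCII "\") as its second byte:

```python
# The katakana "ソ" (U+30BD) encodes in Shift-JIS as the two bytes
# 0x83 0x5C -- and 0x5C is the ASCII code for "\".
text = "ソ"
encoded = text.encode("shift_jis")
assert encoded == b"\x83\x5c"

# A byte-at-a-time parser that knows nothing about the encoding would
# see an ASCII backslash here, even though none was ever typed.
assert encoded[1] == ord("\\")
```

An encoding-unaware parser scanning those bytes for "\" (or, on a hypothetical encoding where it occurred, "/") would split the character in half, which is exactly the collision described above.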
>>Now, I don't know the UTF-8 encoding, so I don't know how likely it is
>>to happen there. Some attempt was made to avoid problematic characters.
>>MBCS made sure null cannot be a second byte, for example. I do know that
>>trying to use non-UTF to encode Japanese will require some OS support in
>>the parsing of the string.
>>
>
>I love how well thought out and compatible UTF-8 is. If such trouble
>pushes people away from partial and incompatible encodings, so much
>the better.
>
>UTF-8 is designed to be 100% backwards compatible with ASCII -- the
>encoding of an ASCII string in UTF-8 is exactly the same. Sequences of
>two or more non-ASCII bytes (over 127) stand for higher Unicode code
>points. So if you take a UTF-8 string of a Latin-language text and try
>to display it as if it were Latin-1, or alternatively pass it through
>a 7-bit-only system and try to display it as ASCII, most of it will
>come out intact (including the "/") and only the special characters
>will have noise in their place.
>
Again, totally irrelevant. If the "aleph" character happened to have a
"/" as one of the bytes of its encoding, encoding-unaware parsers would
not allow you to have a filename with an aleph in it. I am well aware
that this doesn't happen; I am only bringing it up as a clarifying
example for my previous claim that encoding-free parsing is not possible.
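For what it's worth, a short Python sketch (my addition) shows why the aleph case cannot arise in UTF-8: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as "/" (0x2F) can never appear inside one:

```python
# HEBREW LETTER ALEF (U+05D0) encodes in UTF-8 as two bytes,
# both of which are >= 0x80.
aleph = "\u05d0"
encoded = aleph.encode("utf-8")
assert encoded == b"\xd7\x90"

# No byte of the sequence is in the ASCII range, so a byte-at-a-time
# parser can never mistake part of an aleph for "/".
assert all(b >= 0x80 for b in encoded)
```

This is the design property that lets UNIX path parsing stay byte-oriented under UTF-8, which the MBCS encodings discussed above do not share.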
Shachar