On 11.03.2015 16:55, "Philipp Hahn" <h...@univention.de> wrote:
>
> Hello,
>
> On 11.03.2015 07:22, Xiaodong Gong wrote:
> >> Hope that clarified things.
> ...
> > first, your patch is very clear, a good sample.
> >
> > What I said before about storing ASCII in the kernel was a mistake. I
> > mean that glibc needs the arguments of functions such as fopen(path)
> > to be ASCII.
>
> No:
> ASCII = *7* bit, see "man 7 ascii".
>
> Kernel = *8* bit, that is, the kernel doesn't care if you use
> ISO-8859-1, UTF-8, BIG5, GB2312, or any other encoding you can get by
> running "iconv --list".
> For the kernel only '\0'=0x00 and '/'=0x2f are special; all other
> characters the kernel doesn't care about and passes them in and out
> unmodified.
>
> Most character sets have the ASCII alphabet in their character range
> 0x00-0x7f, which solves the '\0' and '/' issue nicely.
>
> So again:
> - If you use opendir() and readdir(), the kernel returns to you an
> 8 bit byte sequence.
> - To convert that into a character sequence, you must know which
> encoding was used when the file was created. This information is not
> stored explicitly in the file-system, file-name, or anywhere else.
> - The only hint you get is LC_CTYPE, which is set by the user to tell
> you which encoding should be used to convert a byte-stream into a
> character-string.
> - If I create a new file on the Linux text console, where I did a
> "unicode_start", I get a UTF-8 byte sequence from the input layer,
> which is passed unmodified through the getty and by the shell to the
> creat() call. You don't need to know the encoding, you just pass the
> data in and out unmodified.
> - When typing "ls" the kernel again returns that byte sequence, which
> gets passed through the shell to the Linux frame buffer, which
> translates that UTF-8 sequence to a character and picks the right
> glyph to be displayed on the screen.
> - If I don't switch the Linux console to Unicode mode, I get a
> different byte sequence from the input layer. That different byte
> sequence would be stored on the disk when creating a file. (This
> translation is installed by running the "loadkeys" command.)
> - If I do the same in X11, the translation from key-codes to
> characters and back is done by the terminal (or application). See
> "man 1 xterm" for "-lc" and "-u8".
> - BUT when you want to create a specific character encoding, you MUST
> know from which encoding you start. Assuming ASCII or UTF-8 is wrong;
> you MUST check LC_ALL/LC_CTYPE/LANG by querying nl_langinfo(CODESET).
>
> So if I gave you a disk containing a file with the name "\xa1",
> depending on the locale you would see a different glyph:
> $ for ((c=1;c<=15;c++));do printf '\xa1'|recode ISO-8859-$c..dump-with-names 2>/dev/null|tail -n 1;done
> 00A1 !I inverted exclamation mark
> 0104 A; latin capital letter a with ogonek
> 0126 H/ latin capital letter h with stroke
> 0104 A; latin capital letter a with ogonek
> 0401 IO cyrillic capital letter io
> 201B 9' single high-reversed-9 quotation mark
> 00A1 !I inverted exclamation mark
> 0104 A; latin capital letter a with ogonek
> 00A1 !I inverted exclamation mark
> 00A1 !I inverted exclamation mark
> 1E02 B. latin capital letter b with dot above
> 00A1 !I inverted exclamation mark
>
> In a UTF-8 environment you would get an error, as "\xa1" is not a valid
> UTF-8 byte sequence.
>
> Read "man 7 unicode", especially the section "Unicode Under Linux", or
> "man 7 charsets".
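
For reference, a minimal sketch (assuming glibc) of the locale check described
above. Note that setlocale() must be called first, otherwise nl_langinfo()
reports the codeset of the default "C" locale:

/* codeset.c - print the codeset selected by LC_ALL/LC_CTYPE/LANG */
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* The empty string means: take the locale from the environment
     * (LC_ALL, then LC_CTYPE, then LANG). */
    if (setlocale(LC_CTYPE, "") == NULL)
        fprintf(stderr, "unsupported locale, staying in \"C\"\n");

    /* Prints e.g. "UTF-8", "GB2312" or "ANSI_X3.4-1968" (plain ASCII). */
    printf("%s\n", nl_langinfo(CODESET));
    return 0;
}
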
>
> I think:
>
> Any program basically has two options:
>
> 1. The program does not care about different character sets and just
> passes file-names and data in and out as byte streams. That is
> perfectly okay and most UNIX shell commands work just fine that way.
>
> 2. The program is encoding aware, because for example it works on
> characters instead of bytes (like "wc --bytes" vs. "wc --chars") or
> needs to perform a conversion between encodings. Then the sanest thing
> is to
> - query the encoding of the environment once (or per network
> connection),
> - convert any input data from that encoding into a (fixed) internal
> format like wchar/utf-16/utf-32, including file-names, file-content,
> etc., and
> - convert the internal data back into the right format on output,
> which also includes calling APIs like open().
>
> Otherwise you always have to remember whether your char[] buffer
> contains "some byte stream, which needs to be decoded before being
> used" or "an already decoded character string". That is why most
> libraries and frameworks contain wrappers for file handling, input and
> output: they internally use one data type for characters consistently
> and hide all the explicit conversion from you by providing wrappers.
>
> > iconv_open(utf16le, ascii) in encode
> > iconv_open(ascii, utf16le) in decode
> > iconv_open(codeset, ascii) in show
>
> That would be correct ONLY if you store the file-name internally as
> ASCII, which would prevent you from handling file-names containing any
> character outside the ASCII codeset.
> You should use "UTF-8" instead of "ascii", as that allows you to
> handle file-names containing any valid characters.
> This would make the conversion in show() trivial when codeset="UTF-8",
> as iconv() would not have to do anything there.
>
> Philipp
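
For the internal UTF-8 representation suggested above, here is a minimal
sketch (assuming glibc and iconv(3)) of the input-side conversion; the helper
name to_utf8() is only illustrative, and real code would have to grow the
output buffer on E2BIG and decide how to treat invalid input sequences:

/* to_utf8.c - convert a byte string from the current locale's codeset
 * into UTF-8, the suggested internal format. */
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *to_utf8(const char *in)
{
    /* iconv_open(to, from): convert from the locale's codeset to UTF-8. */
    iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1)
        return NULL;

    size_t inleft = strlen(in);
    size_t outsize = 4 * inleft + 1;   /* generous upper bound */
    char *out = malloc(outsize);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)in, *outp = out;
    size_t outleft = outsize - 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        free(out);
        out = NULL;
    } else {
        *outp = '\0';
    }
    iconv_close(cd);
    return out;
}

int main(int argc, char **argv)
{
    setlocale(LC_CTYPE, "");   /* pick up LC_ALL/LC_CTYPE/LANG */
    char *name = to_utf8(argc > 1 ? argv[1] : "");
    printf("%s\n", name ? name : "(conversion failed)");
    free(name);
    return 0;
}

The reverse direction, iconv_open(nl_langinfo(CODESET), "UTF-8"), would then
be used on output, just before calling APIs like open().
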
I get it. Linux has supported i18n for a long time; I had even forgotten
about GB2312 :( I will change it to what you did in your vhd_util patch.

Lastly, THANKS A LOT for your time.