Hello,

On 11.03.2015 07:22, Xiaodong Gong wrote:
>> Hope that clarified things.
...
> first, your patch is very clear, a good sample.
>
> storing ascii code in the kernel, as I said before, is a mistake; I mean
> that glibc needs the arguments of functions such as fopen(path) to be
> ascii code
No: ASCII = *7* bit, see "man 7 ascii". Kernel = *8* bit, that is the
kernel doesn't care if you use ISO-8859-1, UTF-8, BIG5, GB2312, or any
other encoding you can get by running "iconv --list".

For the kernel only '\0'=0x00 and '/'=0x2f are special; it doesn't care
about any other character and passes them all in and out unmodified.
Most character sets have the ASCII alphabet in their range 0x00-0x7f,
which solves the '\0' and '/' issue nicely.

So again:

- If you use opendir() and readdir(), the kernel returns to you an 8 bit
  byte sequence.

- To convert that into a character sequence, you must know which encoding
  was used when the file was created. This information is not stored
  explicitly in the file-system, the file-name, or anywhere else.

- The only hint you get is LC_CTYPE, which is set by the user to tell you
  which encoding should be used to convert a byte-stream into a
  character-string.

- If I create a new file on the Linux text console, where I did a
  "unicode_start", I get a UTF-8 byte sequence from the input layer,
  which is passed unmodified through the getty and the shell to the
  creat() call. You don't need to know the encoding, you just pass the
  data in and out unmodified.

- When typing "ls" the kernel again returns that byte sequence, which
  gets passed through the shell to the Linux frame buffer, which
  translates that UTF-8 sequence into a character and picks the right
  glyph for being displayed on the screen.

- If I don't switch the Linux console to Unicode mode, I get a different
  byte sequence from the input layer. That different byte sequence would
  be stored on the disk when creating a file. (This translation is
  installed by running the "loadkeys" command.)

- If I do the same in X11, the translation from key-codes to characters
  and back is done by the terminal (or application). See "man 1 xterm"
  for "-lc" and "-u8".

- BUT when you want to create a specific character encoding, you MUST
  know from which encoding you start. Assuming ASCII or UTF-8 is wrong;
  you MUST check LC_ALL/LC_CTYPE/LANG by querying nl_langinfo(CODESET)
  (see the small C sketch further down).

So if I gave you a disk containing a file with the name "\xa1", you would
see a different glyph depending on the locale:

$ for ((c=1;c<=15;c++));do printf '\xa1'|recode ISO-8859-$c..dump-with-names 2>/dev/null|tail -n 1;done
00A1   !I   inverted exclamation mark
0104   A;   latin capital letter a with ogonek
0126   H/   latin capital letter h with stroke
0104   A;   latin capital letter a with ogonek
0401   IO   cyrillic capital letter io
201B   9'   single high-reversed-9 quotation mark
00A1   !I   inverted exclamation mark
0104   A;   latin capital letter a with ogonek
00A1   !I   inverted exclamation mark
00A1   !I   inverted exclamation mark
1E02   B.   latin capital letter b with dot above
00A1   !I   inverted exclamation mark

In a UTF-8 environment you would get an error, as "\xa1" is not a valid
UTF-8 byte sequence.

Read "man 7 unicode", especially the section "Unicode Under Linux", or
"man 7 charsets".

> I think:

Any program basically has two options:

1. The program does not care about different character sets and just
   passes file-names and data in and out as byte streams. That is
   perfectly okay and most UNIX shell commands work just fine that way.

2. The program is encoding aware, for example because it works on
   characters instead of bytes (like "wc --bytes" vs. "wc --chars") or
   needs to perform a conversion between encodings.
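To make that concrete, here is a minimal, untested C sketch of what an
encoding-aware program (option 2) has to do: query the locale's codeset
via setlocale()/nl_langinfo(CODESET) and convert the raw bytes returned
by readdir() into one fixed internal format with iconv(). The helper
name decode_filename(), the choice of UTF-8 as internal format, the
buffer size, and reading "." are just assumptions for illustration, not
anything your patch has to follow:

#include <dirent.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: convert a raw file-name (bytes in the locale's
 * codeset) into UTF-8, which this sketch uses as its internal format. */
static int decode_filename(iconv_t cd, const char *raw,
                           char *out, size_t outlen)
{
    char *in = (char *)raw;
    size_t inleft = strlen(raw);
    size_t outleft = outlen - 1;

    iconv(cd, NULL, NULL, NULL, NULL);          /* reset conversion state */
    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
        return -1;                              /* e.g. invalid byte sequence */
    *out = '\0';
    return 0;
}

int main(void)
{
    setlocale(LC_ALL, "");                      /* honour LC_ALL/LC_CTYPE/LANG */
    const char *codeset = nl_langinfo(CODESET); /* e.g. "UTF-8", "GB2312", ... */

    iconv_t cd = iconv_open("UTF-8", codeset);  /* locale codeset -> UTF-8 */
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }

    DIR *dir = opendir(".");
    struct dirent *de;
    char name[4096];

    while (dir && (de = readdir(dir)) != NULL) {
        if (decode_filename(cd, de->d_name, name, sizeof(name)) == 0)
            printf("%s\n", name);               /* now known to be UTF-8 */
        else
            fprintf(stderr, "cannot decode '%s' as %s\n",
                    de->d_name, codeset);
    }

    if (dir)
        closedir(dir);
    iconv_close(cd);
    return 0;
}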
For such an encoding-aware program, the sanest thing is to

- query the encoding of the environment once (or once per network
  connection),

- convert any input data from that encoding into a (fixed) internal
  format like wchar/UTF-16/UTF-32, including file-names, file-content,
  etc.,

- convert the internal data back into the right format on output, which
  also includes calling APIs like open().

Otherwise you always have to remember whether your char[] buffer contains
"some byte stream, which needs to be decoded before being used" or an
"already decoded character string". That is why most libraries and
frameworks contain wrappers for files, input, and output: they internally
use one data type for characters consistently and hide all the explicit
conversion from you behind those wrappers.

> iconv_open(utf16le,ascii) in encode
> iconv_open(ascii,utf16le) in decode
> iconv_open(codeset,ascii) in show

That would be correct ONLY if you store the file-name internally as
ASCII, which would prevent you from handling file-names containing any
character outside the ASCII codeset. You should use "UTF-8" instead of
"ascii", as that allows you to handle file-names containing any valid
character. It would also make the conversion in show() trivial when
codeset="UTF-8", as iconv() would not have to change anything there.
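If it helps, here is a tiny, untested sketch of those three iconv_open()
directions with "UTF-8" as the internal format instead of "ascii". The
encode/decode/show roles and the UTF-16LE side are the ones from your
mail; the variable names and the test output are only illustrative:

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    setlocale(LC_ALL, "");                      /* once, at program start */

    /* internal format is UTF-8, so any valid file-name can be stored */
    iconv_t enc = iconv_open("UTF-16LE", "UTF-8");            /* encode */
    iconv_t dec = iconv_open("UTF-8", "UTF-16LE");            /* decode */
    iconv_t shw = iconv_open(nl_langinfo(CODESET), "UTF-8");  /* show   */

    if (enc == (iconv_t)-1 || dec == (iconv_t)-1 || shw == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }

    /* when the locale codeset is already UTF-8, show() is a no-op */
    printf("show converts UTF-8 -> %s\n", nl_langinfo(CODESET));

    iconv_close(enc);
    iconv_close(dec);
    iconv_close(shw);
    return 0;
}

Philipp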