On 11 March 2015 at 16:55, "Philipp Hahn" <h...@univention.de> wrote:
> Hello,
> On 11.03.2015 07:22, Xiaodong Gong wrote:
> >> Hope that clarified things.
> ...
> > First, your patch is very clear, a good sample.
> >
> > What I said before about storing ASCII codes in the kernel was a
> > mistake. I meant that glibc needs the arguments of functions such as
> > fopen(path) to be ASCII.
> No:
>  ASCII = *7* bit, see "man 7 ascii".
>  Kernel = *8* bit, that is the kernel doesn't care if you use
> ISO-8859-1, UTF-8, BIG5, GB2312, or any other encoding you can get by
> running "iconv --list".
> For the kernel only '\0'=0x00 and '/'=0x2f are special, all other
> characters the kernel doesn't care for and passes them in and out
> unmodified.
> Most character sets have the ASCII alphabet in their character range
> 0x00-0x7f, which solved the '\0' and '/' issue nicely.
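
A small sketch of that pass-through behaviour, in Python rather than C for brevity. The helper name roundtrip_name is made up for this illustration, and it assumes a typical Linux filesystem (ext4, tmpfs) that accepts arbitrary bytes in names:

```python
import os
import tempfile


def roundtrip_name(name: bytes) -> list:
    """Create an empty file called `name` in a fresh directory and list it."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        path = os.path.join(tmp_bytes, name)
        # A bytes path is handed to the kernel unmodified.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        os.close(fd)
        # A bytes argument makes listdir() return raw bytes as well.
        return os.listdir(tmp_bytes)


if __name__ == "__main__":
    # 0xA1 alone is not valid UTF-8, yet it should come back untouched.
    print(roundtrip_name(b"\xa1"))
```

No encoding is named anywhere in this sketch; the name goes in and comes out as the same bytes.
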
> So again:
> - If you use opendir() and readdir(), the kernel returns to you an 8-bit
> byte sequence.
> - To convert that into a character sequence, you must know which
> encoding was used when the file was created. This information is not
> stored explicitly in the file-system, file-name, or anywhere else.
> - The only hint you get is LC_CTYPE, which is set by the user to tell
> you which encoding should be used to convert a byte-stream into a
> character-string.
> - If I create a new file on the Linux text console, where I did a
> "unicode_start", I get a UTF-8 byte sequence from the input layer,
> which is passed unmodified through the getty and the shell to the
> creat() call. You don't need to know the encoding, you just pass the
> data in and out unmodified.
> - When typing "ls" the kernel again returns that byte sequence, which
> gets passed through the shell to the Linux frame buffer, which
> translates that UTF-8 sequence to a character and picks the right glyph
> for being displayed on the screen.
> - If I don't switch the Linux console to Unicode mode, I get a different
> byte sequence from the input layer. That different byte sequence would
> be stored on the disk when creating a file. (This translation is
> installed by running the "loadkeys" command.)
> - If I do the same in X11, the translation from key-codes to characters
> and back is done by the terminal (or application). See "man 1 xterm" for
> "-lc" and "-u8".
> - BUT when you want to create a specific character encoding, you MUST
> know from which encoding you start. Assuming ASCII or UTF-8 is wrong,
> you MUST check LC_ALL/LC_CTYPE/LANG by querying nl_langinfo(CODESET).
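
For illustration, querying the codeset could look like this in Python, whose locale module wraps the same setlocale() and nl_langinfo(CODESET) calls. environment_codeset is a hypothetical helper name, and locale.nl_langinfo is only available on Unix-like systems:

```python
import locale


def environment_codeset() -> str:
    """Return the codeset selected by LC_ALL/LC_CTYPE/LANG."""
    try:
        # "" means: take the locale from the environment variables.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # keep whatever locale is currently active.
        pass
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(environment_codeset())
```

On a glibc system this typically prints "UTF-8" in a UTF-8 locale; other systems and locales may report different names.
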
> So if I gave you a disk containing a file with the name "\xa1",
> depending on the locale you would see a different glyph:
> $ for ((c=1;c<=15;c++));do printf '\xa1'|recode ISO-8859-$c..dump-with-names 2>/dev/null|tail -n 1;done
> 00A1   !I    inverted exclamation mark
> 0104   A;    latin capital letter a with ogonek
> 0126   H/    latin capital letter h with stroke
> 0104   A;    latin capital letter a with ogonek
> 0401   IO    cyrillic capital letter io
> 201B   9'    single high-reversed-9 quotation mark
> 00A1   !I    inverted exclamation mark
> 0104   A;    latin capital letter a with ogonek
> 00A1   !I    inverted exclamation mark
> 00A1   !I    inverted exclamation mark
> 1E02   B.    latin capital letter b with dot above
> 00A1   !I    inverted exclamation mark
> In a UTF-8 environment you would get an error, as "\xa1" is not a valid
> UTF-8 byte sequence.
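
The same experiment can be sketched without recode. This is only an illustration: describe is a hypothetical helper, and the codec names are the ones from Python's standard codec table, whose mappings may differ in detail from recode's:

```python
import unicodedata


def describe(raw: bytes, encoding: str) -> str:
    """Decode `raw` with `encoding` and name the resulting characters."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        # The byte sequence is not valid in this encoding.
        return "undecodable"
    return " ".join(
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text
    )


if __name__ == "__main__":
    for enc in ("iso8859_1", "iso8859_2", "iso8859_5", "utf-8"):
        print(enc, "->", describe(b"\xa1", enc))
```

The single byte 0xA1 gives a different character per codec, and the UTF-8 case takes the "undecodable" branch.
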
> Read "man 7 unicode", especially the section "Unicode Under Linux" or
> "man 7 charsets".
> > I think:
> Any program basically has two options:
> 1. The program does not care about different character sets and just
> passes file-names and data in and out as byte streams. That is
> perfectly okay and most UNIX shell commands work just fine that way.
> 2. The program is encoding aware, for example because it works on
> characters instead of bytes (like "wc --bytes" vs. "wc --chars") or
> needs to perform a conversion between encodings. Then the sanest thing
> is to
> - query the encoding of the environment once (or per network connection),
> - convert any input data from that encoding into a (fixed) internal
> format like wchar_t/UTF-16/UTF-32, including file-names, file-content, etc.,
> - convert the internal data back into the right format on output, which
> also includes calling APIs like open().
> Otherwise you always have to remember whether your char[] buffer contains
> "some byte stream, which needs to be decoded before being used" or an
> "already decoded character string". That is why most libraries and
> frameworks contain wrappers for file input and output: they
> internally use one data type for characters consistently and hide all
> the explicit conversion from you.
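
A minimal sketch of option 2, assuming the codeset has already been queried from the environment. to_internal and to_external are hypothetical names, and Python's str stands in for the fixed internal format:

```python
def to_internal(raw: bytes, codeset: str) -> str:
    """Decode external bytes once, at the input boundary."""
    return raw.decode(codeset)


def to_external(text: str, codeset: str) -> bytes:
    """Encode again only when data leaves the program (open(), write(), ...)."""
    return text.encode(codeset)


if __name__ == "__main__":
    codeset = "UTF-8"  # assumption: normally the result of nl_langinfo(CODESET)
    raw = "Gr\u00fc\u00dfe.txt".encode(codeset)
    name = to_internal(raw, codeset)
    # Byte count and character count differ, as with wc --bytes vs. wc --chars.
    print(len(raw), "bytes,", len(name), "characters")
    assert to_external(name, codeset) == raw
```

Everything between the two calls works on characters only, so there is never a doubt whether a buffer has been decoded yet.
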
> > iconv_open(utf16le, ascii) in encode
> > iconv_open(ascii, utf16le) in decode
> > iconv_open(codeset, ascii) in show
> That would be correct ONLY if you store the file-name internally as
> ASCII, which would prevent you from handling file-names containing any
> character outside the ASCII codeset.
> You should use "UTF-8" instead of "ascii", as that allows you to handle
> file-names containing any valid characters.
> This would make the conversion in show() trivial when codeset="UTF-8",
> as iconv() would not have to do anything there.
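
To illustrate the suggestion, here are the three conversions with UTF-8 as the internal format. This is a sketch only: encode, decode and show are stand-ins named after the steps quoted above, not the actual vhd_util functions, and Python codecs are used in place of iconv():

```python
def encode(internal_utf8: bytes) -> bytes:
    """Internal UTF-8 -> stored UTF-16LE."""
    return internal_utf8.decode("utf-8").encode("utf-16-le")


def decode(stored_utf16le: bytes) -> bytes:
    """Stored UTF-16LE -> internal UTF-8."""
    return stored_utf16le.decode("utf-16-le").encode("utf-8")


def show(internal_utf8: bytes, codeset: str) -> bytes:
    """Internal UTF-8 -> bytes for the user's codeset.

    Characters the codeset cannot represent become '?'.
    """
    return internal_utf8.decode("utf-8").encode(codeset, errors="replace")


if __name__ == "__main__":
    name = "\u6587\u4ef6.vhd".encode("utf-8")
    assert decode(encode(name)) == name   # lossless round trip
    assert show(name, "UTF-8") == name    # nothing to convert
    print(show(name, "ascii"))            # non-ASCII characters are replaced
```

With "ascii" as the internal format, the round trip would already fail for this name; with UTF-8 it is lossless.
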
> Philipp

I get it. Linux has supported i18n for a long time. I even forgot about
GB2312 :(

I will change it to what you did in your vhd_util patch.

Last, THANKS A LOT for your time.
