On 2010-07-26 9:15 PM Aurelien Jarno wrote:
On Wed, Jul 14, 2010 at 11:53:56AM +0200, Aurelien Jarno wrote:
You have to be more specific about the problem. I don't see any
change between the glibc-based version and the eglibc-based version
besides a few more supported encodings.
glibc and eglibc don't differ on the iconv code.
I checked, and it seems I was getting confused between GNU libiconv
<http://www.gnu.org/software/libiconv/> and the glibc/eglibc
implementation of iconv.
GNU libiconv outputs the following from iconv -l, for example:
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
This makes it clear which names are equivalent. The glibc/eglibc iconv
just outputs these on separate lines. If it were possible to provide the
libiconv functionality, maybe using an additional option to iconv, that
would be helpful.
The bigger issue, however, is that glibc's iconv doesn't document what
the various encoding names mean, *anywhere*. Something like CP1149 can
be Googled and found in places like Wikipedia, but a name like "UNICODE"
is very ambiguous, and odd names like "CSUNICODE" don't return anything
very obvious in Google searches. In fact, the best description I found
was in the documentation for an entirely different library, recode
<http://www.delorie.com/gnu/docs/recode/recode_30.html>. I think
(e)glibc should do its own documentation and not rely on other sources.
The GNU libiconv is slightly better, because output from iconv -l
explains what CSUNICODE means by showing that it's the same as a
well-defined, unambiguous encoding (ISO-10646-UCS-2).
However, neither library explains byte order anywhere. I can get BE or
LE by specifying it explicitly in the encoding name, but typically I
need to get native and I don't want to have to do a runtime test for
endianness and then add it to the encoding name. How was I supposed to
know that UCS-2 means "native byte order" rather than some canonical
ordering such as big? Different iconv implementations actually differ on
this. On Mac OS X on Intel, with either the system iconv or the MacPorts
version of GNU libiconv, UCS-2 actually means big-endian:
$ echo -ne '\xe2\x80\xa2' | iconv -f utf-8 -t ucs-2 | xxd
0000000: 2022
Running the same on Linux returns:
0000000: 2220
So if it's interpreted differently by different libraries, even though
they all implement the same standard, shouldn't the behaviour on Linux
be documented somewhere?
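
For comparison, here is a minimal sketch of the explicit workaround
mentioned above: naming the byte order in the encoding (UCS-2BE or
UCS-2LE) rather than relying on what plain UCS-2 defaults to. It assumes
the installed iconv accepts both names, and it uses od instead of xxd
only because od is more widely available. The variable names are just
for illustration.

```shell
# U+2022 BULLET is e2 80 a2 in UTF-8 (written as octal escapes so that
# printf behaves the same in any POSIX shell).
# With the byte order named explicitly, the result should not depend on
# the library's default for plain "UCS-2".
be=$(printf '\342\200\242' | iconv -f UTF-8 -t UCS-2BE | od -An -tx1 | tr -d ' \n')
le=$(printf '\342\200\242' | iconv -f UTF-8 -t UCS-2LE | od -An -tx1 | tr -d ' \n')
echo "UCS-2BE: $be"
echo "UCS-2LE: $le"
```

This only sidesteps the question. It does not say which of the two a
given library picks for plain UCS-2, and that is the part that still
needs documenting.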
Any news about that?
Sorry for the delay. My email address forwards to gmail, which put both
of your messages in the spam folder :-( Normally, gmail's spam detection
is excellent so I don't bother to check it very often.
--Neil