Re: Support for non-UTF-8 platform charset

ISHIKAWA,chiaki Fri, 17 Jan 2014 05:29:57 -0800

(Sorry for top-posting.)

Dear Henri Sivonen,


Thank you for lucid explanation.

I am trying to understand the details and
am pondering inserting a few more dumps to
figure out the answers to your newly raised questions.

I will report my finding RSN.

Thank you again. Your comment clears up a lot of questions I had.

PS: One thing is for sure. I checked the 32-bit version log. There was

no mention of iconv() failure. So the failure of iconv() is a littlemysterious still.So I am not ruling out the 64-bit library problem under DebianGNU/Linux. (Have to find out a valid test suite for this.)


(2014/01/17 18:39), Henri Sivonen wrote:

On Thu, Jan 16, 2014 at 7:28 PM, ISHIKAWA,chiaki <ishik...@yk.rim.or.jp> wrote:

(2014/01/16 12:22), Zack Weinberg wrote:


On 2013-11-26 5:40 AM, Neil wrote:


Henri Sivonen wrote:

On Windows, do we really need to pay homage to the pre-NT legacy when
doing Save As? How about we just use UTF-8 for "HTML Page, complete"
reserialization like on Mac?

You'll need a BOM, of course.

(MXR turns up so little that it makes me wonder non-UTF-8 support
might have already gone away for practical purposes.)

Gecko gets a little confused when you run Linux on a non-UTF8 file
system, so I'd agree with this.



FWIW some time later, I've seen both Linux and *BSD devs state that even
if the user's locale uses a legacy encoding, all filesystem paths should
still be treated as UTF-8.


I don't know if my discovery below which seems to agree with what Zack found
is relevant here or not.
But here it goes. Seeing BOM in quoted Neil's post prompts me to write this.


I think Neil referred to a BOM inside serialized HTML files--not in
file system paths.

I noticed that 64-bit DEBUG BUILD of TB
seems to fail a conversion of native char string vs unicode near the
beginning of its invocation.
Specifically, iconv() failed each time (and to my chagrin,
it did not seem to fail on a different 32-bit Debian GNU/Linux. Not sure
why. Maybe the default locale that I used during installation of linux may
be different. See below.) I noticed the error in |make mozmill| test suite
log.

When I installed Debian GNU/Linux 64-bit,
I used Japanese locale : ja_JP.UTF-8

Since I refreshed source code of comm-central mid-December,
valgrind run of TB also reports strange illegal read of 4 byte
past allocated buffer in the code sequence that leads up to iconv() failure,
so I inserted many printf to dump data to see what is going on.


Seems like a bug.

I found that TB generates during its execution
UTF-8 file path name strings WITHOUT BOM and
still contain supposedly a valid UTF8 path name.


I'm pretty sure that file system paths on Linux are not supposed to
contain a BOM. Besides, a UTF-8 BOM is three bytes and, therefore,
doesn't explain an off-by-four error.

This seems to confuse the code in TB itself and iconv fails.
(And I think it is an expensive failure although  |iconv| seems to be called
only a few times during TB invocation.)

Case in point.
Linux GUI environment (or maybe it is common linux practice, I don't know)
has a file called
~/.config/user-dirs.dirs
under one's home directory (~).

And this file contains the mapping of Desktop,
Download, and host of other symbolic names used in popular Desktop
environment into one's L10N name that is shown
on the desktop, it seems: at least the file itself mentions this near
its beginning.

For example, in my |user-dirs.dirs| file, I have the following.
Raw image: not sure how it will show up on people's screen.

# This file is written by xdg-user-dirs-update
# If you want to change or add directories, just edit the line you're
# interested in. All local changes will be retained on the next run
# Format is XDG_xxx_DIR="$HOME/yyy", where yyy is a shell-escaped
# homedir-relative path, or XDG_xxx_DIR="/yyy", where /yyy is an
# absolute path. No other format is supported.
#
XDG_DESKTOP_DIR="$HOME/デスクトップ"
XDG_DOWNLOAD_DIR="$HOME/ダウンロード"
XDG_TEMPLATES_DIR="$HOME/テンプレート"
XDG_PUBLICSHARE_DIR="$HOME/公開"
XDG_DOCUMENTS_DIR="$HOME/ドキュメント"
XDG_MUSIC_DIR="$HOME/音楽"
XDG_PICTURES_DIR="$HOME/画像"
XDG_VIDEOS_DIR="$HOME/ビデオ"

Output of hexdump -C user-dirs.dirs:
00000000  23 20 54 68 69 73 20 66  69 6c 65 20 69 73 20 77  |# This file is
w|
00000010  72 69 74 74 65 6e 20 62  79 20 78 64 67 2d 75 73  |ritten by
xdg-us|
00000020  65 72 2d 64 69 72 73 2d  75 70 64 61 74 65 0a 23
|er-dirs-update.#|
00000030  20 49 66 20 79 6f 75 20  77 61 6e 74 20 74 6f 20  | If you want to
|
00000040  63 68 61 6e 67 65 20 6f  72 20 61 64 64 20 64 69  |change or add
di|
00000050  72 65 63 74 6f 72 69 65  73 2c 20 6a 75 73 74 20  |rectories, just
|
00000060  65 64 69 74 20 74 68 65  20 6c 69 6e 65 20 79 6f  |edit the line
yo|
00000070  75 27 72 65 0a 23 20 69  6e 74 65 72 65 73 74 65  |u're.#
intereste|
00000080  64 20 69 6e 2e 20 41 6c  6c 20 6c 6f 63 61 6c 20  |d in. All local
|
00000090  63 68 61 6e 67 65 73 20  77 69 6c 6c 20 62 65 20  |changes will be
|
000000a0  72 65 74 61 69 6e 65 64  20 6f 6e 20 74 68 65 20  |retained on the
|
000000b0  6e 65 78 74 20 72 75 6e  0a 23 20 46 6f 72 6d 61  |next run.#
Forma|
000000c0  74 20 69 73 20 58 44 47  5f 78 78 78 5f 44 49 52  |t is
XDG_xxx_DIR|
000000d0  3d 22 24 48 4f 4d 45 2f  79 79 79 22 2c 20 77 68 |="$HOME/yyy",
wh|
000000e0  65 72 65 20 79 79 79 20  69 73 20 61 20 73 68 65  |ere yyy is a
she|
000000f0  6c 6c 2d 65 73 63 61 70  65 64 0a 23 20 68 6f 6d |ll-escaped.#
hom|
00000100  65 64 69 72 2d 72 65 6c  61 74 69 76 65 20 70 61 |edir-relative
pa|
00000110  74 68 2c 20 6f 72 20 58  44 47 5f 78 78 78 5f 44  |th, or
XDG_xxx_D|
00000120  49 52 3d 22 2f 79 79 79  22 2c 20 77 68 65 72 65  |IR="/yyy",
where|
00000130  20 2f 79 79 79 20 69 73  20 61 6e 0a 23 20 61 62  | /yyy is an.#
ab|
00000140  73 6f 6c 75 74 65 20 70  61 74 68 2e 20 4e 6f 20  |solute path. No
|
00000150  6f 74 68 65 72 20 66 6f  72 6d 61 74 20 69 73 20  |other format is
|
00000160  73 75 70 70 6f 72 74 65  64 2e 0a 23 20 0a 58 44 |supported..#
.XD|
00000170  47 5f 44 45 53 4b 54 4f  50 5f 44 49 52 3d 22 24
|G_DESKTOP_DIR="$|
00000180  48 4f 4d 45 2f e3 83 87  e3 82 b9 e3 82 af e3 83
|HOME/...........|
00000190  88 e3 83 83 e3 83 97 22  0a 58 44 47 5f 44 4f 57
|.......".XDG_DOW|
000001a0  4e 4c 4f 41 44 5f 44 49  52 3d 22 24 48 4f 4d 45
|NLOAD_DIR="$HOME|
000001b0  2f e3 83 80 e3 82 a6 e3  83 b3 e3 83 ad e3 83 bc
|/...............|
000001c0  e3 83 89 22 0a 58 44 47  5f 54 45 4d 50 4c 41 54
|...".XDG_TEMPLAT|
000001d0  45 53 5f 44 49 52 3d 22  24 48 4f 4d 45 2f e3 83
|ES_DIR="$HOME/..|
000001e0  86 e3 83 b3 e3 83 97 e3  83 ac e3 83 bc e3 83 88
|................|
000001f0  22 0a 58 44 47 5f 50 55  42 4c 49 43 53 48 41 52
|".XDG_PUBLICSHAR|
00000200  45 5f 44 49 52 3d 22 24  48 4f 4d 45 2f e5 85 ac
|E_DIR="$HOME/...|
00000210  e9 96 8b 22 0a 58 44 47  5f 44 4f 43 55 4d 45 4e
|...".XDG_DOCUMEN|
00000220  54 53 5f 44 49 52 3d 22  24 48 4f 4d 45 2f e3 83
|TS_DIR="$HOME/..|
00000230  89 e3 82 ad e3 83 a5 e3  83 a1 e3 83 b3 e3 83 88
|................|
00000240  22 0a 58 44 47 5f 4d 55  53 49 43 5f 44 49 52 3d
|".XDG_MUSIC_DIR=|
00000250  22 24 48 4f 4d 45 2f e9  9f b3 e6 a5 bd 22 0a 58
|"$HOME/......".X|
00000260  44 47 5f 50 49 43 54 55  52 45 53 5f 44 49 52 3d
|DG_PICTURES_DIR=|
00000270  22 24 48 4f 4d 45 2f e7  94 bb e5 83 8f 22 0a 58
|"$HOME/......".X|
00000280  44 47 5f 56 49 44 45 4f  53 5f 44 49 52 3d 22 24
|DG_VIDEOS_DIR="$|
00000290  48 4f 4d 45 2f e3 83 93  e3 83 87 e3 82 aa 22 0a
|HOME/.........".|
000002a0


I am not sure what the byte '0xe3' is: this precedes each Japanese
character.


Each Japanese character is 3 bytes in UTF-8. Since the code points of
the Japanese characters are close together, they have the same most
significant bits and, therefore, they end up with the same UTF-8 lead
byte.

|file user-dirs.dirs| recognizes this file as utf-8 and prints
user-dirs.dirs: UTF-8 Unicode text

In mozilla code, the following function parses the |.config/user-dirs.dir|
file and produces the really used path name on user desktop for |DESKTOP|,
etc. $HOME in the path string in user-dirs.dirs is expanded during the
process. This would produce strings for localized names (but without BOM).

static char *
xdg_user_dir_lookup (const char *type)
in mozilla/xpcom/io/SpecialSystemDirectory.cpp


That C function is not exactly in Mozilla's style...

xdg_user_dir_lookup() is called and the produced string is
then passed to  NS_NewNativeLocalFile() inside the following function:
static nsresult
GetUnixXDGUserDirectory()
in mozilla/xpcom/io/SpecialSystemDirectory.cpp

The strings produced by xdg_user_dir_lookup() is passed to
NS_NewNativeLocalFile() in xpcom/io/nsNativeCharsetUtils.cpp
and then it is converted to the so called NativeCharset for file name
which, I think, is UTF-8 under linux's ext4 file system.


Yeah, "native" should be UTF-8 in this case.

But due to the way the strings are represented in
~/.config/user-dirs.dirs without BOM, I think
iconv() from libc fails to produce a valid conversion, and
reports failure.

For example, this is a log dump from my modified TB:

GetSpecialSystemDirectory: aSystemSystemDirectory=304
DEBUG:xdg_user_dir_lookup: type = DESKTOP <--- looking for
                            what is the file name for |DESKTOP|
DEBUG: relative = 1
DEBUG: GetUnixXDGUserDirectory: dir = /home/ishikawa/デスクトップ

        xdg_user_dir_lookup generated the above string, but
        without BOM. See the dump below.

DEBUG: dir[00] = 0x2f  <--- NO BOM
DEBUG: dir[01] = 0x68
DEBUG: dir[02] = 0x6f
DEBUG: dir[03] = 0x6d
DEBUG: dir[04] = 0x65
DEBUG: dir[05] = 0x2f
DEBUG: dir[06] = 0x69
DEBUG: dir[07] = 0x73
DEBUG: dir[08] = 0x68
DEBUG: dir[09] = 0x69
DEBUG: dir[10] = 0x6b
DEBUG: dir[11] = 0x61
DEBUG: dir[12] = 0x77
DEBUG: dir[13] = 0x61
DEBUG: dir[14] = 0x2f
DEBUG: dir[15] = 0xe3 <-- mysterious 0xe3
DEBUG: dir[16] = 0x83
DEBUG: dir[17] = 0x87
DEBUG: dir[18] = 0xe3
DEBUG: dir[19] = 0x82
DEBUG: dir[20] = 0xb9
DEBUG: dir[21] = 0xe3
DEBUG: dir[22] = 0x82
DEBUG: dir[23] = 0xaf
DEBUG: dir[24] = 0xe3
DEBUG: dir[25] = 0x83
DEBUG: dir[26] = 0x88
DEBUG: dir[27] = 0xe3
DEBUG: dir[28] = 0x83
DEBUG: dir[29] = 0x83
DEBUG: dir[30] = 0xe3
DEBUG: dir[31] = 0x83
DEBUG: dir[32] = 0x97
ERROR: errno=84
(This is error EILSEQ (Illegal byte sequence) on my linux )


Weird. The above is a legitimate UTF-8 byte sequence.

[29655] WARNING: conversion from native to utf-16(?) failed: file
/REF-COMM-CENTRAL/comm-central/mozilla/xpcom/io/nsNativeCharsetUtils.cpp,
line 470
(CI commet: the above reference to 'utf-16' may be bogus. I think I put this
message myself, and maybe this ought to be 'utf-8' instead.)


No, UTF-16 is right.

ERROR: res = 0xffffffffffffffff, outLeft=36, inLeft=18
[29655] ###!!! ASSERTION: iconv failed: 'Error', file
/REF-COMM-CENTRAL/comm-central/mozilla/xpcom/io/nsNativeCharsetUtils.cpp,
line 545

I think the naive assumption about inLeft*2 == outLeft may not be valid
here, also. (I think the valid output length is smaller than inLeft * 2 due
the mysterious 0xe3 octets.)


Where does it assume that? AFAICT, the multiplications by two don't
mix input and output but are there to count UTF-16 lengths in bytes,
because the inconv API uses byte buffers even for UTF-16.

I think TB eventually figures out the file pathname by sheer
heuristics: see the complex processing in nsNativeCharsetUtils.cpp.

Anyway, some inconsistent processing language coding choice, and
the particular internal format used for supposedly UTF-8 file path name, has
made the code in nsNAtiveCharsetUtils.cpp very complex and still iconv()
fails.


All this use of iconv is sad, yes. I wouldn't be opposed to dropping
the iconv code paths and using the OS X / Android code (that assumes
that operating system's file system APIs always take UTF-8) for all
*nix platforms.

I wonder if before this code runs
iconv_t nsNativeCharsetConverter::gNativeToUnicode
really gets initialized to a value that means conversion from UTF-8 to
machine-endian UTF-16 and
iconv_t nsNativeCharsetConverter::gUnicodeToNative
really gets initialized to a value that means machine-endian UTF-16 to UTF-8.

We have code that switches the process-wide C library locale back and
forth (terrible idea!), so one angle to investigate is if some part of
Gecko overwrites the process locale before
nsNativeCharsetConverter::LazyInit() that initializes these iconv_t
globals has a chance to see that the system locale is an UTF-8 locale.

You could try changing the line
const char  *native_charset = nl_langinfo(CODESET);
to
const char  *native_charset = "UTF-8";
in
nsNativeCharsetConverter::LazyInit().

If this fixes the problem, the cause is probably another part of the
code base overwriting what nl_langinfo(CODESET) needs to read before
nl_langinfo(CODESET) gets its chance to run.

Maybe the old code donated by RedHat, that is
xdg_user_dir_lookup, may need fixing.
This function also seems to generates a buffer area one byte too short
during run-time due to the mysterious 0x3e. I am still not sure whether
valgrind warning is
for real or false-positive.

The invalid 4 byte read reported by valgrind seems to be a
false-positive: code produced by the latest GCC for strlen seems to
use WORD load to analyze a sequence of bytes for NUL termination and
valgrind warns when the end of WORD is after the end of the previously
malloced area. libc authors and gcc developers seem to claim that
the WORD read is used only when the last portion of word falls between
a 16 byte region or something (and malloc usually returns an area of a size
rounded up to 16 and thus such word read is safe.)


Sounds like a false positive, then.

Right, this post may NOT be directly related to non-UTF-8 support, but
the some issues for native <==> UTF-8 conversion handling which I found over
the last few months need mentioning and recording for future maintainer of
TB. Although file path name is supposed to be UTF-8 these days across
platforms, some
support infrastructure may not pass around quite valid UTF-8 strings among
themselves although the processing may end up correctly after heuristics.
iconv() failure was quite annoying in |make mozmill| test suite log, and is
still there, but fixing it does not seem to be easy at all.


Does |make mozmill| pass if you change
#if defined(XP_MACOSX) || defined(ANDROID)
at the top of nsNativeCharsetUtils.cpp to say
#if defined(XP_MACOSX) || defined(ANDROID) || defined(XP_UNIX)
?




_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform

Re: Support for non-UTF-8 platform charset

Reply via email to