On Thu, Jan 16, 2014 at 7:28 PM, ISHIKAWA,chiaki <ishik...@yk.rim.or.jp> wrote:
> (2014/01/16 12:22), Zack Weinberg wrote:
>>
>> On 2013-11-26 5:40 AM, Neil wrote:
>>>
>>> Henri Sivonen wrote:
>>>
>>>> On Windows, do we really need to pay homage to the pre-NT legacy when
>>>> doing Save As? How about we just use UTF-8 for "HTML Page, complete"
>>>> reserialization like on Mac?
>>>>
>>>>
>>> You'll need a BOM, of course.
>>>
>>>> (MXR turns up so little that it makes me wonder non-UTF-8 support
>>>> might have already gone away for practical purposes.)
>>>>
>>> Gecko gets a little confused when you run Linux on a non-UTF8 file
>>> system, so I'd agree with this.
>>
>>
>> FWIW some time later, I've seen both Linux and *BSD devs state that even
>> if the user's locale uses a legacy encoding, all filesystem paths should
>> still be treated as UTF-8.
>>
>
> I don't know if my discovery below which seems to agree with what Zack found
> is relevant here or not.
> But here it goes. Seeing BOM in quoted Neil's post prompts me to write this.

I think Neil referred to a BOM inside serialized HTML files--not in
file system paths.

> I noticed that 64-bit DEBUG BUILD of TB
> seems to fail a conversion of native char string vs unicode near the
> beginning of its invocation.
> Specifically, iconv() failed each time (and to my chagrin,
> it did not seem to fail on a different 32-bit Debian GNU/Linux. Not sure
> why. Maybe the default locale that I used during installation of linux may
> be different. See below.) I noticed the error in |make mozmill| test suite
> log.
>
> When I installed Debian GNU/Linux 64-bit,
> I used Japanese locale : ja_JP.UTF-8
>
> Since I refreshed source code of comm-central mid-December,
> valgrind run of TB also reports strange illegal read of 4 byte
> past allocated buffer in the code sequence that leads up to iconv() failure,
> so I inserted many printf to dump data to see what is going on.

Seems like a bug.

> I found that TB generates during its execution
> UTF-8 file path name strings WITHOUT BOM and
> still contain supposedly a valid UTF8 path name.

I'm pretty sure that file system paths on Linux are not supposed to
contain a BOM. Besides, a UTF-8 BOM is three bytes and, therefore,
doesn't explain an off-by-four error.

> This seems to confuse the code in TB itself and iconv fails.
> (And I think it is an expensive failure although  |iconv| seems to be called
> only a few times during TB invocation.)
>
> Case in point.
> Linux GUI environment (or maybe it is common linux practice, I don't know)
> has a file called
> ~/.config/user-dirs.dirs
> under one's home directory (~).
>
> And this file contains the mapping of Desktop,
> Download, and host of other symbolic names used in popular Desktop
> environment into one's L10N name that is shown
> on the desktop, it seems: at least the file itself mentions this near
> its beginning.
>
> For example, in my |user-dirs.dirs| file, I have the following.
> Raw image: not sure how it will show up on people's screen.
>
> # This file is written by xdg-user-dirs-update
> # If you want to change or add directories, just edit the line you're
> # interested in. All local changes will be retained on the next run
> # Format is XDG_xxx_DIR="$HOME/yyy", where yyy is a shell-escaped
> # homedir-relative path, or XDG_xxx_DIR="/yyy", where /yyy is an
> # absolute path. No other format is supported.
> #
> XDG_DESKTOP_DIR="$HOME/デスクトップ"
> XDG_DOWNLOAD_DIR="$HOME/ダウンロード"
> XDG_TEMPLATES_DIR="$HOME/テンプレート"
> XDG_PUBLICSHARE_DIR="$HOME/公開"
> XDG_DOCUMENTS_DIR="$HOME/ドキュメント"
> XDG_MUSIC_DIR="$HOME/音楽"
> XDG_PICTURES_DIR="$HOME/画像"
> XDG_VIDEOS_DIR="$HOME/ビデオ"
>
> Output of hexdump -C user-dirs.dirs:
> 00000000  23 20 54 68 69 73 20 66  69 6c 65 20 69 73 20 77  |# This file is
> w|
> 00000010  72 69 74 74 65 6e 20 62  79 20 78 64 67 2d 75 73  |ritten by
> xdg-us|
> 00000020  65 72 2d 64 69 72 73 2d  75 70 64 61 74 65 0a 23
> |er-dirs-update.#|
> 00000030  20 49 66 20 79 6f 75 20  77 61 6e 74 20 74 6f 20  | If you want to
> |
> 00000040  63 68 61 6e 67 65 20 6f  72 20 61 64 64 20 64 69  |change or add
> di|
> 00000050  72 65 63 74 6f 72 69 65  73 2c 20 6a 75 73 74 20  |rectories, just
> |
> 00000060  65 64 69 74 20 74 68 65  20 6c 69 6e 65 20 79 6f  |edit the line
> yo|
> 00000070  75 27 72 65 0a 23 20 69  6e 74 65 72 65 73 74 65  |u're.#
> intereste|
> 00000080  64 20 69 6e 2e 20 41 6c  6c 20 6c 6f 63 61 6c 20  |d in. All local
> |
> 00000090  63 68 61 6e 67 65 73 20  77 69 6c 6c 20 62 65 20  |changes will be
> |
> 000000a0  72 65 74 61 69 6e 65 64  20 6f 6e 20 74 68 65 20  |retained on the
> |
> 000000b0  6e 65 78 74 20 72 75 6e  0a 23 20 46 6f 72 6d 61  |next run.#
> Forma|
> 000000c0  74 20 69 73 20 58 44 47  5f 78 78 78 5f 44 49 52  |t is
> XDG_xxx_DIR|
> 000000d0  3d 22 24 48 4f 4d 45 2f  79 79 79 22 2c 20 77 68 |="$HOME/yyy",
> wh|
> 000000e0  65 72 65 20 79 79 79 20  69 73 20 61 20 73 68 65  |ere yyy is a
> she|
> 000000f0  6c 6c 2d 65 73 63 61 70  65 64 0a 23 20 68 6f 6d |ll-escaped.#
> hom|
> 00000100  65 64 69 72 2d 72 65 6c  61 74 69 76 65 20 70 61 |edir-relative
> pa|
> 00000110  74 68 2c 20 6f 72 20 58  44 47 5f 78 78 78 5f 44  |th, or
> XDG_xxx_D|
> 00000120  49 52 3d 22 2f 79 79 79  22 2c 20 77 68 65 72 65  |IR="/yyy",
> where|
> 00000130  20 2f 79 79 79 20 69 73  20 61 6e 0a 23 20 61 62  | /yyy is an.#
> ab|
> 00000140  73 6f 6c 75 74 65 20 70  61 74 68 2e 20 4e 6f 20  |solute path. No
> |
> 00000150  6f 74 68 65 72 20 66 6f  72 6d 61 74 20 69 73 20  |other format is
> |
> 00000160  73 75 70 70 6f 72 74 65  64 2e 0a 23 20 0a 58 44 |supported..#
> .XD|
> 00000170  47 5f 44 45 53 4b 54 4f  50 5f 44 49 52 3d 22 24
> |G_DESKTOP_DIR="$|
> 00000180  48 4f 4d 45 2f e3 83 87  e3 82 b9 e3 82 af e3 83
> |HOME/...........|
> 00000190  88 e3 83 83 e3 83 97 22  0a 58 44 47 5f 44 4f 57
> |.......".XDG_DOW|
> 000001a0  4e 4c 4f 41 44 5f 44 49  52 3d 22 24 48 4f 4d 45
> |NLOAD_DIR="$HOME|
> 000001b0  2f e3 83 80 e3 82 a6 e3  83 b3 e3 83 ad e3 83 bc
> |/...............|
> 000001c0  e3 83 89 22 0a 58 44 47  5f 54 45 4d 50 4c 41 54
> |...".XDG_TEMPLAT|
> 000001d0  45 53 5f 44 49 52 3d 22  24 48 4f 4d 45 2f e3 83
> |ES_DIR="$HOME/..|
> 000001e0  86 e3 83 b3 e3 83 97 e3  83 ac e3 83 bc e3 83 88
> |................|
> 000001f0  22 0a 58 44 47 5f 50 55  42 4c 49 43 53 48 41 52
> |".XDG_PUBLICSHAR|
> 00000200  45 5f 44 49 52 3d 22 24  48 4f 4d 45 2f e5 85 ac
> |E_DIR="$HOME/...|
> 00000210  e9 96 8b 22 0a 58 44 47  5f 44 4f 43 55 4d 45 4e
> |...".XDG_DOCUMEN|
> 00000220  54 53 5f 44 49 52 3d 22  24 48 4f 4d 45 2f e3 83
> |TS_DIR="$HOME/..|
> 00000230  89 e3 82 ad e3 83 a5 e3  83 a1 e3 83 b3 e3 83 88
> |................|
> 00000240  22 0a 58 44 47 5f 4d 55  53 49 43 5f 44 49 52 3d
> |".XDG_MUSIC_DIR=|
> 00000250  22 24 48 4f 4d 45 2f e9  9f b3 e6 a5 bd 22 0a 58
> |"$HOME/......".X|
> 00000260  44 47 5f 50 49 43 54 55  52 45 53 5f 44 49 52 3d
> |DG_PICTURES_DIR=|
> 00000270  22 24 48 4f 4d 45 2f e7  94 bb e5 83 8f 22 0a 58
> |"$HOME/......".X|
> 00000280  44 47 5f 56 49 44 45 4f  53 5f 44 49 52 3d 22 24
> |DG_VIDEOS_DIR="$|
> 00000290  48 4f 4d 45 2f e3 83 93  e3 83 87 e3 82 aa 22 0a
> |HOME/.........".|
> 000002a0
>
>
> I am not sure what the byte '0xe3' is: this precedes each Japanese
> character.

Each Japanese character is 3 bytes in UTF-8. Since the code points of
the Japanese characters are close together, they have the same most
significant bits and, therefore, they end up with the same UTF-8 lead
byte.

> |file user-dirs.dirs| recognizes this file as utf-8 and prints
> user-dirs.dirs: UTF-8 Unicode text
>
> In mozilla code, the following function parses the |.config/user-dirs.dir|
> file and produces the really used path name on user desktop for |DESKTOP|,
> etc. $HOME in the path string in user-dirs.dirs is expanded during the
> process. This would produce strings for localized names (but without BOM).
>
> static char *
> xdg_user_dir_lookup (const char *type)
> in mozilla/xpcom/io/SpecialSystemDirectory.cpp

That C function is not exactly in Mozilla's style...

> xdg_user_dir_lookup() is called and the produced string is
> then passed to  NS_NewNativeLocalFile() inside the following function:
> static nsresult
> GetUnixXDGUserDirectory()
> in mozilla/xpcom/io/SpecialSystemDirectory.cpp
>
> The strings produced by xdg_user_dir_lookup() is passed to
> NS_NewNativeLocalFile() in xpcom/io/nsNativeCharsetUtils.cpp
> and then it is converted to the so called NativeCharset for file name
> which, I think, is UTF-8 under linux's ext4 file system.

Yeah, "native" should be UTF-8 in this case.

> But due to the way the strings are represented in
> ~/.config/user-dirs.dirs without BOM, I think
> iconv() from libc fails to produce a valid conversion, and
> reports failure.
>
> For example, this is a log dump from my modified TB:
>
> GetSpecialSystemDirectory: aSystemSystemDirectory=304
> DEBUG:xdg_user_dir_lookup: type = DESKTOP <--- looking for
>                            what is the file name for |DESKTOP|
> DEBUG: relative = 1
> DEBUG: GetUnixXDGUserDirectory: dir = /home/ishikawa/デスクトップ
>
>        xdg_user_dir_lookup generated the above string, but
>        without BOM. See the dump below.
>
> DEBUG: dir[00] = 0x2f  <--- NO BOM
> DEBUG: dir[01] = 0x68
> DEBUG: dir[02] = 0x6f
> DEBUG: dir[03] = 0x6d
> DEBUG: dir[04] = 0x65
> DEBUG: dir[05] = 0x2f
> DEBUG: dir[06] = 0x69
> DEBUG: dir[07] = 0x73
> DEBUG: dir[08] = 0x68
> DEBUG: dir[09] = 0x69
> DEBUG: dir[10] = 0x6b
> DEBUG: dir[11] = 0x61
> DEBUG: dir[12] = 0x77
> DEBUG: dir[13] = 0x61
> DEBUG: dir[14] = 0x2f
> DEBUG: dir[15] = 0xe3 <-- mysterious 0xe3
> DEBUG: dir[16] = 0x83
> DEBUG: dir[17] = 0x87
> DEBUG: dir[18] = 0xe3
> DEBUG: dir[19] = 0x82
> DEBUG: dir[20] = 0xb9
> DEBUG: dir[21] = 0xe3
> DEBUG: dir[22] = 0x82
> DEBUG: dir[23] = 0xaf
> DEBUG: dir[24] = 0xe3
> DEBUG: dir[25] = 0x83
> DEBUG: dir[26] = 0x88
> DEBUG: dir[27] = 0xe3
> DEBUG: dir[28] = 0x83
> DEBUG: dir[29] = 0x83
> DEBUG: dir[30] = 0xe3
> DEBUG: dir[31] = 0x83
> DEBUG: dir[32] = 0x97
> ERROR: errno=84
> (This is error EILSEQ (Illegal byte sequence) on my linux )

Weird. The above is a legitimate UTF-8 byte sequence.

> [29655] WARNING: conversion from native to utf-16(?) failed: file
> /REF-COMM-CENTRAL/comm-central/mozilla/xpcom/io/nsNativeCharsetUtils.cpp,
> line 470
> (CI commet: the above reference to 'utf-16' may be bogus. I think I put this
> message myself, and maybe this ought to be 'utf-8' instead.)

No, UTF-16 is right.

> ERROR: res = 0xffffffffffffffff, outLeft=36, inLeft=18
> [29655] ###!!! ASSERTION: iconv failed: 'Error', file
> /REF-COMM-CENTRAL/comm-central/mozilla/xpcom/io/nsNativeCharsetUtils.cpp,
> line 545
>
> I think the naive assumption about inLeft*2 == outLeft may not be valid
> here, also. (I think the valid output length is smaller than inLeft * 2 due
> the mysterious 0xe3 octets.)

Where does it assume that? AFAICT, the multiplications by two don't
mix input and output but are there to count UTF-16 lengths in bytes,
because the inconv API uses byte buffers even for UTF-16.

> I think TB eventually figures out the file pathname by sheer
> heuristics: see the complex processing in nsNativeCharsetUtils.cpp.
>
> Anyway, some inconsistent processing language coding choice, and
> the particular internal format used for supposedly UTF-8 file path name, has
> made the code in nsNAtiveCharsetUtils.cpp very complex and still iconv()
> fails.

All this use of iconv is sad, yes. I wouldn't be opposed to dropping
the iconv code paths and using the OS X / Android code (that assumes
that operating system's file system APIs always take UTF-8) for all
*nix platforms.

I wonder if before this code runs
iconv_t nsNativeCharsetConverter::gNativeToUnicode
really gets initialized to a value that means conversion from UTF-8 to
machine-endian UTF-16 and
iconv_t nsNativeCharsetConverter::gUnicodeToNative
really gets initialized to a value that means machine-endian UTF-16 to UTF-8.

We have code that switches the process-wide C library locale back and
forth (terrible idea!), so one angle to investigate is if some part of
Gecko overwrites the process locale before
nsNativeCharsetConverter::LazyInit() that initializes these iconv_t
globals has a chance to see that the system locale is an UTF-8 locale.

You could try changing the line
const char  *native_charset = nl_langinfo(CODESET);
to
const char  *native_charset = "UTF-8";
in
nsNativeCharsetConverter::LazyInit().

If this fixes the problem, the cause is probably another part of the
code base overwriting what nl_langinfo(CODESET) needs to read before
nl_langinfo(CODESET) gets its chance to run.

> Maybe the old code donated by RedHat, that is
> xdg_user_dir_lookup, may need fixing.
> This function also seems to generates a buffer area one byte too short
> during run-time due to the mysterious 0x3e. I am still not sure whether
> valgrind warning is
> for real or false-positive.
>
> The invalid 4 byte read reported by valgrind seems to be a
> false-positive: code produced by the latest GCC for strlen seems to
> use WORD load to analyze a sequence of bytes for NUL termination and
> valgrind warns when the end of WORD is after the end of the previously
> malloced area. libc authors and gcc developers seem to claim that
> the WORD read is used only when the last portion of word falls between
> a 16 byte region or something (and malloc usually returns an area of a size
> rounded up to 16 and thus such word read is safe.)

Sounds like a false positive, then.

> Right, this post may NOT be directly related to non-UTF-8 support, but
> the some issues for native <==> UTF-8 conversion handling which I found over
> the last few months need mentioning and recording for future maintainer of
> TB. Although file path name is supposed to be UTF-8 these days across
> platforms, some
> support infrastructure may not pass around quite valid UTF-8 strings among
> themselves although the processing may end up correctly after heuristics.
> iconv() failure was quite annoying in |make mozmill| test suite log, and is
> still there, but fixing it does not seem to be easy at all.

Does |make mozmill| pass if you change
#if defined(XP_MACOSX) || defined(ANDROID)
at the top of nsNativeCharsetUtils.cpp to say
#if defined(XP_MACOSX) || defined(ANDROID) || defined(XP_UNIX)
?

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform

Reply via email to