On Thu, Jan 16, 2014 at 7:28 PM, ISHIKAWA,chiaki <ishik...@yk.rim.or.jp> wrote: > (2014/01/16 12:22), Zack Weinberg wrote: >> >> On 2013-11-26 5:40 AM, Neil wrote: >>> >>> Henri Sivonen wrote: >>> >>>> On Windows, do we really need to pay homage to the pre-NT legacy when >>>> doing Save As? How about we just use UTF-8 for "HTML Page, complete" >>>> reserialization like on Mac? >>>> >>>> >>> You'll need a BOM, of course. >>> >>>> (MXR turns up so little that it makes me wonder non-UTF-8 support >>>> might have already gone away for practical purposes.) >>>> >>> Gecko gets a little confused when you run Linux on a non-UTF8 file >>> system, so I'd agree with this. >> >> >> FWIW some time later, I've seen both Linux and *BSD devs state that even >> if the user's locale uses a legacy encoding, all filesystem paths should >> still be treated as UTF-8. >> > > I don't know if my discovery below which seems to agree with what Zack found > is relevant here or not. > But here it goes. Seeing BOM in quoted Neil's post prompts me to write this.
I think Neil referred to a BOM inside serialized HTML files--not in file system paths. > I noticed that 64-bit DEBUG BUILD of TB > seems to fail a conversion of native char string vs unicode near the > beginning of its invocation. > Specifically, iconv() failed each time (and to my chagrin, > it did not seem to fail on a different 32-bit Debian GNU/Linux. Not sure > why. Maybe the default locale that I used during installation of linux may > be different. See below.) I noticed the error in |make mozmill| test suite > log. > > When I installed Debian GNU/Linux 64-bit, > I used Japanese locale : ja_JP.UTF-8 > > Since I refreshed source code of comm-central mid-December, > valgrind run of TB also reports strange illegal read of 4 byte > past allocated buffer in the code sequence that leads up to iconv() failure, > so I inserted many printf to dump data to see what is going on. Seems like a bug. > I found that TB generates during its execution > UTF-8 file path name strings WITHOUT BOM and > still contain supposedly a valid UTF8 path name. I'm pretty sure that file system paths on Linux are not supposed to contain a BOM. Besides, a UTF-8 BOM is three bytes and, therefore, doesn't explain an off-by-four error. > This seems to confuse the code in TB itself and iconv fails. > (And I think it is an expensive failure although |iconv| seems to be called > only a few times during TB invocation.) > > Case in point. > Linux GUI environment (or maybe it is common linux practice, I don't know) > has a file called > ~/.config/user-dirs.dirs > under one's home directory (~). > > And this file contains the mapping of Desktop, > Download, and host of other symbolic names used in popular Desktop > environment into one's L10N name that is shown > on the desktop, it seems: at least the file itself mentions this near > its beginning. > > For example, in my |user-dirs.dirs| file, I have the following. > Raw image: not sure how it will show up on people's screen. > > # This file is written by xdg-user-dirs-update > # If you want to change or add directories, just edit the line you're > # interested in. All local changes will be retained on the next run > # Format is XDG_xxx_DIR="$HOME/yyy", where yyy is a shell-escaped > # homedir-relative path, or XDG_xxx_DIR="/yyy", where /yyy is an > # absolute path. No other format is supported. > # > XDG_DESKTOP_DIR="$HOME/デスクトップ" > XDG_DOWNLOAD_DIR="$HOME/ダウンロード" > XDG_TEMPLATES_DIR="$HOME/テンプレート" > XDG_PUBLICSHARE_DIR="$HOME/公開" > XDG_DOCUMENTS_DIR="$HOME/ドキュメント" > XDG_MUSIC_DIR="$HOME/音楽" > XDG_PICTURES_DIR="$HOME/画像" > XDG_VIDEOS_DIR="$HOME/ビデオ" > > Output of hexdump -C user-dirs.dirs: > 00000000 23 20 54 68 69 73 20 66 69 6c 65 20 69 73 20 77 |# This file is > w| > 00000010 72 69 74 74 65 6e 20 62 79 20 78 64 67 2d 75 73 |ritten by > xdg-us| > 00000020 65 72 2d 64 69 72 73 2d 75 70 64 61 74 65 0a 23 > |er-dirs-update.#| > 00000030 20 49 66 20 79 6f 75 20 77 61 6e 74 20 74 6f 20 | If you want to > | > 00000040 63 68 61 6e 67 65 20 6f 72 20 61 64 64 20 64 69 |change or add > di| > 00000050 72 65 63 74 6f 72 69 65 73 2c 20 6a 75 73 74 20 |rectories, just > | > 00000060 65 64 69 74 20 74 68 65 20 6c 69 6e 65 20 79 6f |edit the line > yo| > 00000070 75 27 72 65 0a 23 20 69 6e 74 65 72 65 73 74 65 |u're.# > intereste| > 00000080 64 20 69 6e 2e 20 41 6c 6c 20 6c 6f 63 61 6c 20 |d in. All local > | > 00000090 63 68 61 6e 67 65 73 20 77 69 6c 6c 20 62 65 20 |changes will be > | > 000000a0 72 65 74 61 69 6e 65 64 20 6f 6e 20 74 68 65 20 |retained on the > | > 000000b0 6e 65 78 74 20 72 75 6e 0a 23 20 46 6f 72 6d 61 |next run.# > Forma| > 000000c0 74 20 69 73 20 58 44 47 5f 78 78 78 5f 44 49 52 |t is > XDG_xxx_DIR| > 000000d0 3d 22 24 48 4f 4d 45 2f 79 79 79 22 2c 20 77 68 |="$HOME/yyy", > wh| > 000000e0 65 72 65 20 79 79 79 20 69 73 20 61 20 73 68 65 |ere yyy is a > she| > 000000f0 6c 6c 2d 65 73 63 61 70 65 64 0a 23 20 68 6f 6d |ll-escaped.# > hom| > 00000100 65 64 69 72 2d 72 65 6c 61 74 69 76 65 20 70 61 |edir-relative > pa| > 00000110 74 68 2c 20 6f 72 20 58 44 47 5f 78 78 78 5f 44 |th, or > XDG_xxx_D| > 00000120 49 52 3d 22 2f 79 79 79 22 2c 20 77 68 65 72 65 |IR="/yyy", > where| > 00000130 20 2f 79 79 79 20 69 73 20 61 6e 0a 23 20 61 62 | /yyy is an.# > ab| > 00000140 73 6f 6c 75 74 65 20 70 61 74 68 2e 20 4e 6f 20 |solute path. No > | > 00000150 6f 74 68 65 72 20 66 6f 72 6d 61 74 20 69 73 20 |other format is > | > 00000160 73 75 70 70 6f 72 74 65 64 2e 0a 23 20 0a 58 44 |supported..# > .XD| > 00000170 47 5f 44 45 53 4b 54 4f 50 5f 44 49 52 3d 22 24 > |G_DESKTOP_DIR="$| > 00000180 48 4f 4d 45 2f e3 83 87 e3 82 b9 e3 82 af e3 83 > |HOME/...........| > 00000190 88 e3 83 83 e3 83 97 22 0a 58 44 47 5f 44 4f 57 > |.......".XDG_DOW| > 000001a0 4e 4c 4f 41 44 5f 44 49 52 3d 22 24 48 4f 4d 45 > |NLOAD_DIR="$HOME| > 000001b0 2f e3 83 80 e3 82 a6 e3 83 b3 e3 83 ad e3 83 bc > |/...............| > 000001c0 e3 83 89 22 0a 58 44 47 5f 54 45 4d 50 4c 41 54 > |...".XDG_TEMPLAT| > 000001d0 45 53 5f 44 49 52 3d 22 24 48 4f 4d 45 2f e3 83 > |ES_DIR="$HOME/..| > 000001e0 86 e3 83 b3 e3 83 97 e3 83 ac e3 83 bc e3 83 88 > |................| > 000001f0 22 0a 58 44 47 5f 50 55 42 4c 49 43 53 48 41 52 > |".XDG_PUBLICSHAR| > 00000200 45 5f 44 49 52 3d 22 24 48 4f 4d 45 2f e5 85 ac > |E_DIR="$HOME/...| > 00000210 e9 96 8b 22 0a 58 44 47 5f 44 4f 43 55 4d 45 4e > |...".XDG_DOCUMEN| > 00000220 54 53 5f 44 49 52 3d 22 24 48 4f 4d 45 2f e3 83 > |TS_DIR="$HOME/..| > 00000230 89 e3 82 ad e3 83 a5 e3 83 a1 e3 83 b3 e3 83 88 > |................| > 00000240 22 0a 58 44 47 5f 4d 55 53 49 43 5f 44 49 52 3d > |".XDG_MUSIC_DIR=| > 00000250 22 24 48 4f 4d 45 2f e9 9f b3 e6 a5 bd 22 0a 58 > |"$HOME/......".X| > 00000260 44 47 5f 50 49 43 54 55 52 45 53 5f 44 49 52 3d > |DG_PICTURES_DIR=| > 00000270 22 24 48 4f 4d 45 2f e7 94 bb e5 83 8f 22 0a 58 > |"$HOME/......".X| > 00000280 44 47 5f 56 49 44 45 4f 53 5f 44 49 52 3d 22 24 > |DG_VIDEOS_DIR="$| > 00000290 48 4f 4d 45 2f e3 83 93 e3 83 87 e3 82 aa 22 0a > |HOME/.........".| > 000002a0 > > > I am not sure what the byte '0xe3' is: this precedes each Japanese > character. Each Japanese character is 3 bytes in UTF-8. Since the code points of the Japanese characters are close together, they have the same most significant bits and, therefore, they end up with the same UTF-8 lead byte. > |file user-dirs.dirs| recognizes this file as utf-8 and prints > user-dirs.dirs: UTF-8 Unicode text > > In mozilla code, the following function parses the |.config/user-dirs.dir| > file and produces the really used path name on user desktop for |DESKTOP|, > etc. $HOME in the path string in user-dirs.dirs is expanded during the > process. This would produce strings for localized names (but without BOM). > > static char * > xdg_user_dir_lookup (const char *type) > in mozilla/xpcom/io/SpecialSystemDirectory.cpp That C function is not exactly in Mozilla's style... > xdg_user_dir_lookup() is called and the produced string is > then passed to NS_NewNativeLocalFile() inside the following function: > static nsresult > GetUnixXDGUserDirectory() > in mozilla/xpcom/io/SpecialSystemDirectory.cpp > > The strings produced by xdg_user_dir_lookup() is passed to > NS_NewNativeLocalFile() in xpcom/io/nsNativeCharsetUtils.cpp > and then it is converted to the so called NativeCharset for file name > which, I think, is UTF-8 under linux's ext4 file system. Yeah, "native" should be UTF-8 in this case. > But due to the way the strings are represented in > ~/.config/user-dirs.dirs without BOM, I think > iconv() from libc fails to produce a valid conversion, and > reports failure. > > For example, this is a log dump from my modified TB: > > GetSpecialSystemDirectory: aSystemSystemDirectory=304 > DEBUG:xdg_user_dir_lookup: type = DESKTOP <--- looking for > what is the file name for |DESKTOP| > DEBUG: relative = 1 > DEBUG: GetUnixXDGUserDirectory: dir = /home/ishikawa/デスクトップ > > xdg_user_dir_lookup generated the above string, but > without BOM. See the dump below. > > DEBUG: dir[00] = 0x2f <--- NO BOM > DEBUG: dir[01] = 0x68 > DEBUG: dir[02] = 0x6f > DEBUG: dir[03] = 0x6d > DEBUG: dir[04] = 0x65 > DEBUG: dir[05] = 0x2f > DEBUG: dir[06] = 0x69 > DEBUG: dir[07] = 0x73 > DEBUG: dir[08] = 0x68 > DEBUG: dir[09] = 0x69 > DEBUG: dir[10] = 0x6b > DEBUG: dir[11] = 0x61 > DEBUG: dir[12] = 0x77 > DEBUG: dir[13] = 0x61 > DEBUG: dir[14] = 0x2f > DEBUG: dir[15] = 0xe3 <-- mysterious 0xe3 > DEBUG: dir[16] = 0x83 > DEBUG: dir[17] = 0x87 > DEBUG: dir[18] = 0xe3 > DEBUG: dir[19] = 0x82 > DEBUG: dir[20] = 0xb9 > DEBUG: dir[21] = 0xe3 > DEBUG: dir[22] = 0x82 > DEBUG: dir[23] = 0xaf > DEBUG: dir[24] = 0xe3 > DEBUG: dir[25] = 0x83 > DEBUG: dir[26] = 0x88 > DEBUG: dir[27] = 0xe3 > DEBUG: dir[28] = 0x83 > DEBUG: dir[29] = 0x83 > DEBUG: dir[30] = 0xe3 > DEBUG: dir[31] = 0x83 > DEBUG: dir[32] = 0x97 > ERROR: errno=84 > (This is error EILSEQ (Illegal byte sequence) on my linux ) Weird. The above is a legitimate UTF-8 byte sequence. > [29655] WARNING: conversion from native to utf-16(?) failed: file > /REF-COMM-CENTRAL/comm-central/mozilla/xpcom/io/nsNativeCharsetUtils.cpp, > line 470 > (CI commet: the above reference to 'utf-16' may be bogus. I think I put this > message myself, and maybe this ought to be 'utf-8' instead.) No, UTF-16 is right. > ERROR: res = 0xffffffffffffffff, outLeft=36, inLeft=18 > [29655] ###!!! ASSERTION: iconv failed: 'Error', file > /REF-COMM-CENTRAL/comm-central/mozilla/xpcom/io/nsNativeCharsetUtils.cpp, > line 545 > > I think the naive assumption about inLeft*2 == outLeft may not be valid > here, also. (I think the valid output length is smaller than inLeft * 2 due > the mysterious 0xe3 octets.) Where does it assume that? AFAICT, the multiplications by two don't mix input and output but are there to count UTF-16 lengths in bytes, because the inconv API uses byte buffers even for UTF-16. > I think TB eventually figures out the file pathname by sheer > heuristics: see the complex processing in nsNativeCharsetUtils.cpp. > > Anyway, some inconsistent processing language coding choice, and > the particular internal format used for supposedly UTF-8 file path name, has > made the code in nsNAtiveCharsetUtils.cpp very complex and still iconv() > fails. All this use of iconv is sad, yes. I wouldn't be opposed to dropping the iconv code paths and using the OS X / Android code (that assumes that operating system's file system APIs always take UTF-8) for all *nix platforms. I wonder if before this code runs iconv_t nsNativeCharsetConverter::gNativeToUnicode really gets initialized to a value that means conversion from UTF-8 to machine-endian UTF-16 and iconv_t nsNativeCharsetConverter::gUnicodeToNative really gets initialized to a value that means machine-endian UTF-16 to UTF-8. We have code that switches the process-wide C library locale back and forth (terrible idea!), so one angle to investigate is if some part of Gecko overwrites the process locale before nsNativeCharsetConverter::LazyInit() that initializes these iconv_t globals has a chance to see that the system locale is an UTF-8 locale. You could try changing the line const char *native_charset = nl_langinfo(CODESET); to const char *native_charset = "UTF-8"; in nsNativeCharsetConverter::LazyInit(). If this fixes the problem, the cause is probably another part of the code base overwriting what nl_langinfo(CODESET) needs to read before nl_langinfo(CODESET) gets its chance to run. > Maybe the old code donated by RedHat, that is > xdg_user_dir_lookup, may need fixing. > This function also seems to generates a buffer area one byte too short > during run-time due to the mysterious 0x3e. I am still not sure whether > valgrind warning is > for real or false-positive. > > The invalid 4 byte read reported by valgrind seems to be a > false-positive: code produced by the latest GCC for strlen seems to > use WORD load to analyze a sequence of bytes for NUL termination and > valgrind warns when the end of WORD is after the end of the previously > malloced area. libc authors and gcc developers seem to claim that > the WORD read is used only when the last portion of word falls between > a 16 byte region or something (and malloc usually returns an area of a size > rounded up to 16 and thus such word read is safe.) Sounds like a false positive, then. > Right, this post may NOT be directly related to non-UTF-8 support, but > the some issues for native <==> UTF-8 conversion handling which I found over > the last few months need mentioning and recording for future maintainer of > TB. Although file path name is supposed to be UTF-8 these days across > platforms, some > support infrastructure may not pass around quite valid UTF-8 strings among > themselves although the processing may end up correctly after heuristics. > iconv() failure was quite annoying in |make mozmill| test suite log, and is > still there, but fixing it does not seem to be easy at all. Does |make mozmill| pass if you change #if defined(XP_MACOSX) || defined(ANDROID) at the top of nsNativeCharsetUtils.cpp to say #if defined(XP_MACOSX) || defined(ANDROID) || defined(XP_UNIX) ? -- Henri Sivonen hsivo...@hsivonen.fi https://hsivonen.fi/ _______________________________________________ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform