From: Frank Lichtenheld <[EMAIL PROTECTED]>
Subject: Bug#227273: www.debian.org: Japanese DDTP files are provided with EUC-JP endoding.
Date: Thu, 29 Jan 2004 01:25:26 +0100

> On Thu, Jan 29, 2004 at 12:16:19AM +0100, Frank Lichtenheld wrote:
> > I used the Perl module Text::Iconv which itself uses iconv(3)
> > This module seems to suck or I am too dumb to use it. If I convert the
> > raw Japanese Packages file with iconv(1) (which probably uses iconv(3),
> > too) all escape sequences seem to be generated correctly, if I use
> > Text::Iconv->convert, only the very first one is.
>
> Correction, it only forgets the very last escape sequence since this one is
> not generated by iconv(3). It "forgets" to clear the state at the end of
> the conversion which I found out in comparison with iconv(1) that handles
> this case correctly. I prepared a patch and will file a bug against the
> package.

I tested on gluck (packages.debian.org):

(a)
  $ echo -en '\xa4\xa2' | iconv -f EUC-JP -t ISO-2022-JP | od -t x1
  0000000 1b 24 42 24 22 1b 28 42
  0000010

The last three bytes are the closing escape sequence. Thus iconv(1)
works well.

Next, I wrote the following script:

(b)
  #!/usr/bin/perl
  use Text::Iconv;
  $conv = Text::Iconv->new("EUC-JP", "ISO-2022-JP");
  # Read all of standard input into one string.
  $a = "";
  while (<>) {
      $a .= $_;
  }
  # Convert the whole string in a single call.
  $b = $conv->convert($a);
  print $b;

Then

(c)
  $ echo -ne '\xa4\xa2' | ./a.pl | od -t x1
  0000000 1b 24 42 24 22
  0000005

In this case, the closing escape sequence is missing. However, if the
source string has further characters after the JIS X 0208 Japanese
characters, like:

(d)
  $ echo -e '\xa4\xa2' | ./a.pl | od -t x1
  0000000 1b 24 42 24 22 1b 28 42 0a
  0000011

(e)
  $ echo -ne '\xa4\xa2\x41' | ./a.pl | od -t x1
  0000000 1b 24 42 24 22 1b 28 42 41
  0000011

Then the closing escape sequence is added.

Explanation:

In case (e), it is clear that the closing escape sequence is needed,
because an ASCII character follows. In case (d), it is also needed,
because ISO-2022-JP requires the "state" to be ASCII when a Line Feed
appears. In case (c), Text::Iconv does not know whether the following
string will be Japanese or ASCII. Adding a closing escape sequence would
be redundant if Japanese text followed.
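
The same statefulness can be seen outside of Text::Iconv. Here is a
minimal sketch using Python's standard codecs module instead (so it says
nothing about the internals of Text::Iconv or iconv(3); the function
names convert_whole and convert_incremental are made up for this
illustration). An incremental encoder emits the closing escape sequence
only once it is told that no more input follows:

```python
import codecs


def convert_whole(euc_bytes):
    """One-shot conversion: the codec knows the input is complete."""
    return euc_bytes.decode("euc_jp").encode("iso2022_jp")


def convert_incremental(euc_chunks):
    """Convert chunk by chunk; return (per-chunk outputs, final flush)."""
    decoder = codecs.getincrementaldecoder("euc_jp")()
    encoder = codecs.getincrementalencoder("iso2022_jp")()
    # While chunks are still arriving, the encoder cannot know whether more
    # Japanese text follows, so it leaves the JIS X 0208 state open.
    outputs = [encoder.encode(decoder.decode(chunk)) for chunk in euc_chunks]
    # Only final=True tells the encoder that nothing more follows, so only
    # here can it emit the escape sequence that returns to ASCII.
    flush = encoder.encode(decoder.decode(b"", final=True), final=True)
    return outputs, flush


if __name__ == "__main__":
    # Compare with (a): whole input converted at once.
    print(convert_whole(b"\xa4\xa2").hex(" "))
    # Compare with (c): chunk output, then the separate final flush.
    outputs, flush = convert_incremental([b"\xa4\xa2"])
    print(outputs[0].hex(" "), "|", flush.hex(" "))
    # Compare with (d): a trailing Line Feed forces the return to ASCII.
    print(convert_incremental([b"\xa4\xa2\n"])[0][0].hex(" "))
```

With iconv(3) itself, the corresponding final step is a last call with a
NULL input buffer, which writes whatever sequence is needed to return the
output to its initial shift state.
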
I imagine this is why Text::Iconv does not add the closing escape
sequence in this case.

I think the safest way is to use Text::Iconv to convert the whole web
page at once (or at least one whole logical line, ending with a Line
Feed, at a time).

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://www.debian.or.jp/~kubota/