On Fri, Oct 30, 2020 at 8:49 AM Kyotaro Horiguchi <horikyota....@gmail.com> wrote:
>
> Hello.
>
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma <ashu.coe...@gmail.com> wrote in
> > Hi All,
> >
> > Today while working on some other task related to database encoding, I
> > noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> > mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> > UTF-8. See below:
> >
> > postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
> >  convert
> > ----------
> >  \xefbc8d
> > (1 row)
> >
> > Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> > (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> > HYPHEN-MINUS SIGN.
>
> No it's not a bug, but a well-known "design":(
>
> The mapping is generated from CP932.TXT and JIS0212.TXT by
> UCS_to_EUC_JP.pl.
>
> CP932.TXT used here is here.
>
> https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
>
> CP932.TXT maps 0x817C(SJIS) = 0xA1DD(EUC-JP) as follows.
>
> 0x817C 0xFF0D #FULLWIDTH HYPHEN-MINUS
>
We do have MINUS SIGN (U+2212) defined in both the UTF-8 and EUC-JP
encodings, so I'm not sure why converting MINUS SIGN from UTF-8 to EUC-JP
should throw an error saying: "... in encoding UTF8 has *no* equivalent in
EUC_JP". This message looks misleading, and that's the reason I feel it's
a bug.

> > When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> > converted to EUC-JP, the convert functions fails with an error saying:
> > "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> > equivalent in encoding EUC_JP". See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> > ERROR: character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> > has no equivalent in encoding "EUC_JP"
>
> U+FF0D(ef bc 8d) is mapped to 0xa1dd@euc-jp
> U+2212(e2 88 92) doesn't have a mapping between euc-jp.
>
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> >  convert
> > ---------
> >  \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
>
> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe can add some
> one-way conversions for the convenience.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
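
For reference, the behaviour described in this thread can be modelled with a small Python sketch. It is only a toy model built from the mappings quoted above, not PostgreSQL's actual conversion tables; the names TO_UNICODE, FROM_UNICODE, encode and decode are hypothetical and exist only for illustration.

```python
# Toy model of the mappings discussed in this thread.
# Assumption: only the entries quoted in the thread are modelled; the real
# tables are generated by UCS_to_EUC_JP.pl and UCS_to_SJIS.pl.

MINUS_SIGN = "\u2212"              # UTF-8 bytes: e2 88 92
FULLWIDTH_HYPHEN_MINUS = "\uff0d"  # UTF-8 bytes: ef bc 8d

# Round-trip pairs taken from CP932.TXT (0x817C@sjis = 0xA1DD@euc-jp).
TO_UNICODE = {
    "sjis": {b"\x81\x7c": FULLWIDTH_HYPHEN_MINUS},
    "euc_jp": {b"\xa1\xdd": FULLWIDTH_HYPHEN_MINUS},
}

# Start from the exact inverse of the round-trip pairs ...
FROM_UNICODE = {
    enc: {ch: raw for raw, ch in table.items()}
    for enc, table in TO_UNICODE.items()
}
# ... then add the one-way entry that exists for SJIS only.
FROM_UNICODE["sjis"][MINUS_SIGN] = b"\x81\x7c"


def encode(ch, enc):
    """Convert one Unicode character to bytes in the modelled encoding."""
    try:
        return FROM_UNICODE[enc][ch]
    except KeyError:
        raise ValueError(
            f"U+{ord(ch):04X} has no equivalent in {enc}"
        ) from None


def decode(raw, enc):
    """Convert bytes in the modelled encoding to one Unicode character."""
    return TO_UNICODE[enc][raw]


if __name__ == "__main__":
    # Ping-pong between Unicode and SJIS: U+2212 => 0x817c => U+ff0d => 0x817c
    raw = encode(MINUS_SIGN, "sjis")
    ch = decode(raw, "sjis")
    print(raw.hex(), f"U+{ord(ch):04X}", encode(ch, "sjis").hex())

    # EUC-JP has no one-way entry for U+2212 in this model, so this fails.
    try:
        encode(MINUS_SIGN, "euc_jp")
    except ValueError as err:
        print(err)
```

In this model, a one-way conversion for EUC-JP of the kind suggested above would amount to one extra FROM_UNICODE["euc_jp"] entry for U+2212, leaving the round-trip pairs untouched.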