On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>
> Hi All,
>
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
>
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> ----------
>  \xefbc8d
> (1 row)
>
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS SIGN.
>
> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert functions fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"
>
> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> ---------
>  \x817c
> (1 row)
>
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

So we have a1dd in euc_jp, 817c in sjis, and efbc8d in utf-8, which
convert between each other just fine.

But when it comes to e28892 in utf-8, it currently converts only to
sjis, and even that only in one direction:

select convert('\xe28892', 'utf-8', 'sjis');
 convert
---------
 \x817c
(1 row)

select convert('\x817c', 'sjis', 'utf-8');
 convert
----------
 \xefbc8d
(1 row)

I noticed that commit a8bd7e1c6e02 from ages ago removed the
conversions from and to utf-8's e28892 in favor of efbc8d, and that
change has stuck. (Note though that these maps looked pretty different
back then.)

--- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
-  {0xa1dd, 0xe28892},
+  {0xa1dd, 0xefbc8d},

--- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
-  {0xe28892, 0xa1dd},
+  {0xefbc8d, 0xa1dd},

I can't tell what the reason for that was, but there must have been
one. Maybe the Japanese character sets prefer FULLWIDTH HYPHEN-MINUS
(Unicode U+FF0D) over the mathematical MINUS SIGN (U+2212)?

-- 
Amit Langote
EDB: http://www.enterprisedb.com
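
P.S. As a side check on the two UTF-8 byte sequences discussed above, here is a minimal Python sketch that prints the code point, Unicode name, and UTF-8 bytes of each candidate character. It uses only Python's standard library and says nothing about PostgreSQL's conversion maps; the helper name describe() is made up for illustration.

```python
import unicodedata

# The two candidate UTF-8 targets for EUC_JP 0xa1dd discussed in this thread.
CANDIDATES = ["\u2212", "\uff0d"]


def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())


for ch in CANDIDATES:
    print(*describe(ch))
# prints:
# U+2212 MINUS SIGN e28892
# U+FF0D FULLWIDTH HYPHEN-MINUS efbc8d
```

This only confirms which character each byte sequence encodes; which of the two EUC_JP 0xa1dd ought to map to is still the open question.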