Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

Amit Langote Thu, 29 Oct 2020 20:09:24 -0700

On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>
> Hi All,
>
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
>
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> ----------
>  \xefbc8d
> (1 row)
>
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS?
>
> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert function fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"
>
> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> ---------
>  \x817c
> (1 row)
>
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?


So we have

a1dd in euc_jp,
817c in sjis,
efbc8d in utf-8

that convert between each other just fine.
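
As a side check on why a1dd and 817c belong together: both are encodings of the same JIS X 0208 position (row 1, cell 61). Below is a minimal Python sketch of the standard EUC-JP and Shift_JIS byte arithmetic. The function names are made up for illustration, it handles only two-byte JIS X 0208 characters, and it has nothing to do with how PostgreSQL's conversion code is written.

```python
def euc_to_kuten(b: bytes):
    """Row/cell (kuten) of a two-byte EUC-JP JIS X 0208 character."""
    # EUC-JP stores row and cell directly, each offset by 0xA0.
    return b[0] - 0xA0, b[1] - 0xA0


def sjis_to_kuten(b: bytes):
    """Row/cell (kuten) of a two-byte Shift_JIS JIS X 0208 character."""
    lead, trail = b[0], b[1]
    # Lead bytes run 0x81..0x9F and 0xE0..0xEF; each covers two rows.
    lead_index = lead - 0x81 if lead <= 0x9F else lead - 0xC1
    # Trail bytes run 0x40..0xFC, skipping 0x7F.
    trail_index = trail - 0x40 if trail < 0x7F else trail - 0x41
    row = 2 * lead_index + 1
    if trail_index >= 94:  # second row of the pair
        row += 1
        trail_index -= 94
    return row, trail_index + 1


if __name__ == "__main__":
    print(euc_to_kuten(bytes.fromhex("a1dd")))
    print(sjis_to_kuten(bytes.fromhex("817c")))
```

Both calls give the same (row, cell) pair, so the only open question is which Unicode character that position should map to.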

But when it comes to

e28892 in utf-8

it currently converts only to sjis, and even then only one way:

select convert('\xe28892', 'utf-8', 'sjis');
 convert
---------
 \x817c
(1 row)

select convert('\x817c', 'sjis', 'utf-8');
 convert
----------
 \xefbc8d
(1 row)

I noticed that the commit a8bd7e1c6e02 from ages ago removed
conversions from and to utf-8's e28892, in favor of efbc8d, and that
change has stuck.  (Note though that these maps looked pretty
different back then.)

--- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
-  {0xa1dd, 0xe28892},
+  {0xa1dd, 0xefbc8d},

--- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
-  {0xe28892, 0xa1dd},
+  {0xefbc8d, 0xa1dd},
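
To illustrate the effect of those two entries, here is a toy Python model of the maps as they stand after that change. It is only a sketch: the dictionaries hold just the pairs shown in the diff, and the names (EUC_JP_TO_UTF8, UTF8_TO_EUC_JP, NoEquivalent, lookup) are hypothetical, not PostgreSQL's actual data structures or functions.

```python
# Toy stand-ins for the two .map files, containing only the entries
# from the diff above (post-change state). Hypothetical names.
EUC_JP_TO_UTF8 = {0xA1DD: 0xEFBC8D}
UTF8_TO_EUC_JP = {0xEFBC8D: 0xA1DD}


class NoEquivalent(LookupError):
    """Raised when the source code has no entry in the map."""


def lookup(table, code, src, dst):
    """Return the mapped code, or raise if the map has no entry."""
    try:
        return table[code]
    except KeyError:
        raise NoEquivalent(
            f"character 0x{code:x} in encoding {src} "
            f"has no equivalent in encoding {dst}"
        ) from None


if __name__ == "__main__":
    # a1dd <-> efbc8d round-trips through the two maps.
    utf8 = lookup(EUC_JP_TO_UTF8, 0xA1DD, "EUC_JP", "UTF8")
    print(hex(utf8), hex(lookup(UTF8_TO_EUC_JP, utf8, "UTF8", "EUC_JP")))
    # e28892 has no entry any more, so the lookup fails.
    try:
        lookup(UTF8_TO_EUC_JP, 0xE28892, "UTF8", "EUC_JP")
    except NoEquivalent as exc:
        print("error:", exc)
```

With only one entry per direction, a1dd and efbc8d round-trip, and e28892 is left without a partner, which matches the error in the original report.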

I can't tell what the reason for that was, but there must have been
one.  Maybe the Japanese character sets prefer FULLWIDTH HYPHEN-MINUS
(Unicode U+FF0D) over the mathematical MINUS SIGN (U+2212)?
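
For anyone who wants to confirm which characters the two UTF-8 byte sequences are, a small Python sketch using only the standard unicodedata module follows. The helper name describe is made up for illustration. It reports the code point, the Unicode name, and the NFKC-normalized form, the last being relevant because the two characters are not compatibility-equivalent to each other.

```python
import unicodedata


def describe(utf8_hex: str):
    """Return (code point, Unicode name, NFKC form) for one UTF-8 character."""
    ch = bytes.fromhex(utf8_hex).decode("utf-8")
    nfkc = unicodedata.normalize("NFKC", ch)
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        " ".join(f"U+{ord(c):04X}" for c in nfkc),
    )


if __name__ == "__main__":
    for seq in ("e28892", "efbc8d"):
        print(seq, describe(seq))
```

Note that NFKC folds U+FF0D to the ASCII hyphen-minus U+002D and leaves U+2212 unchanged, so Unicode normalization alone would not make the two sequences interchangeable.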

-- 
Amit Langote
EDB: http://www.enterprisedb.com
