Its UTF-8. Also verified the load file and its utf-8. Regards, Kiran
On Fri, Jan 12, 2024 at 10:48 PM Adrian Klaver <adrian.kla...@aklaver.com> wrote: > On 1/12/24 07:23, Kiran K V wrote: > > Hi, > > > > > > I have a UTF8 database and simple table with two columns (integer and > > varchar). Created a csv file with some multibyte characters and trying > > to perform load operation using the copy command. > > The multibyte characters come from what character set? > > > > > __ __ > > > > Database info:____ > > > > Postgresql database details:____ > > > > Name | Owner | Encoding | Collate | > > Ctype | Access privileges____ > > > > > -----------+----------+----------+--------------------+--------------------+-----------------------____ > > > > postgres | postgres | UTF8 | English_India.1252 | > > English_India.1252 |____ > > > > __ __ > > > > (Note: I also tried with collate utf8 and no luck) > > > > > > postgres=# set client_encoding='UTF8';____ > > > > SET____ > > > > __ __ > > > > Table:____ > > > > create table public.test ( PKCOL integer not null, STR1 character > > varying(64) null, primary key( PKCOL )) ____ > > > > ____ > > > > csv contents:____ > > > > 1|"àáâãäåæçèéêëìíîï"____ > > > > __ __ > > > > After data loading, actual data is becoming____ > > > > à áâãäåæçèéêëìÃîï____ > > > > hex of this is - > > c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af____ > > > > __ __ > > > > The hex values are indeed the UTF-8 encodings of the characters in your > > expected string, and the presence of `C2` before each character is > > indicative of how UTF-8 represents certain characters.____ > > > > In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`, > > etc.) are represented as two bytes. The first byte `C2` or `C3` > > indicates that this is a two-byte character, and the second byte > > specifies the character. For example:____ > > > > - `à` is represented as `C3 A0`____ > > > > - `á` is `C3 A1`____ > > > > - `â` is `C3 A2`, and so on.____ > > > > In this case, the `C2` byte is getting interpreted as a separate > > character and that is the likely reason that an `Â` (which corresponds > > to `C2`) is seen before each intended character. Looks like UTF-8 > > encoded data is mistakenly interpreted as Latin-1 (ISO-8859-1) or > > Windows-1252, where each byte is treated as a separate character. > > > > > > Please advise. Thank you very much. > > > > > > Regards, > > > > Kiran > > > > -- > Adrian Klaver > adrian.kla...@aklaver.com > >