Re: [GENERAL] Why is an ISO-8859-8 database allowing values not within that set?

Herouth Maoz Sun, 22 Jul 2012 00:59:59 -0700

Thanks. That makes sense. The default client encoding on the reports database 
is ISO-8859-8, so I guess when I don't set it using \encoding, it does exactly 
what you say.


OK, so I'm still looking for a way to convert illegal characters into something 
that won't collide with my encoding (asterisks or whatever).

Thank you,
Herouth


On 21/07/2012, at 15:36, Craig Ringer wrote:

> On 07/21/2012 04:59 PM, Herouth Maoz wrote:
>> I am using Postgresql 8.3.14 on our reporting system. There are scripts that 
>> collect data from many databases across the firm into this database. 
>> Recently I added tables from a particular database which has encoding UTF-8.
>> 
> 
> First, I know there have been encoding and UTF-8 handling fixes since 8.3 . 
> It'd be interesting to see if this still happens on a more recent version.
> 
> You're also missing five bug-fix point-releases in the 8.3 series, as the 
> latest is 8.3.19 . See:
>            http://www.postgresql.org/docs/8.3/static/release.html
> for fixes you're missing.
> 
> Explanation for what I think is going on below:
> 
>> 
>> But this puzzles me, because I then took the file
>> 
> ... which was created with a \copy with client encoding set to utf-8, right?
>> ran psql and \copy <table> from file
>> 
> 
> With which client encoding set? UTF-8 or ISO_8859_8? I bet you copied it in 
> with ISO_8859_1.
>> And it worked. I tried it again now, and I can see the row with its Arabic 
>> content, even though it is not in the database encoding.
> It shows up correctly?
> 
> If you \copy a dump in utf-8, then \copy it back in with ISO_8859_8, it 
> should be mangled.
> 
> If you set your client_encoding to utf_8 ("\encoding utf-8") does it still 
> show up correctly? I suspect it's wrong in the database and you're just 
> unmangling it on display. 
> 
> It would help if you would actually show the bytes of:
> 
> - The chars in the \copy dump, using `xxd' or similar
> - The chars in the database before the copy out and copy in, using a CAST to 
> `bytea`
> - The chars in the database AFTER the copy out and copy in, again with a CAST 
> to `bytea`
> 
> ... as well as the database encoding, NOT just the client encoding (see 
> below):
>> I checked \encoding. It replies
>> ISO_8859_8
> That is the client encoding.
> 
> Try:
> 
>    \l+
> 
> to list databases. You'll see the database encoding there. The same info is 
> available from:
> 
>   SELECT datname, encoding from pg_database WHERE datname = 'mydatabase';
> 
> 
> Maybe this demo will help enlighten you.
> 
> regress=# select version();                                                   
>  version                                                    
> -------------------------------------------------------------------------------------------------------------
>   PostgreSQL 9.1.4 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 4.7.0 
> 20120507 (Red Hat 4.7.0-5), 64-bit (1 row) regress=# \l+                      
>                                                             List of databases 
>         Name        |   Owner    | Encoding |   Collate   |    Ctype    |     
> Access privileges     |  Size   | Tablespace |                Description     
>              
> --------------------+------------+----------+-------------+-------------+---------------------------+---------+------------+--------------------------------------------
>   ....  regress            | craig      | UTF8     | en_US.UTF-8 | 
> en_US.UTF-8 |                           | 41 MB   | pg_default |  regress=# 
> CREATE TABLE enctest (a serial, x text); CREATE TABLE regress=# -- Some text 
> randomly pulled off Google News taiwan, since it was convenient and the exact 
> text doesn't matter regress=# insert into enctest (x) values ('心情已平復很多」。 
> 中國網絡電視台報導'); INSERT 0 1 regress=# \encoding UTF8 regress=# \copy ( select x 
> from enctest ) to enctest.csv Set a 1-byte non-utf encoding, doesn't really 
> matter which one. Then import the data we dumped as utf-8. regress=# 
> \encoding latin-1 regress=# \copy enctest(x) from enctest.csv enctest now 
> contains two rows. One is the correctly encoded original, one is the dumped 
> and reloaded one. We can't view the whole table while we're in latin-1 
> encoding because the correct row won't translate right. regress=# select * 
> from enctest; ERROR:  character 0xe5bf83 of encoding "UTF8" has no equivalent 
> in "LATIN1" but we *CAN* view the second row we dumped as utf-8 then imported 
> as latin-1: regress=# regress=# select * from enctest where a = 2;  a |       
>                      x                             
> ---+---------------------------------------------------------        2 | 
> 心情已平復很多」。 中國網絡電視台報導 (1 row) regres At this point you're probably thinking 
> "WTF!?!". It shows up correctly in my terminal because my terminal is utf-8, 
> irrespective of the encoding set in psql. Setting a non-utf-8 encoding in 
> psql via "\encoding" just lies to psql about the encoding of the bytes I 
> paste in on my terminal. It receives a byte sequence that could be valid 
> latin-1- though it's actually nonsense garbage, it can't tell the difference. 
> It trusts me and translates the "latin-1" I sent into utf-8 for storage. When 
> I ask for it back again and I'm in a latin-1 client encoding, it converts 
> that utf-8 back into latin-1 - or that's what it thinks it's doing. It's 
> actually demangling mangled utf-8 so my console can display it. We can 
> undestand this better if we examine what's actually in the database. 
> regress=# \encoding utf-8 regress=# select * from enctest;  a |               
>                                                         x                     
>                                               
> ---+------------------------------------------------------------------------------------------------------------------------------------
>   1 | 心情已平復很多」。 中國網絡電視台報導  2 | 
> å¿\u0083æ\u0083\u0085å·²å¹³å¾©å¾\u0088å¤\u009Aã\u0080\u008Dã\u0080\u0082 
> ä¸å\u009C\u008Bç¶²çµ¡é\u009B»è¦\u0096å\u008F°å ±å°\u008E (2 rows) regress=# 
> select a, x::bytea from enctest;  a |                                         
>                                                                      x        
>                                                                               
>                                  
> ---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  --------------------------------------------------  1 | 
> \xe5bf83e68385e5b7b2e5b9b3e5bea9e5be88e5a49ae3808de3808220e4b8ade59c8be7b6b2e7b5a1e99bbbe8a696e58fb0e5a0b1e5b08e
>   2 | 
> \xc3a5c2bfc283c3a6c283c285c3a5c2b7c2b2c3a5c2b9c2b3c3a5c2bec2a9c3a5c2bec288c3a5c2a4c29ac3a3c280c28dc3a3c280c28220c3a4c2b8c2adc3a5c29cc28bc3a7c2b6c2b2c3a7c2b5c2a1c3a9c29bc2b
>  bc3a8c2a6c296c3a5c28fc2b0c3a5c2a0c2b1c3a5c2b0c28e (2 rows) This should help. 
> See how the first row shows up correctly when we're in the right client 
> encoding, but the second one is gibberish? That's what UTF-8 that's been 
> interpreted as latin-1 and "converted" into utf-8 looks like. It's even more 
> informative if I start a terminal in latin-1 and then set client_encoding to 
> latin-1, so I'm not lying to psql about my client encoding.  $ 
> LANG=en_AU.iso55891 LC_ALL=en_AU.iso88591 xterm $ psql regress regress=# 
> \encoding LATIN-1 regress=# select x from enctest where a = 2;                
>               x                             
> ---------------------------------------------------------  å¿æ          ã 
> ä¸åç¶²çµ¡é¦å°å ±å° (1 row)¾©å¾å¤ã Note that latin-1 is a synonym for 
> iso-8859-1. Now you can see that the data in the DB is actually mangled. It's 
> just that when you accidentally lie to Pg the same way in the input and 
> output phases as I've shown, it *looks* ok though the byte sequence in the 
> database is garbage. Here's yet another way to illustrate it. Let's take the 
> first char of our input and see what data is produced when we convert its 
> utf-8 byte using a conversion function from latin-1 into utf-8. regress=# 
> select '心'::bytea, convert( '心'::bytea, 'latin-1', 'utf-8') from enctest;   
> bytea   |    convert      ----------+----------------  \xe5bf83 | 
> \xc3a5c2bfc283 (1 row) Look familiar from the example above? See how the 
> valid utf-8 sequence in the first col gets converted into garbage in the 
> second col? Yet we can reverse the incorrect conversion to get valid utf-8 
> again: regress=# select convert( BYTEA '\xc3a5c2bfc283', 'utf-8', 'latin-1'); 
>  convert   ----------  \xe5bf83 (1 row) ... which is probably what you've 
> been doing.
>> 
>> What's happening here? Why does the database accept input in the wrong 
>> encoding and doesn't shout when I then try to select that input?
>> 
> Correct, because legacy 1-byte encodings cannot be verified. There's no way 
> to say "Yup, this is latin-1" or "Yup, this is ISO-8859-8".
> 
> Don't set client_encoding unless you REALLY know encodings. If you do set it 
> for \copy, make sure you always \copy in with the same encoding you used for 
> the \copy out. PostgreSQL cannot protect you from this.
> 
> --
> Craig Ringer
> 



--
חרות מעוז
יוניסל פתרונות סלולריים מתקדמים
☎ 03-5181717 שלוחה 742

Re: [GENERAL] Why is an ISO-8859-8 database allowing values not within that set?

Reply via email to