[GENERAL] R: Chars problem restoring to ps 8.4 (utf8) a dumped db from ps 8.1 (latin9)

Bianchi Quota Leonardo Wed, 19 Aug 2015 08:33:21 -0700

Hi, I will certainly upgrade to 9.4.4! I have already downloaded the RPMs for 
PostgreSQL 9.4.4, but I would rather not upgrade until this matter is resolved, 
unless the upgrade is a prerequisite for the solution.

In answer to Tom's last post: I checked, and Bugzilla 3.2 (an old installation 
of Bugzilla) was set to "Use UTF-8 (Unicode) encoding for all text in 
Bugzilla".

Today I ran a test to provide more details, which I hope helps answer this 
question (which, if I understood correctly, is the point):
Does Bugzilla write data as UTF-8 regardless of the database's charset 
definition?
(I ran the test on Bugzilla 5.0, the latest stable release, instead of 
Bugzilla 3.2, which is my running application, because for now I don't want 
to test in the production environment.)
It would be very helpful to know whether this behavior confirms Tom's 
reasoning in general.

---------------------TEST--------------------------------
I created a new database via psql: CREATE DATABASE bugsl9test with 
owner bugs ENCODING 'LATIN9' TEMPLATE template0 LC_COLLATE 'C' LC_CTYPE 'C';
I then added two bugs, one with Bugzilla set to "utf8":"0" and the other with 
"utf8":"1" (1 means use UTF-8).
In both cases I typed the character "è" in the "Summary" field of the web form. 
In both cases the value stored in the short_desc column of the "bugs" table 
for that bug, viewed via pgAdmin III on Windows 7, is "Ãš".
However, with utf8 set to "0", Bugzilla (viewed in Chrome) shows "Ã¨" for both 
bugs, while with utf8 set to "1" it shows the stored "Ãš" CORRECTLY as "è".
-----------------------------------------------------------
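
The three renderings seen in the test are what one would expect if Bugzilla 
sends UTF-8 bytes to a database that treats each byte as a LATIN9 character. 
The Python sketch below only illustrates that interpretation. It assumes 
(not confirmed in this thread) that Bugzilla sends raw UTF-8 bytes and that 
a page without a UTF-8 label is read as Latin-1 or Windows-1252, which agree 
for these two bytes. The codec names are Python's, not PostgreSQL's.

```python
# "è" as the web form would send it if Bugzilla writes UTF-8 (assumption).
utf8_bytes = "è".encode("utf-8")

# A LATIN9 (ISO-8859-15) database stores each byte as one character.
stored_as_latin9 = utf8_bytes.decode("iso8859_15")

# A page not labelled as UTF-8 is read byte by byte as Latin-1.
shown_as_latin1 = utf8_bytes.decode("latin-1")

# A page labelled as UTF-8 decodes the same two bytes back to one character.
shown_as_utf8 = utf8_bytes.decode("utf-8")

print(utf8_bytes.hex(), stored_as_latin9, shown_as_latin1, shown_as_utf8)
```

This prints `c3a8 Ãš Ã¨ è`: byte 0xA8 is "š" in LATIN9 but "¨" in Latin-1, 
which accounts for the difference between the pgAdmin III view and the 
utf8:"0" page.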

The full note on setting utf8 to "1" or "0" reads: "Use UTF-8 
(Unicode) encoding for all text in Bugzilla. New installations should set this 
to true to avoid character encoding problems.
Existing databases should set this to true only after the data has been 
converted from existing legacy character encodings to UTF-8, using the 
contrib/recode.pl script."

recode.pl (https://github.com/bugzilla/bugzilla/blob/master/contrib/recode.pl) 
is a utility that converts a database from one encoding (or multiple 
encodings) to UTF-8. In a previous test I ran recode.pl to convert the 
data dumped as latin9 (of course editing the "client_encoding" from latin9 to 
utf8), and no "strange chars" were shown after restoring into the new utf8 
database.
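
Why relabelling the dump can recover the text is sketched below in Python. 
This is not what recode.pl does internally. It only illustrates the 
"client_encoding" edit, under the assumption that every stored value was 
originally sent as UTF-8 bytes. The stored value "Ãš" is taken from the test 
above; the variable names are illustrative.

```python
# What a LATIN9 database hands back for the mislabelled value "Ãš".
stored = "\u00c3\u0161"

# Dumping with client_encoding latin9 writes the original bytes unchanged.
dump_bytes = stored.encode("iso8859_15")

# Relabelling the dump as UTF8 makes the restore read those bytes as UTF-8.
restored = dump_bytes.decode("utf-8")

print(dump_bytes.hex(), restored)
```

This prints `c3a8 è`. If some rows had been written as genuine LATIN9, the 
last step would raise an error or give wrong characters for those rows, so 
relabelling is only safe when the data was UTF-8 all along.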

Thank you very much for your attention and patience!

Bye,
Leonardo

-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
Sent: Thursday, 13 August 2015 16:39
To: Martín Marqués
Cc: Adrian Klaver; Bianchi Quota Leonardo; 'pgsql-general@postgresql.org'
Subject: Re: [GENERAL] Chars problem restoring to ps 8.4 (utf8) a dumped db 
from ps 8.1 (latin9)

Martín Marqués <martin.marq...@gmail.com> writes:
> On 12/08/15 at 11:12, Tom Lane wrote:
>> It does not seem likely to me that this would work at all.  You're
>> taking a dump file that is full of LATIN9 data and simply asserting
>> that it's UTF8 data.  That doesn't make it so.  If it seemed to work,
>> maybe that's because your editor changed the encoding?  Not to be
>> relied on, for sure.

> Well, IIRC a LATIN9 encoding char which is interpreted as UTF8 will
> get inserted with no error on a UTF8 server (although the final data
> will be bogus).

I'd believe the other way around: if you tell the database that you're using 
LATIN9, but what you send is really UTF8, it will not reject it because the 
individual bytes are perfectly valid LATIN9 characters and there are no 
cross-byte checks to make in LATIN9.  But it seems highly unlikely that 
LATIN9-encoded data would get past the UTF8 validity checker with any 
consistency.
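
The asymmetry described above can be reproduced with a small Python sketch. 
Python's codecs stand in for the server's validity checks here, which is an 
assumption: PostgreSQL's own checks are implemented separately, and the 
helper `accepted_as` and the sample text are hypothetical.

```python
def accepted_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


text = "perché è così"
utf8_bytes = text.encode("utf-8")
latin9_bytes = text.encode("iso8859_15")

# UTF-8 bytes labelled as LATIN9: every single byte is a valid character.
print(accepted_as(utf8_bytes, "iso8859_15"))

# LATIN9 bytes labelled as UTF8: a lone 0xE9 is not a valid UTF-8 sequence.
print(accepted_as(latin9_bytes, "utf-8"))
```

This prints `True` and then `False`: a single-byte encoding has no 
cross-byte rules to violate, while UTF-8 requires every lead byte to be 
followed by the right number of continuation bytes.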

It's possible that the problem is one of mislabeling, ie the database was 
claimed to use LATIN9 but what was actually sent was always UTF8.
If that was *always* the case then the OP's fix of changing the label in the 
dump file was actually the right thing to do.  But we haven't been given enough 
information to be sure of that --- and if that's what was happening, then some 
client software fixes would be in order anyway, because the client code was 
using the wrong client_encoding.

                        regards, tom lane
