Re: [GENERAL] invalid UTF-8 byte sequence detected

Markus Wollny Wed, 16 Nov 2005 02:25:11 -0800

Hi! 

> -----Ursprüngliche Nachricht-----
> Von: Bruce Momjian [mailto:[EMAIL PROTECTED] 
> Gesendet: Dienstag, 15. November 2005 19:46
> An: Markus Wollny
> Cc: pgsql-general@postgresql.org
> Betreff: Re: [GENERAL] invalid UTF-8 byte sequence detected
> 
> 
> I am also confused how invalid UTF8 sequences got into your database. 
> It shouldn't have been possible.
>


Our databases were originally encoded in ASCII, though they should have been 
LATIN9 - or UNICODE; this has been remedied a long time ago using iconv on the 
dumps; our PostgreSQL-version then was 7.4 and we converted databases and dumps 
to UTF-8. Maybe the invalid byte sequences have been entered sometimes during 
our migration from ODBC to JDBC while our encoding was still a mess - though I 
would have thought that this should have been resolved by dump&iconv&restore 
then. However, I do suspect that the cause of the issue was really more or less 
a bug in PostgreSQL <8.1, which accepted certain illegal byte sequences. I our 
case, I found that the re-import of the dump errored out on ISO-8859-1 encoded 
backticks (´) - certain mournfully misled individuals do use this 
"degu"-character instead of the apostrophe even tough it's more difficult to 
type on a german keyboard layout. And quite wrong, too.

Anyway, I found some reference in the hackers-list that encoding-consistency 
for Unicode has been tightened down (see 
http://archives.postgresql.org/pgsql-hackers/2005-10/msg00972.php ). Both a 
solution and a suggestion have been posted in this thread; Christopher 
Kings-Lynne has suggested to include a reference to this issue in the 
'upgrading/back compatibiliy' section for these release notes - I do strongly 
second his suggestion :)

The suggested solution was to feed the plain dump again through iconv; however 
at least on my systems (Debian Sarge) iconv didn't like my >5GB sized dump 
files. So in order to successfully reimport the dumps, I had to "split 
--line-bytes=650m" the SQL-file, pass the parts through iconv -c -f UTF8 -t 
UTF8 and concatenate them back into one file again. There were no more errors 
on feeding the dump back into psql and I didn't come across any missing data 
during my tests, so this has definitely done the trick for me.

As 8.1 has tightened down encoding-consistency for Unicode, I believe that the 
databases should be safe from any illegal byte-sequences in text-fields from 
now on. 

Kind regards

   Markus

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Re: [GENERAL] invalid UTF-8 byte sequence detected

Reply via email to