[GENERAL] invalid UTF-8 byte sequence detected

Markus Wollny Wed, 09 Nov 2005 11:49:55 -0800

Hello!

I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; in 
this process I have stumbled a couple of times over certain errors in 
text-fields that lead to error-messages during import of the dump like these:


<2005-11-09 14:57:34 CET - 9354: [EMAIL PROTECTED]>ERROR:  invalid UTF-8 byte 
sequence detected near byte 0xb4
<2005-11-09 14:57:34 CET - 9354: [EMAIL PROTECTED]>CONTEXT:  COPY 
board_message, line 1125662, column text: "HI

        Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden kaufe, 
da in beiden Addon▒s viel..."
<2005-11-09 14:57:34 CET - 9354: [EMAIL PROTECTED]>STATEMENT:  COPY 
board_message (board_id, thread_id, father_id, message_id, user_id, title, 
signat
ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, 
state_id, user_login, user_status, user_rank, user_rank_description, 
user_rank_picture, user_rights, text, deleted_user_id, deleted_date, 
deleted_login, user_created, poll_id, idxfti) FROM stdin;

<2005-11-09 14:57:49 CET - 9354: [EMAIL PROTECTED]>ERROR:  invalid UTF-8 byte 
sequence detected near byte 0x98
<2005-11-09 14:57:49 CET - 9354: [EMAIL PROTECTED]>CONTEXT:  COPY 
kidszone_tournament2005_user, line 427, column phone: "02302▒74"
<2005-11-09 14:57:49 CET - 9354: [EMAIL PROTECTED]>STATEMENT:  COPY 
kidszone_tournament2005_user (id, first_name, last_name, adress, birthday, 
phone,
 email, place, permission, ude, ude_box, invited) FROM stdin;

There are not too many occurrences of the same type - five altogether in a 
1.8GB compressed dumpfile, but still it has me worried and leaves me with some 
questions:

1.) How could I have prevented insertion of these invalid byte-sequences in the 
first place? We're using UTF-8 encoded databases, data is mostly inserted by 
users via browser applications, our websites are UTF-8 encoded, too, but still 
we cannot really make 100% sure that all clients behave as expected; on the 
other hand, it would be extremely inconvenient if we had to check each and 
every text input for character set conformance in the application, so is there 
a way to ascertain "sane" data via some database-setting? pg_restore does throw 
this error and indeed terminates after that (I used custom dump format for 
pg_dump), psql on the other hand just continues with the import (using a 
pgdumpall-output that generates a standard SQL-script), although it too throws 
the error.

2.) How does this really affect the value of the database-dumps? psql continues 
with import after the error, but the table where this error occurred remains 
empty, as the affected COPY-statement has failed altogether due to this error. 
So a plain no-worries import in my case would present me a result with five 
tables empty - one of them quite large... Is there some kind of magic, maybe 
involving some perl or whatever, that could help to clean up the dump before 
the import, so I can accomplish a full restore?

Kind regards,

   Markus

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq

[GENERAL] invalid UTF-8 byte sequence detected

Reply via email to