Re: [GENERAL] invalid UTF-8 byte sequence detected

Bruce Momjian Tue, 15 Nov 2005 10:50:57 -0800

I am also confused how invalid UTF8 sequences got into your database. 
It shouldn't have been possible.


---------------------------------------------------------------------------

Markus Wollny wrote:
> Hello!
> 
> I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; 
> in this process I have stumbled a couple of times over certain errors in 
> text-fields that lead to error-messages during import of the dump like these:
> 
> <2005-11-09 14:57:34 CET - 9354: [EMAIL PROTECTED]>ERROR:  invalid UTF-8 byte 
> sequence detected near byte 0xb4
> <2005-11-09 14:57:34 CET - 9354: [EMAIL PROTECTED]>CONTEXT:  COPY 
> board_message, line 1125662, column text: "HI
> 
>         Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden 
> kaufe, da in beiden Addon?s viel..."
> <2005-11-09 14:57:34 CET - 9354: [EMAIL PROTECTED]>STATEMENT:  COPY 
> board_message (board_id, thread_id, father_id, message_id, user_id, title, 
> signat
> ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, 
> state_id, user_login, user_status, user_rank, user_rank_description, 
> user_rank_picture, user_rights, text, deleted_user_id, deleted_date, 
> deleted_login, user_created, poll_id, idxfti) FROM stdin;
> 
> <2005-11-09 14:57:49 CET - 9354: [EMAIL PROTECTED]>ERROR:  invalid UTF-8 byte 
> sequence detected near byte 0x98
> <2005-11-09 14:57:49 CET - 9354: [EMAIL PROTECTED]>CONTEXT:  COPY 
> kidszone_tournament2005_user, line 427, column phone: "02302?74"
> <2005-11-09 14:57:49 CET - 9354: [EMAIL PROTECTED]>STATEMENT:  COPY 
> kidszone_tournament2005_user (id, first_name, last_name, adress, birthday, 
> phone,
>  email, place, permission, ude, ude_box, invited) FROM stdin;
> 
> There are not too many occurrences of the same type - five altogether in a 
> 1.8GB compressed dumpfile, but still it has me worried and leaves me with 
> some questions:
> 
> 1.) How could I have prevented insertion of these invalid byte-sequences in 
> the first place? We're using UTF-8 encoded databases, data is mostly inserted 
> by users via browser applications, our websites are UTF-8 encoded, too, but 
> still we cannot really make 100% sure that all clients behave as expected; on 
> the other hand, it would be extremely inconvenient if we had to check each 
> and every text input for character set conformance in the application, so is 
> there a way to ascertain "sane" data via some database-setting? pg_restore 
> does throw this error and indeed terminates after that (I used custom dump 
> format for pg_dump), psql on the other hand just continues with the import 
> (using a pgdumpall-output that generates a standard SQL-script), although it 
> too throws the error.
> 
> 2.) How does this really affect the value of the database-dumps? psql 
> continues with import after the error, but the table where this error 
> occurred remains empty, as the affected COPY-statement has failed altogether 
> due to this error. So a plain no-worries import in my case would present me a 
> result with five tables empty - one of them quite large... Is there some kind 
> of magic, maybe involving some perl or whatever, that could help to clean up 
> the dump before the import, so I can accomplish a full restore?
> 
> Kind regards,
> 
>    Markus
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq

Re: [GENERAL] invalid UTF-8 byte sequence detected

Reply via email to