On Sun, Apr 3, 2011 at 12:49 PM, Glenn Maynard <gl...@zewt.org> wrote:
> After dumping a database (pg_dump -F c database > dump), trying to restore > it (pg_restore dump) gives: > > > pg_restore: [archiver (db)] Error from TOC entry 2463; 0 58451 TABLE DATA > table user > > pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence > for encoding "UTF8": 0xe3273a > > HINT: This error can also happen if the byte sequence does not match the > encoding expected by the server, which is controlled by "client_encoding". > > CONTEXT: COPY table, line 1 > > The surface reason for this is clear enough: invalid UTF-8 data crept into > some tsvector columns. This is a much more serious problem, however: the > backup tools for the database are, without warning, generating data that > can't be restored. > > When in a data recovery situation, a backup that won't restore is > catastrophic. I can't restore the database to a state it was in at the time > of the backup; I have to spend hours of downtime figuring out what to do to > make something usable out of my backup; and then I have to hope I've > corrected the backup correctly before bringing the server back online. (If > I was in an actual backup recovery situation--fortunately I'm not--I'd be > more inclined to edit the Postgresql source to disable this check while > restoring the backup than to risk trying to manually fix the backup data > directly, which is very easy to get wrong.) > Two questions come to my mind on reading this. 1st, are you certain that the database you're restoring to has the same encoding as the source database? I'd have a hard time considering it an error if a dump from a database with SQL_ASCII or some non-unicode encoding failed to restore to a UTF8 encoded database in this manner, for example. And 2nd, does specifying the client encoding when making the backup, with 'pg_dump -E UTF8', produce behavior closer to what you would have expected? -Eric