Re: [BUGS] BUG #5532: Valid UTF8 sequence errors as invalid

2010-07-06 Thread Dimitri Fontaine
Tom Lane writes:
> (BTW, I should think that iconv or some related tool would have a
> solution for fixing this miscoding; it's not an uncommon problem.)

I guess recode is handling that.

http://recode.progiciels-bpi.ca/manual/Universal.html#Universal

Regards,
-- dim

Re: [BUGS] BUG #5532: Valid UTF8 sequence errors as invalid

2010-06-30 Thread Tom Lane
Mike Lewis writes:
> I've run into a fair amount of unicode errors when trying to copy in log
> files. Would you recommend using bytea or another data type instead of text
> or varchar... or at least copying to a staging table with bytea's and
> filtering out invalid rows when moving it to the ma

Re: [BUGS] BUG #5532: Valid UTF8 sequence errors as invalid

2010-06-30 Thread Mike Lewis
> It is not valid. See http://tools.ietf.org/html/rfc3629 --- a sequence
> beginning with ED must have a second byte in the range 80-9F to be
> legal, and this doesn't. The example you give would decode as U+DF2D,
> ie part of a surrogate pair, which is specifically disallowed in UTF8
> ---
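
The rule quoted above can be checked from Python, which the original report says was used for sanitizing. What follows is only a rough sketch, not the reporter's script. The byte sequence ED BC AD is an assumption: it is what U+DF2D would look like if encoded as three bytes, and the bytes actually reported are not shown in this excerpt. The helper names is_valid_utf8 and decode_3byte are made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_3byte(b: bytes) -> int:
    """Unpack the payload bits of a 3-byte UTF-8 pattern without validating it."""
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


# Assumed bytes: the 3-byte form that U+DF2D would take (not taken from the report).
seq = bytes([0xED, 0xBC, 0xAD])

code_point = decode_3byte(seq)
in_surrogate_range = 0xD800 <= code_point <= 0xDFFF

# RFC 3629: after a lead byte of ED, the second byte must be in 80-9F.
second_byte_ok = 0x80 <= seq[1] <= 0x9F

print(hex(code_point), in_surrogate_range, second_byte_ok, is_valid_utf8(seq))
```

Python's strict decoder rejects the sequence for the same reason given in the reply: the second byte BC falls outside 80-9F, so the payload lands in the surrogate range D800-DFFF. A filter run before COPY could call is_valid_utf8 on each raw line and set aside the lines that fail.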

Re: [BUGS] BUG #5532: Valid UTF8 sequence errors as invalid

2010-06-30 Thread Tom Lane
"Michael Lewis" writes:
> I'm using Python to sanitize my logs from invalid UTF8 characters before
> COPYing them into postgres. I came across this one sequence that seems to
> be valid UTF8 (in the extended range I believe).

It is not valid. See http://tools.ietf.org/html/rfc3629 --- a sequenc

[BUGS] BUG #5532: Valid UTF8 sequence errors as invalid

2010-06-30 Thread Michael Lewis
The following bug has been logged online:

Bug reference: 5532
Logged by: Michael Lewis
Email address: mikelikes...@gmail.com
PostgreSQL version: 9.0 trunk
Operating system: OS X
Description: Valid UTF8 sequence errors as invalid
Details:

I'm using Python to sanitize