Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-11-01 Thread Paul Lindner
On Sun, Oct 30, 2005 at 11:49:41AM -0500, Gregory Maxwell wrote:
> On 10/26/05, Christopher Kings-Lynne <[EMAIL PROTECTED]> wrote:
> > > iconv -c -f UTF8 -t UTF8
> > recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql
>
> I've got a file with characters that pg won't accept that recode does
> not f

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-30 Thread Gregory Maxwell
On 10/26/05, Christopher Kings-Lynne <[EMAIL PROTECTED]> wrote:
> > iconv -c -f UTF8 -t UTF8
> recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

I've got a file with characters that pg won't accept that recode does not fix but iconv does. Iconv is fine for my application, so I'm just posting to t

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-27 Thread Andrew - Supernews
On 2005-10-27, Paul Lindner <[EMAIL PROTECTED]> wrote:
> On Mon, Oct 24, 2005 at 05:07:40AM -, Andrew - Supernews wrote:
>> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
>> was never actually a valid utf-8 string, and that the d2 b9 is only valid
>> by coincidence (it'
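
To make the byte-level claim concrete, here is a minimal sketch that walks the quoted sequence and reports which parts a strict UTF-8 decoder accepts. It assumes Python's built-in decoder as a stand-in; that is not the validation code in PostgreSQL 8.1, and the helper name split_valid_invalid is made up for illustration.

```python
# Sketch only: Python's UTF-8 decoder is an assumption here, not the
# validation routine used by PostgreSQL 8.1.
data = bytes.fromhex("c1 f9 d4 c2 d0 c7 d2 b9")


def split_valid_invalid(buf: bytes):
    """Return a list of (offset, hex_bytes, decoded_text_or_None) pieces."""
    out = []
    pos = 0
    while pos < len(buf):
        try:
            # If the rest decodes cleanly, record it and stop.
            text = buf[pos:].decode("utf-8")
            out.append((pos, buf[pos:].hex(), text))
            break
        except UnicodeDecodeError as err:
            if err.start > 0:
                # Bytes before the error position are valid UTF-8.
                good = buf[pos:pos + err.start]
                out.append((pos, good.hex(), good.decode("utf-8")))
            # Make sure the loop always advances by at least one byte.
            bad_end = max(err.end, err.start + 1)
            bad = buf[pos + err.start:pos + bad_end]
            out.append((pos + err.start, bad.hex(), None))
            pos += bad_end
    return out


for offset, hexbytes, text in split_valid_invalid(data):
    print(offset, hexbytes, "invalid" if text is None else repr(text))
```

Under that decoder, each of the first six bytes is rejected on its own and only the final pair d2 b9 decodes, to the single code point U+04B9.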

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-26 Thread jtv
Andrej Ricnik-Bay wrote:
> How about an ugly kludge ...
>
> split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
> for i in `ls -1 dumpfile*`; do iconv -c -f UTF8 -t UTF8 $i;done
> cat dumpfile* > new_dump

Not with UTF-8... You might break in the middle of a multibyte character.

Jeroen
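
A small sketch of the failure Jeroen describes, using Python's decoder as an assumed stand-in for a per-file iconv -c pass. The sample string and the cut position are invented for the demonstration; only the general behaviour (a valid character lost when its bytes land in different chunks) is the point.

```python
import codecs

# "é" is two bytes in UTF-8 (c3 a9). Cut the stream between them,
# as a fixed-size byte split could.
data = "caf\u00e9 ok".encode("utf-8")
chunks = [data[:4], data[4:]]

# Cleaning each chunk independently drops both halves of the split character,
# even though the original input was perfectly valid.
per_chunk = "".join(c.decode("utf-8", errors="ignore") for c in chunks)

# An incremental decoder keeps an incomplete trailing sequence buffered
# and completes it with the start of the next chunk.
dec = codecs.getincrementaldecoder("utf-8")(errors="ignore")
streamed = "".join(dec.decode(c) for c in chunks) + dec.decode(b"", final=True)

print(repr(per_chunk))
print(repr(streamed))
```

The per-chunk result loses the accented character, while the incremental decoder returns the original text unchanged.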

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-26 Thread Christopher Kings-Lynne
However I'm running into another problem now. The command: iconv -c -f UTF8 -t UTF8 does strip out the invalid characters. However, iconv reads the entire file into memory before it writes out any data. This is not so good for multi-gigabyte dump files and doesn't allow for it to be used
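
One possible workaround for the memory problem, sketched under assumptions: a filter that reads fixed-size chunks and drops bytes that are not valid UTF-8, so memory use stays bounded and it can sit in a pipe. It relies on Python's incremental decoder, which may accept or reject slightly different sequences than iconv -c or the 8.1 server; the function name clean_utf8_stream and the chunk size are illustrative choices, not anything from this thread.

```python
import codecs
import io


def clean_utf8_stream(src, dst, chunk_size=1 << 20):
    """Copy binary stream src to dst, dropping bytes that are not valid UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # A sequence split across two chunks is held back until it is complete.
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # final=True discards an incomplete sequence left at the end of input.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))


# Small in-memory demonstration; the chunk size is chosen so that the
# valid pair d2 b9 is split across two reads.
source = io.BytesIO(b"ab\xc1\xf9cd\xd2\xb9")
sink = io.BytesIO()
clean_utf8_stream(source, sink, chunk_size=7)
print(sink.getvalue())
```

To use it between pg_dump and psql, the same function could be called with sys.stdin.buffer and sys.stdout.buffer from a small script placed in the middle of the pipe.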

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-26 Thread Andrej Ricnik-Bay
> does strip out the invalid characters. However, iconv reads the > entire file into memory before it writes out any data. This is not so > good for multi-gigabyte dump files and doesn't allow for it to be used > in a pipe between pg_dump and psql. > > Anyone have any other recommendations? GNU

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-26 Thread Paul Lindner
On Mon, Oct 24, 2005 at 05:07:40AM -, Andrew - Supernews wrote:
>
> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
> was never actually a valid utf-8 string, and that the d2 b9 is only valid
> by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0
>

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-23 Thread Andrew - Supernews
On 2005-10-24, Paul Lindner <[EMAIL PROTECTED]> wrote:
> Here's a cut and paste from emacs hexl-mode:
>
> : 3530 3833 6335 3038 330a 3c20 5641 4c55  5083c5083.< VALU
> 0010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4  ES (202553, '...
> 0020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056  ..

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-23 Thread Christopher Kings-Lynne
Thanks go out to John Hansen, he recommended to run the dump through iconv:

iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql

This seems to strip out invalid UTF8 and will allow for a clean import. Someone should add this to the Release Notes/FAQ..

Yes I think that's extremely important to put

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-23 Thread Paul Lindner
On Sun, Oct 23, 2005 at 05:56:50AM -, Andrew - Supernews wrote:
> On 2005-10-22, Paul Lindner <[EMAIL PROTECTED]> wrote:
> > I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to
> > restore these results in
> >
> > Invalid UNICODE byte sequence detected near byte ...
>
> What w

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-22 Thread Andrew - Supernews
On 2005-10-22, Paul Lindner <[EMAIL PROTECTED]> wrote:
> I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to
> restore these results in
>
> Invalid UNICODE byte sequence detected near byte ...

What were the exact offending bytes?

> Question:
>
> Does the 8.1 Unicode sanity code a
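
For anyone trying to answer the "exact offending bytes" question on their own dump, a rough sketch follows. It assumes Python's decoder flags the same position the server complains about, which is not guaranteed, and it works on bytes already in memory, so a large dump would need to be read in slices. The helper name first_invalid_utf8 and the sample line are illustrative only.

```python
def first_invalid_utf8(data: bytes, context: int = 8):
    """Return (offset, offending_hex, surrounding_hex), or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the bytes the decoder rejected.
        lo = max(0, err.start - context)
        hi = min(len(data), err.end + context)
        return err.start, data[err.start:err.end].hex(), data[lo:hi].hex(" ")


# Sample input modelled on a dump line containing a suspect byte sequence.
sample = b"VALUES (202553, '\xc1\xf9\xd4\xc2\xd0\xc7\xd2\xb9', "
print(first_invalid_utf8(sample))
```

The surrounding hex makes it easier to post the bytes to the list without pasting raw data that a mail client might re-encode.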