Re: [GENERAL] finding bogus UTF-8

2011-02-16 Thread Vick Khera
On Tue, Feb 15, 2011 at 5:06 PM, Geoffrey Myers wrote:
> I toyed with tr for a bit, but could not get it to work. The above did not
> work for me either. Not exactly sure what it's doing, but here's a couple
> of diff lines:

Check your shell escaping. You may need \\ to protect the \

-- Sent

Re: [GENERAL] finding bogus UTF-8

2011-02-15 Thread Geoffrey Myers
Vick Khera wrote:
On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers wrote:
comments would be appreciated.

If all you're doing is filtering stdin to stdout and deleting a range of characters, it seems that tr would be a faster tool:

cat foo.txt | tr -d '\000-\008\013-\037\177-\377' > foo-clea

Re: [GENERAL] finding bogus UTF-8

2011-02-15 Thread Vick Khera
On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers wrote:
> comments would be appreciated.

If all you're doing is filtering stdin to stdout and deleting a range of characters, it seems that tr would be a faster tool:

cat foo.txt | tr -d '\000-\008\013-\037\177-\377' > foo-cleaned.txt
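
For readers who cannot get the tr invocation to behave, here is a minimal Python sketch of the same byte-deletion filter. It is not from the thread. The names `DELETE`, `clean`, and `filter_stream` are made up for illustration, and the sketch assumes the posted `\008` was meant as octal `\010`, since 8 is not an octal digit.

```python
from typing import BinaryIO

# Bytes to delete, mirroring the posted tr ranges:
#   \000-\010  (assumed reading of the posted "\000-\008")
#   \013-\037  (remaining control bytes, including carriage return)
#   \177-\377  (DEL and every byte with the high bit set)
# Tab (\011) and newline (\012) are kept.
DELETE = bytes(
    b for b in range(256)
    if b <= 0o010 or 0o013 <= b <= 0o037 or b >= 0o177
)


def clean(data: bytes) -> bytes:
    """Return data with every byte listed in DELETE removed."""
    return data.translate(None, DELETE)


def filter_stream(src: BinaryIO, dst: BinaryIO) -> None:
    """Copy src to dst line by line, dropping the unwanted bytes."""
    for line in src:
        dst.write(clean(line))
```

Calling `filter_stream(sys.stdin.buffer, sys.stdout.buffer)` from a small script would give a stdin-to-stdout filter. Note that this, like the tr ranges, also removes every byte of a valid multibyte UTF-8 sequence, so it only suits data that is supposed to be plain ASCII.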

Re: [GENERAL] finding bogus UTF-8

2011-02-15 Thread Marko Kreen
On Thu, Feb 10, 2011 at 9:02 PM, Scott Ribe wrote:
> I know that I have at least one instance of a varchar that is not valid
> UTF-8, imported from a source with errors (AMA CPT files, actually) before
> PG's checking was as stringent as it is today. Can anybody suggest a query to
> find such v

Re: [GENERAL] finding bogus UTF-8

2011-02-15 Thread Geoffrey Myers
Glenn Maynard wrote:
On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe wrote:
I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as string

Re: [GENERAL] finding bogus UTF-8

2011-02-10 Thread Glenn Maynard
On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe wrote:
> I know that I have at least one instance of a varchar that is not valid
> UTF-8, imported from a source with errors (AMA CPT files, actually) before
> PG's checking was as stringent as it is today. Can anybody suggest a query
> to find such valu

Re: [GENERAL] finding bogus UTF-8

2011-02-10 Thread dennis jenkins
> If you are interested, I can email to you the C and Perl source.
>
> It runs like this:
>
> # time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester | ./bad-ascii-report.pl > unclean-ascii.rpt

http://www.ecoligames.com/~djenkins/pgsql/

Disclaimer: I offer NO warranty. Use at your

Re: [GENERAL] finding bogus UTF-8

2011-02-10 Thread dennis jenkins
On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe wrote:
> I know that I have at least one instance of a varchar that is not valid
> UTF-8, imported from a source with errors (AMA CPT files, actually) before
> PG's checking was as stringent as it is today. Can anybody suggest a query to
> find such v

Re: [GENERAL] finding bogus UTF-8

2011-02-10 Thread dennis jenkins
I'm working on a project to convert a large database from SQL_ASCII to UTF-8. I am using this procedure:

1) pg_dump the SQL_ASCII database to an SQL text file.
2) Run through a small (efficient) C program that logs each line that contains ANY "unclean" ASCII text.
3) Parse that log with a small p
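
Step 2 of the procedure above could be sketched roughly as follows. This is not the poster's C program, which is not shown in the thread. The definition of "clean" used here (printable ASCII plus tab, line feed, and carriage return) is an assumption, and `CLEAN` and `unclean_lines` are hypothetical names.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: "clean" means printable ASCII (0x20-0x7E) plus tab, LF and CR.
CLEAN = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def unclean_lines(lines: Iterable[bytes]) -> Iterator[Tuple[int, bytes]]:
    """Yield (line_number, raw_line) for each line holding a byte outside CLEAN."""
    for number, line in enumerate(lines, start=1):
        if any(byte not in CLEAN for byte in line):
            yield number, line
```

A file opened in binary mode iterates over its lines as `bytes`, so passing `open(path, "rb")` for a dump file should work without decoding anything first. Keeping the line number alongside the raw bytes gives a later reporting step something to group on.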

[GENERAL] finding bogus UTF-8

2011-02-10 Thread Scott Ribe
I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values? -- Scott Ribe scott_r...@elevated-dev.com http://w
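
This is not a SQL query, but as a rough client-side sketch of the same check: given the raw bytes of a value (for example, pulled from a dump), a strict UTF-8 decode will fail on invalid input and report where. The function name `first_invalid_utf8` is made up for illustration, and how the bytes are fetched from the database is left open.

```python
from typing import Optional


def first_invalid_utf8(value: bytes) -> Optional[int]:
    """Return the offset where strict UTF-8 decoding fails, or None if value is valid."""
    try:
        value.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte of the offending sequence.
        return exc.start
    return None
```

Running this over each candidate value and keeping those where the result is not `None` would identify the rows to fix by hand.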