On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:

> Steven Schlansker <ste...@trumpet.io> writes:
>> I'm having a rather annoying problem - a particular string is causing the
>> Postgres COPY functionality to lose a byte, causing data corruption in
>> backups and transferred data.
>
> I was able to reproduce this on my own Mac.  Some tracing shows that the
> problem is that isspace(0x85) returns true when in locale en_US.utf-8.
> This causes array_in to drop the final byte of the array element string,
> thinking that it's insignificant whitespace.

The 0x85 seems to be the second byte of a multibyte UTF-8 sequence.  I'm not
at all experienced with character encodings, so I could be totally off base,
but isn't it wrong to ever call isspace(0x85), whatever the result may be,
given that the actual character is the two-byte sequence 0xCF 0x85
(U+03C5, GREEK SMALL LETTER UPSILON)?

> I believe that you must
> not have produced the data file data.copy on a Mac, or at least not in
> that locale setting, because array_out should have double-quoted the
> array element given that behavior of isspace().

Correct, it was produced on a Linux machine.  That said, the charset there
was also UTF-8.

>
> Now, it's probably less than sane for isspace() to be behaving that way,
> since in a UTF8-based locale 0x85 can't be regarded as a valid character
> code at all.  But I'm not hopeful about the results of filing a bug with
> Apple, because their UTF8-based locales have a lot of other bu^H^Hdubious
> behaviors too, which they appear not to care much about.

I actually can't reproduce that behavior here:

#include <ctype.h>
#include <stdio.h>

int main()
{
    printf("%d\n", isspace(0x85));
    return 0;
}

[ste...@xxx:~]% gcc -o test test.c
[ste...@xxx:~]% ./test
0
[ste...@xxx:~]% locale
LANG="en_US.utf-8"
LC_COLLATE="en_US.utf-8"
LC_CTYPE="en_US.utf-8"
LC_MESSAGES="en_US.utf-8"
LC_MONETARY="en_US.utf-8"
LC_NUMERIC="en_US.utf-8"
LC_TIME="en_US.utf-8"
LC_ALL=
[ste...@xxx:~]% uname -a
Darwin xxx.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386 i386

Thanks much for your help,
Steven Schlansker

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)