Steven Schlansker <ste...@trumpet.io> writes: > I'm having a rather annoying problem - a particular string is causing the > Postgres COPY functionality to lose a byte, causing data corruption in > backups and transferred data.
I was able to reproduce this on my own Mac. Some tracing shows that the problem is that isspace(0x85) returns true when in locale en_US.utf-8. This causes array_in to drop the final byte of the array element string, thinking that it's insignificant whitespace. I believe that you must not have produced the data file data.copy on a Mac, or at least not in that locale setting, because array_out should have double-quoted the array element given that behavior of isspace(). Now, it's probably less than sane for isspace() to be behaving that way, since in a UTF8-based locale 0x85 can't be regarded as a valid character code at all. But I'm not hopeful about the results of filing a bug with Apple, because their UTF8-based locales have a lot of other bu^H^Hdubious behaviors too, which they appear not to care much about. In any case, stepping back and taking a larger viewpoint, it seems unsafe for array_in/array_out to have any locale-sensitive behavior anyhow. As an example, in a LATINx locale it is entirely sane for isspace() to return true for 0xA0, while it should certainly not do so in C locale. This means we are at risk of data corruption, ie dropping a valid data character, when an array value starting or ending with 0xA0 is dumped from a C-locale database and loaded into a LATINx-locale one. So it seems like the safest answer is to modify array_in/array_out to use an ASCII-only definition of isspace(). I believe this is traditionally defined as space, tab, CR, LF, VT, FF. We could perhaps trim that further, like just space and tab, but there might be some risk of breaking client code that expects the other traditional whitespace to be ignored. I'm not sure if there are any other places with similar risks. hstore's I/O routines contain isspace calls, but I haven't analyzed the implications. There is an isspace call in record_out but it is just there for cosmetic purposes and doesn't protect any decisions in record_in, so I think it's okay if it makes platform/locale-dependent choices. Comments? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers