Hello fine PostgreSQL bug-busters,

I'm having a rather annoying problem - a particular string is causing the 
Postgres COPY functionality to lose a byte, causing data corruption in backups 
and transferred data.

First, the environment -

 PostgreSQL 8.4.4 on i386-apple-darwin10.3.0, compiled by GCC 
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5646) (dot 1), 
64-bit

Mac OS X 10.6.4

[ste...@xxx:~]% psql --version
psql (PostgreSQL) 8.4.4
contains support for command-line editing

Now, the setup:
          Name           |       Owner        | Encoding |  Collation  |    
Ctype    |   Access privileges   |  Size   | Tablespace |        Description    
  
baddb                    | xxxxxxx_production | UTF8     | en_US.utf-8 | 
en_US.utf-8 |                       | 207 MB  | pg_default | 

baddb=> create table badtable (a int, b int, c character varying, d character 
varying, e character varying, f character varying[], g text, h character 
varying[],i character varying[], j character varying[], k character varying[], 
l character varying[], m character varying[], n character varying[],o character 
varying, p character varying);

baddb=> \copy badtable from '/tmp/data.copy'
baddb=> \copy badtable to '/tmp/badness.copy'
baddb=> \copy badtable from '/tmp/badness.copy'
ERROR:  invalid byte sequence for encoding "UTF8": 0xcf2c
HINT:  This error can also happen if the byte sequence does not match the 
encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY badtable, line 1

Obviously, this wouldn't be too helpful without the datafile in question:

1       2377510 FOURSQUARE      1403504 Pizza Hut       {}      \N      {}      
{}      {}      {Pizza} {πίτσα,hut,food,ζωγράφου,pizza,eat,zografou}    {}      
\N      \N  \N

Since this is likely to be eaten by various mail clients or lost in 
translation, please find attached a TGZ of the data file in question.

Attachment: data.tgz
Description: Binary data


The hexdump shows exactly what goes wrong:

[ste...@xxx:/tmp]% hexdump -C data.copy
00000000  31 09 32 33 37 37 35 31  30 09 46 4f 55 52 53 51  |1.2377510.FOURSQ|
00000010  55 41 52 45 09 31 34 30  33 35 30 34 09 50 69 7a  |UARE.1403504.Piz|
00000020  7a 61 20 48 75 74 09 7b  7d 09 5c 4e 09 7b 7d 09  |za Hut.{}.\N.{}.|
00000030  7b 7d 09 7b 7d 09 7b 50  69 7a 7a 61 7d 09 7b cf  |{}.{}.{Pizza}.{.|
00000040  80 ce af cf 84 cf 83 ce  b1 2c 68 75 74 2c 66 6f  |.........,hut,fo|
00000050  6f 64 2c ce b6 cf 89 ce  b3 cf 81 ce ac cf 86 ce  |od,.............|
00000060  bf cf 85 2c 70 69 7a 7a  61 2c 65 61 74 2c 7a 6f  |...,pizza,eat,zo|
00000070  67 72 61 66 6f 75 7d 09  7b 7d 09 5c 4e 09 5c 4e  |grafou}.{}.\N.\N|
00000080  09 5c 4e 0a                                       |.\N.|
00000084
[ste...@xxx:/tmp]% hexdump -C badness.out
00000000  31 09 32 33 37 37 35 31  30 09 46 4f 55 52 53 51  |1.2377510.FOURSQ|
00000010  55 41 52 45 09 31 34 30  33 35 30 34 09 50 69 7a  |UARE.1403504.Piz|
00000020  7a 61 20 48 75 74 09 7b  7d 09 5c 4e 09 7b 7d 09  |za Hut.{}.\N.{}.|
00000030  7b 7d 09 7b 7d 09 7b 50  69 7a 7a 61 7d 09 7b cf  |{}.{}.{Pizza}.{.|
00000040  80 ce af cf 84 cf 83 ce  b1 2c 68 75 74 2c 66 6f  |.........,hut,fo|
00000050  6f 64 2c ce b6 cf 89 ce  b3 cf 81 ce ac cf 86 ce  |od,.............|
00000060  bf cf 2c 70 69 7a 7a 61  2c 65 61 74 2c 7a 6f 67  |..,pizza,eat,zog|
00000070  72 61 66 6f 75 7d 09 7b  7d 09 5c 4e 09 5c 4e 09  |rafou}.{}.\N.\N.|
00000080  5c 4e 0a                                          |\N.|


Note offset 0x62:
00000060  bf cf 85 2c 70 69 7a 7a  61 2c 65 61 74 2c 7a 6f  |...,pizza,eat,zo|
00000060  bf cf 2c 70 69 7a 7a 61  2c 65 61 74 2c 7a 6f 67  |..,pizza,eat,zog|

The 0xCF85 multibyte UTF-8 character was truncated to 0xCF!  When I try to 
reimport it, it goes in as CF2C (the 2C is the following comma) which is not 
valid and matches the error printed by the client.

I can reproduce this problem using the JDBC driver as well, so I do not think 
it is a PSQL bug but instead a PostgreSQL backend problem.  I could be wrong, 
of course.

Interestingly enough, this problem *does not* happen on a different machine - 
 PostgreSQL 8.4.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 
4.4.3-4ubuntu5) 4.4.3, 64-bit
(stock Ubuntu package)

I can reliably reproduce it on two different Macs, though.

As with all random email posters, I'm not actually subscribed to the list - 
please keep me on the CC for replies.

I would very much appreciate help tracking this down!  Thanks for your time :)

Steven Schlansker

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Reply via email to