[BUGS] 7.3.2 incorrectly counts characters for unicode varchar field
POSTGRESQL BUG REPORT TEMPLATE Your name : Matthew Cooper Your email address : matty (at) cloverworxs (dot) com System Configuration - Architecture (example: Intel Pentium) : Intel Pentium Operating System (example: Linux 2.0.26 ELF) : Redhat 8.0 / 9.0 PostgreSQL version (example: PostgreSQL-7.2.2): PostgreSQL-7.2.2 / 7.3.2 Compiler used (example: gcc 2.95.2) : none Please enter a FULL description of your problem: I have a database with UNICODE encoding set. In it is a table with a varchar(10) column. If I insert 10 western characters into it, it is OK. If I insert 10 chinese characters it says: postgresql value too long for type character varying 10 when using 7.3.2. If I use 7.2.2 it works fine. Please describe a way to repeat the problem. Please try to provide a concise reproducible example, if at all possible: -- createdb -E UNICODE mydb Then in psql... create table mgc (c1 varchar(10)); insert into mgc values('0123456789'); This all works fine. Now I put the following command into a file (say my.sql) which is UTF-8 encoded and the literal is 10 chinese characters. (I don't know if once emailed this command will be readable so you may have to re-create the command by pasting 10 chinese characters into your favourite UTF-8 compatible editor.) insert into mgc values ('åˆ†é’Ÿç»ƒä¹ åˆ†é’Ÿç»ƒä¹ ç»ƒä¹ '); I then run psql -f my.sql and get the error for 7.3.2 but it works for 7.2.2. If you know how this problem might be fixed, list the solution below: - I am guessing it is incorrectly counting the bytes and not the characters. Presumably a workaround is to double the length of the field. ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar field
Attached is the UTF-8 encoded sql file in case it got messed up in the mail transfer. And here it is pasted in directly from the window that was displaying chinese characters. insert into mgc values ('分钟练习分钟练习练习'); Looking at the UTF-8 documentation, 10 chinese characters could be any number of bytes, each character being say 2 or 3 characters. Matty. - Original Message - From: "Tom Lane" <[EMAIL PROTECTED]> To: "Matthew Cooper" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Saturday, September 13, 2003 5:51 PM Subject: Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar field > > insert into mgc values ('Ã¥Ë?â? éâ?TŸç»Æ'ä¹ Ã¥Ë?â? éâ?TŸç»Æ'ä¹ ç»Æ'ä¹ '); > > I don't think this string is correctly unicode-encoded. Anyway "length" > claims it is 30 characters. > > regards, tom lane > mgc.sql Description: Binary data ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar field
Doh! It looks like its time to eat humble pie. It turns out that the guy here who has 7.3.4 and helped me to reproduce the problem did not follow our own installation instructions (that he recently re-worded!) as follows: "createdb -E UNICODE -U DB_USER -P DB_PASSWORD DB_NAME" and did not set the encoding. I, like a good boy, did on my 7.2 installation. The guys I am trying to debug the problem for are in another location and are using 7.3.4 too. Hence I narrowed it down to a version problem. I am asking them to check the encoding on their database too and will post back with huge apologies and thanks for your time when they inevitably confirm that the encoding is SQL_ANSI. Thanks, Matty. - Original Message - From: "Matthew Cooper" <[EMAIL PROTECTED]> To: "Tom Lane" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Monday, September 15, 2003 9:50 AM Subject: Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar field > Attached is the UTF-8 encoded sql file in case it got messed up in the mail > transfer. > > And here it is pasted in directly from the window that was displaying > chinese characters. > > insert into mgc values ('分钟练习分钟练习练习'); > > > Looking at the UTF-8 documentation, 10 chinese characters could be any > number of bytes, each character being say 2 or 3 characters. > > Matty. > - Original Message - > From: "Tom Lane" <[EMAIL PROTECTED]> > To: "Matthew Cooper" <[EMAIL PROTECTED]> > Cc: <[EMAIL PROTECTED]> > Sent: Saturday, September 13, 2003 5:51 PM > Subject: Re: [BUGS] 7.3.2 incorrectly counts characters for unicode varchar > field > > > > > insert into mgc values ('Ã¥Ë?â? éâ?TŸç»Æ'ä¹ Ã¥Ë?â? > éâ?TŸç»Æ'ä¹ ç»Æ'ä¹ '); > > > > I don't think this string is correctly unicode-encoded. Anyway "length" > > claims it is 30 characters. > > > > regards, tom lane > > > ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html