Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-25 Thread Andrew Dunstan
Marko Kreen wrote: On 9/25/09, to...@tuxteam.de wrote: On Thu, Sep 24, 2009 at 09:42:32PM +0300, Peter Eisentraut wrote: > Good idea. This could also check for other invalid things like > byte-order marks in UTF-8. But watch out. Microsoft apps do like to insert a BOM at the beginning

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-25 Thread Marko Kreen
On 9/25/09, to...@tuxteam.de wrote: > On Thu, Sep 24, 2009 at 09:42:32PM +0300, Peter Eisentraut wrote: > > Good idea. This could also check for other invalid things like > > byte-order marks in UTF-8. > > But watch out. Microsoft apps do like to insert a BOM at the beginning > of the text. N
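For reference, the byte-order mark under discussion is the three-byte UTF-8 encoding of U+FEFF (EF BB BF). A minimal sketch of detecting it at the start of a buffer follows; the function name has_utf8_bom is hypothetical and is not taken from PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether a byte buffer starts with the
 * UTF-8 encoding of U+FEFF (the byte-order mark), i.e. EF BB BF.
 */
static bool
has_utf8_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    return len >= 3 &&
           buf[0] == 0xEF &&
           buf[1] == 0xBB &&
           buf[2] == 0xBF;
}
```

A caller could use this to skip or reject a leading BOM; which of the two is appropriate is exactly the question raised in this subthread.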

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-24 Thread tomas
On Thu, Sep 24, 2009 at 09:42:32PM +0300, Peter Eisentraut wrote: > On Wed, 2009-09-23 at 22:46 +0300, Marko Kreen wrote: [...] > Good idea. This could also check for other invalid things like > byte-order marks in UTF-8. But watch out. Microsoft a

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-24 Thread Peter Eisentraut
On Wed, 2009-09-23 at 22:46 +0300, Marko Kreen wrote: > I looked at your code for U& and saw that you allow standalone > second half of the surrogate pair there, although you error > out on first half. Was that deliberate? No. > Perhaps pg_verifymbstr() should be made to check for such values, >
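The surrogate issue raised here can be illustrated with a small sketch that rejects a lone surrogate of either kind and combines a valid pair into one code point. This is not the code in PostgreSQL's scanner; resolve_escapes and the helper names are hypothetical, and the input is assumed to be an array of already-parsed escape values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First (high) half of a UTF-16 surrogate pair: U+D800..U+DBFF. */
static bool
is_first_surrogate(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* Second (low) half of a UTF-16 surrogate pair: U+DC00..U+DFFF. */
static bool
is_second_surrogate(uint32_t c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

/* Combine a valid pair into a code point in U+10000..U+10FFFF. */
static uint32_t
combine_surrogates(uint32_t first, uint32_t second)
{
    return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
}

/*
 * Hypothetical helper: copy escape values from in[] to out[], merging
 * surrogate pairs. Returns the number of code points written, or -1 if
 * a first half is not followed by a second half, or a second half
 * appears on its own. out[] must have room for n entries.
 */
static int
resolve_escapes(const uint32_t *in, size_t n, uint32_t *out)
{
    size_t i = 0;
    int count = 0;

    assert(in != NULL && out != NULL);
    while (i < n)
    {
        uint32_t c = in[i];

        if (is_first_surrogate(c))
        {
            if (i + 1 >= n || !is_second_surrogate(in[i + 1]))
                return -1;      /* lone first half */
            out[count++] = combine_surrogates(c, in[i + 1]);
            i += 2;
        }
        else if (is_second_surrogate(c))
            return -1;          /* lone second half */
        else
        {
            out[count++] = c;
            i++;
        }
    }
    return count;
}
```

With this logic, the pair D83D DE00 resolves to U+1F600, while DE00 on its own is rejected just like a standalone D83D.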

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-23 Thread Marko Kreen
On 9/23/09, Peter Eisentraut wrote: > On Wed, 2009-09-09 at 18:26 +0300, Marko Kreen wrote: > > Unicode escapes for extended strings. > > Committed. Thank you for handling the patch. I looked at your code for U& and saw that you allow standalone second half of the surrogate pair there, although

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-22 Thread Peter Eisentraut
On Wed, 2009-09-09 at 18:26 +0300, Marko Kreen wrote: > Unicode escapes for extended strings. > > On 4/16/09, Marko Kreen wrote: > > Reasons: > > > > - More people are familiar with \u escaping, as it's standard > > in Java/C#/Python, probably more.. > > - U& strings will not work when stdstr

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-21 Thread Peter Eisentraut
On Wed, 2009-09-09 at 18:26 +0300, Marko Kreen wrote: > Unicode escapes for extended strings. > > On 4/16/09, Marko Kreen wrote: > > Reasons: > > > > - More people are familiar with \u escaping, as it's standard > > in Java/C#/Python, probably more.. > > - U& strings will not work when stdstr

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-09 Thread Marko Kreen
Unicode escapes for extended strings. On 4/16/09, Marko Kreen wrote: > Reasons: > > - More people are familiar with \u escaping, as it's standard > in Java/C#/Python, probably more.. > - U& strings will not work when stdstr=off. > > Syntax: > > \u - 16-bit value > \U -
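A rough sketch of how such escapes might be parsed is given below. It assumes the forms documented for PostgreSQL escape strings, \u followed by 4 hex digits and \U followed by 8 hex digits; the function names parse_unicode_escape and hex_value are hypothetical and this is not the lexer code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the value of one hex digit, or -1 if ch is not a hex digit. */
static int
hex_value(char ch)
{
    if (ch >= '0' && ch <= '9')
        return ch - '0';
    if (ch >= 'a' && ch <= 'f')
        return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F')
        return ch - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: parse one escape at the start of the
 * NUL-terminated string s. Assumed forms: backslash, 'u', 4 hex digits
 * or backslash, 'U', 8 hex digits. On success stores the value in *cp
 * and returns the number of characters consumed (6 or 10); returns 0
 * if s does not start with a complete escape.
 */
static size_t
parse_unicode_escape(const char *s, uint32_t *cp)
{
    size_t digits;
    size_t i;
    uint32_t value = 0;

    assert(s != NULL && cp != NULL);
    if (s[0] != '\\')
        return 0;
    if (s[1] == 'u')
        digits = 4;
    else if (s[1] == 'U')
        digits = 8;
    else
        return 0;

    for (i = 0; i < digits; i++)
    {
        int d = hex_value(s[2 + i]);

        if (d < 0)
            return 0;           /* too few digits or a non-hex character */
        value = (value << 4) | (uint32_t) d;
    }
    *cp = value;
    return 2 + digits;
}
```

The sketch only extracts the numeric value; range checks and surrogate handling would still have to happen afterwards.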

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-19 Thread Tom Lane
Marko Kreen writes: > On 4/18/09, Tom Lane wrote: >> The point has come up before, and I kinda thought we *had* changed the >> lexer to reject \000. I see we haven't though. Curiously, this >> does fail: >> >> regression=# select U&'abc\xyz'; >> ERROR: invalid byte sequence for encoding "

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-18 Thread Marko Kreen
On 4/18/09, Tom Lane wrote: > Sam Mason writes: > > On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote: > >> On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote: > >>> Btw, is there any good reason why we don't reject \000, \x00 > >>> in text strings? > >> > >> W

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Kevin Grittner
Tom Lane wrote: > The lexer is *not* allowed to invoke any database operations > (such as pg_conversion lookups) I certainly hope it's not! > so it cannot perform arbitrary encoding conversions. I was more questioning whether we should be looking at character encodings at all at that point

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Marko Kreen
On 4/18/09, Tom Lane wrote: > "Kevin Grittner" writes: > > Andrew Dunstan wrote: > >> ISTM that one of the uses of this is to say "store the character > >> that corresponds to this Unicode code point in whatever the database > >> encoding is" > > > I would think you're right. As long as th

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Tom Lane
Sam Mason writes: > On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote: >> On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote: >>> Btw, is there any good reason why we don't reject \000, \x00 >>> in text strings? >> >> Why forbid nulls in text strings? > As far as I

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Tom Lane
"Kevin Grittner" writes: > Andrew Dunstan wrote: >> ISTM that one of the uses of this is to say "store the character >> that corresponds to this Unicode code point in whatever the database >> encoding is" > I would think you're right. As long as the given character is in the > user's character

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Andrew Dunstan
Marko Kreen wrote: On 4/17/09, Kevin Grittner wrote: Andrew Dunstan wrote: > ISTM that one of the uses of this is to say "store the character > that corresponds to this Unicode code point in whatever the database > encoding is" I would think you're right. As long as the given charact

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Marko Kreen
On 4/17/09, Kevin Grittner wrote: > Andrew Dunstan wrote: > > ISTM that one of the uses of this is to say "store the character > > that corresponds to this Unicode code point in whatever the database > > encoding is" > > I would think you're right. As long as the given character is in the >

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Kevin Grittner
Andrew Dunstan wrote: > ISTM that one of the uses of this is to say "store the character > that corresponds to this Unicode code point in whatever the database > encoding is" I would think you're right. As long as the given character is in the user's character set, we should allow it. Presum

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Andrew Dunstan
Marko Kreen wrote:
+ if (c > 0x7F)
+ {
+     if (GetDatabaseEncoding() != PG_UTF8)
+         yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
+     saw_high_bit = true;
+ }

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Sam Mason
On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote: > On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote: > > Btw, is there any good reason why we don't reject \000, \x00 > > in text strings? > > Why forbid nulls in text strings? As far as I know, PG assumes, like mos

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Martijn van Oosterhout
On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote: > Btw, is there any good reason why we don't reject \000, \x00 > in text strings? Why forbid nulls in text strings? Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > Please line up in a tree and maintain the h

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Marko Kreen
On 4/16/09, Marko Kreen wrote: > It's up to UTF8 validator whether to consider non-characters as error. I checked, and it did not work well, as addunicode() did not set the saw_high_bit variable when outputting UTF8. Attached patch fixes it. Currently it would be a NOP, as pg_verifymbstr() only ch
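The point about saw_high_bit can be sketched as follows: a routine that emits UTF-8 for a code point should also record that a non-ASCII byte was produced, so that a later validation pass knows it has something to check. The function encode_utf8 and its signature are hypothetical; this is not addunicode() from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode code point c as UTF-8 into buf (at least
 * 4 bytes) and set *saw_high_bit when any byte above 0x7F is written.
 * Returns the number of bytes written, or -1 if c is above U+10FFFF.
 * Surrogates and non-characters are not rejected here; that is left to
 * a separate validator.
 */
static int
encode_utf8(uint32_t c, unsigned char *buf, bool *saw_high_bit)
{
    assert(buf != NULL && saw_high_bit != NULL);

    if (c > 0x10FFFF)
        return -1;
    if (c <= 0x7F)
    {
        buf[0] = (unsigned char) c;
        return 1;
    }

    /* Every multi-byte sequence contains bytes with the high bit set. */
    *saw_high_bit = true;

    if (c <= 0x7FF)
    {
        buf[0] = (unsigned char) (0xC0 | (c >> 6));
        buf[1] = (unsigned char) (0x80 | (c & 0x3F));
        return 2;
    }
    if (c <= 0xFFFF)
    {
        buf[0] = (unsigned char) (0xE0 | (c >> 12));
        buf[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
        buf[2] = (unsigned char) (0x80 | (c & 0x3F));
        return 3;
    }
    buf[0] = (unsigned char) (0xF0 | (c >> 18));
    buf[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
    buf[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
    buf[3] = (unsigned char) (0x80 | (c & 0x3F));
    return 4;
}
```

If the flag were left unset for escaped characters, a scanner that only validates strings when the flag is true would skip the check for exactly the input that needs it.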

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Marko Kreen
On 4/16/09, Sam Mason wrote: > On Thu, Apr 16, 2009 at 08:48:58PM +0300, Marko Kreen wrote: > > Seems I'm bad at communicating in english, > > > I hope you're not saying this because of my misunderstandings! > > > > so here is C variant of > > my proposal to bring \u escaping into extended stri

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 03:04:37PM -0400, Andrew Dunstan wrote: > Sam Mason wrote: > >Are you sure that this handling of surrogates is correct? The best > >answer I've managed to find on the Unicode consortium's site is: > > > > http://unicode.org/faq/utf_bom.html#utf16-7 > > > >it says: > > > >

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Andrew Dunstan
Sam Mason wrote: Are you sure that this handling of surrogates is correct? The best answer I've managed to find on the Unicode consortium's site is: http://unicode.org/faq/utf_bom.html#utf16-7 it says: They are invalid in interchange, but may be freely used internal to an implementati

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 08:48:58PM +0300, Marko Kreen wrote: > Seems I'm bad at communicating in english, I hope you're not saying this because of my misunderstandings! > so here is C variant of > my proposal to bring \u escaping into extended strings. Reasons: > > - More people are familiar wi

[HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Marko Kreen
Seems I'm bad at communicating in english, so here is C variant of my proposal to bring \u escaping into extended strings. Reasons: - More people are familiar with \u escaping, as it's standard in Java/C#/Python, probably more.. - U& strings will not work when stdstr=off. Syntax: \u