Re: [HACKERS] Patch for collation using ICU

2005-03-24 Thread John Hansen
Useful if it's going to support earlier releases of ICU

Not all os's come with ICU3.2, debian for example, currently has 2.1 in
testing, and 2.6 in unstable.

... John 

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Palle Girgensohn
> Sent: Friday, March 25, 2005 10:40 AM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] Patch for collation using ICU
> 
> Hi!
> 
> I've put together a patch for using IBM's ICU package for collation.
> 
> If your OS does not have full support for collation ur 
> uppercase/lowercase in multibyte locales, this might be 
> useful. If you are using a multibyte character encoding in 
> your database and want collation, i.e. order by, and also 
> lower(), upper() and initcap() to work properly, this patch 
> will do just that.
> 
> This patch is needed for FreeBSD, since this OS has no 
> support for collation of for example unicode locales (that 
> is, wcscoll(3) does not do what you expect if you set 
> LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* 
> necessary for Linux, although IBM claims ICU collation to be 
> about twice as fast as glibc for simple western locales.
> 
> It adds a configure switch, `--with-icu', which will set up 
> the code to use ICU instead of wchar_t and wcscoll.
> 
> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, 
> where it seems to run well. I've not had the time to do any 
> comparative performance tests yet, but it seems it is at 
> least not slower than using LATIN1 with
> sv_SE.ISO8859-1 locale, perhaps even faster.
> 
> I'd be delighted if some more experienced postgresql hackers 
> would review this stuff. The patch is pretty compact, so it's 
> fast reading :)  I'm planning to add this patch as an option 
> (tagged "experimental") to FreeBSD's postgresql port. Any 
> ideas about whether this is a good idea or not?
> 
> Any thoughts or ideas are welcome!
> 
> Cheers,
> Palle
> 
> Patch at:
> 
> 
> ICU at sourceforge: 
> 
> 
> ---(end of 
> broadcast)---
> TIP 7: don't forget to increase your free space map settings
> 
> 

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Patch for collation using ICU

2005-03-25 Thread John Hansen
> --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen 
> <[EMAIL PROTECTED]>
> wrote:
> 
> > Useful if it's going to support earlier releases of ICU
> >
> > Not all os's come with ICU3.2, debian for example, 
> currently has 2.1 
> > in testing, and 2.6 in unstable.
> 
> Oh, OK. FreeBSD has only the 3.2 as port. I can check the 
> older version, I doubt it would too much difference. Some 
> autoconf sorcery needed, perhaps.

Naww, it's no biggie, we'll just need to include ICU with pg I think.
I tried that, there are several functions from ICU that you use, that
are not in ICU2.1

Dono about 2.6.

However, ICU3.2 compiles on debian with a small change to the
debian/rules file.
debian/tmp/etc is missing, so add mkdir debian/tmp/etc

... John

> 
> /Palle
> 
> >
> > ... John
> >
> >> -Original Message-
> >> From: [EMAIL PROTECTED]
> >> [mailto:[EMAIL PROTECTED] On Behalf Of Palle 
> >> Girgensohn
> >> Sent: Friday, March 25, 2005 10:40 AM
> >> To: pgsql-hackers@postgresql.org
> >> Subject: [HACKERS] Patch for collation using ICU
> >>
> >> Hi!
> >>
> >> I've put together a patch for using IBM's ICU package for 
> collation.
> >>
> >> If your OS does not have full support for collation ur 
> >> uppercase/lowercase in multibyte locales, this might be useful. If 
> >> you are using a multibyte character encoding in your database and 
> >> want collation, i.e. order by, and also lower(), upper() and 
> >> initcap() to work properly, this patch will do just that.
> >>
> >> This patch is needed for FreeBSD, since this OS has no support for 
> >> collation of for example unicode locales (that is, wcscoll(3) does 
> >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for 
> example). 
> >> AFAIK the patch is *not* necessary for Linux, although IBM 
> claims ICU 
> >> collation to be about twice as fast as glibc for simple western 
> >> locales.
> >>
> >> It adds a configure switch, `--with-icu', which will set 
> up the code 
> >> to use ICU instead of wchar_t and wcscoll.
> >>
> >> This has been tested only on FreeBSD-4.11 & 
> FreeBSD-5-stable, where 
> >> it seems to run well. I've not had the time to do any comparative 
> >> performance tests yet, but it seems it is at least not slower than 
> >> using LATIN1 with
> >> sv_SE.ISO8859-1 locale, perhaps even faster.
> >>
> >> I'd be delighted if some more experienced postgresql hackers would 
> >> review this stuff. The patch is pretty compact, so it's 
> fast reading 
> >> :)  I'm planning to add this patch as an option (tagged 
> >> "experimental") to FreeBSD's postgresql port. Any ideas 
> about whether 
> >> this is a good idea or not?
> >>
> >> Any thoughts or ideas are welcome!
> >>
> >> Cheers,
> >> Palle
> >>
> >> Patch at:
> >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
> > 005-03-14.diff>
> >>
> >> ICU at sourceforge: <http://icu.sf.net/>
> >>
> >>
> >> ---(end of
> >> broadcast)---
> >> TIP 7: don't forget to increase your free space map settings
> >>
> >>
> 
> 
> 
> 
> 
> 

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Patch for collation using ICU

2005-03-25 Thread John Hansen
Ok,.. tested on debian sarge with ICU 3.2 
UNICODE Database, C locale.

upper() and lower() returns an empty string for any input, including
7bit ascii, regardless of client_encoding, so something is obviously
broken.

Have you tested this patch on a UNICODE DB with locale C/POSIX ?

... John

> -Original Message-
> From: John Hansen 
> Sent: Friday, March 25, 2005 10:27 PM
> To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org'
> Subject: RE: [HACKERS] Patch for collation using ICU
> 
> > --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen 
> > <[EMAIL PROTECTED]>
> > wrote:
> > 
> > > Useful if it's going to support earlier releases of ICU
> > >
> > > Not all os's come with ICU3.2, debian for example,
> > currently has 2.1
> > > in testing, and 2.6 in unstable.
> > 
> > Oh, OK. FreeBSD has only the 3.2 as port. I can check the older 
> > version, I doubt it would too much difference. Some 
> autoconf sorcery 
> > needed, perhaps.
> 
> Naww, it's no biggie, we'll just need to include ICU with pg I think.
> I tried that, there are several functions from ICU that you 
> use, that are not in ICU2.1
> 
> Dono about 2.6.
> 
> However, ICU3.2 compiles on debian with a small change to the 
> debian/rules file.
> debian/tmp/etc is missing, so add mkdir debian/tmp/etc
> 
> ... John
> 
> > 
> > /Palle
> > 
> > >
> > > ... John
> > >
> > >> -Original Message-
> > >> From: [EMAIL PROTECTED]
> > >> [mailto:[EMAIL PROTECTED] On Behalf Of Palle 
> > >> Girgensohn
> > >> Sent: Friday, March 25, 2005 10:40 AM
> > >> To: pgsql-hackers@postgresql.org
> > >> Subject: [HACKERS] Patch for collation using ICU
> > >>
> > >> Hi!
> > >>
> > >> I've put together a patch for using IBM's ICU package for
> > collation.
> > >>
> > >> If your OS does not have full support for collation ur 
> > >> uppercase/lowercase in multibyte locales, this might be 
> useful. If 
> > >> you are using a multibyte character encoding in your 
> database and 
> > >> want collation, i.e. order by, and also lower(), upper() and
> > >> initcap() to work properly, this patch will do just that.
> > >>
> > >> This patch is needed for FreeBSD, since this OS has no 
> support for 
> > >> collation of for example unicode locales (that is, 
> wcscoll(3) does 
> > >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for
> > example). 
> > >> AFAIK the patch is *not* necessary for Linux, although IBM
> > claims ICU
> > >> collation to be about twice as fast as glibc for simple western 
> > >> locales.
> > >>
> > >> It adds a configure switch, `--with-icu', which will set
> > up the code
> > >> to use ICU instead of wchar_t and wcscoll.
> > >>
> > >> This has been tested only on FreeBSD-4.11 &
> > FreeBSD-5-stable, where
> > >> it seems to run well. I've not had the time to do any 
> comparative 
> > >> performance tests yet, but it seems it is at least not 
> slower than 
> > >> using LATIN1 with
> > >> sv_SE.ISO8859-1 locale, perhaps even faster.
> > >>
> > >> I'd be delighted if some more experienced postgresql 
> hackers would 
> > >> review this stuff. The patch is pretty compact, so it's
> > fast reading
> > >> :)  I'm planning to add this patch as an option (tagged
> > >> "experimental") to FreeBSD's postgresql port. Any ideas
> > about whether
> > >> this is a good idea or not?
> > >>
> > >> Any thoughts or ideas are welcome!
> > >>
> > >> Cheers,
> > >> Palle
> > >>
> > >> Patch at:
> > >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
> > > 005-03-14.diff>
> > >>
> > >> ICU at sourceforge: <http://icu.sf.net/>
> > >>
> > >>
> > >> ---(end of
> > >> broadcast)---
> > >> TIP 7: don't forget to increase your free space map settings
> > >>
> > >>
> > 
> > 
> > 
> > 
> > 
> > 

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-03-25 Thread John Hansen
> --On fredag, mars 25, 2005 23.39.33 +1100 John Hansen 
> <[EMAIL PROTECTED]>
> wrote:
> 
> > Ok,.. tested on debian sarge with ICU 3.2 UNICODE Database, 
> C locale.
> >
> > upper() and lower() returns an empty string for any input, 
> including 
> > 7bit ascii, regardless of client_encoding, so something is 
> obviously 
> > broken.
> >
> > Have you tested this patch on a UNICODE DB with locale C/POSIX ?

FYI, I also found that initdb crashes with error 139 on any locale other
than C/POSIX.

> 
> No, honestly not. Mostly tested it with my needs, sv_SE.UTF-8 
> and UNICODE, and also de_DE.UTF-8.
> 
> How will PostgreSQL react to this combo? A database cluster 
> initdb:ed with locale=C/POSIX, and then a database in UNICODE 
> (really utf-8) representation... hmm... I think I might have 
> made a false assumption that the locale string would contain 
> the character encoding. I do something like encoding = 
> strchr(locale, '.') + 1... That code will be confused by a 'C' 
> locale, indeed. I'll check it out!
> 
> /Palle
> 
> 
> 
> >
> > ... John
> >
> >> -Original Message-
> >> From: John Hansen
> >> Sent: Friday, March 25, 2005 10:27 PM
> >> To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org'
> >> Subject: RE: [HACKERS] Patch for collation using ICU
> >>
> >> > --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen 
> >> > <[EMAIL PROTECTED]>
> >> > wrote:
> >> >
> >> > > Useful if it's going to support earlier releases of ICU
> >> > >
> >> > > Not all os's come with ICU3.2, debian for example,
> >> > currently has 2.1
> >> > > in testing, and 2.6 in unstable.
> >> >
> >> > Oh, OK. FreeBSD has only the 3.2 as port. I can check the older 
> >> > version, I doubt it would too much difference. Some
> >> autoconf sorcery
> >> > needed, perhaps.
> >>
> >> Naww, it's no biggie, we'll just need to include ICU with 
> pg I think.
> >> I tried that, there are several functions from ICU that 
> you use, that 
> >> are not in ICU2.1
> >>
> >> Dono about 2.6.
> >>
> >> However, ICU3.2 compiles on debian with a small change to the 
> >> debian/rules file.
> >> debian/tmp/etc is missing, so add mkdir debian/tmp/etc
> >>
> >> ... John
> >>
> >> >
> >> > /Palle
> >> >
> >> > >
> >> > > ... John
> >> > >
> >> > >> -Original Message-
> >> > >> From: [EMAIL PROTECTED]
> >> > >> [mailto:[EMAIL PROTECTED] On Behalf 
> Of Palle 
> >> > >> Girgensohn
> >> > >> Sent: Friday, March 25, 2005 10:40 AM
> >> > >> To: pgsql-hackers@postgresql.org
> >> > >> Subject: [HACKERS] Patch for collation using ICU
> >> > >>
> >> > >> Hi!
> >> > >>
> >> > >> I've put together a patch for using IBM's ICU package for
> >> > collation.
> >> > >>
> >> > >> If your OS does not have full support for collation ur 
> >> > >> uppercase/lowercase in multibyte locales, this might be
> >> useful. If
> >> > >> you are using a multibyte character encoding in your
> >> database and
> >> > >> want collation, i.e. order by, and also lower(), upper() and
> >> > >> initcap() to work properly, this patch will do just that.
> >> > >>
> >> > >> This patch is needed for FreeBSD, since this OS has no
> >> support for
> >> > >> collation of for example unicode locales (that is,
> >> wcscoll(3) does
> >> > >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for
> >> > example).
> >> > >> AFAIK the patch is *not* necessary for Linux, although IBM
> >> > claims ICU
> >> > >> collation to be about twice as fast as glibc for 
> simple western 
> >> > >> locales.
> >> > >>
> >> > >> It adds a configure switch, `--with-icu', which will set
> >> > up the code
> >> > >> to use ICU instead of wchar_t and wcscoll.
> >> > >>
> >> > >> This has been tested only on FreeBSD-4.11 &
> >> > FreeBSD

Re: [HACKERS] Patch for collation using ICU

2005-03-25 Thread John Hansen
 

> -Original Message-
> From: Palle Girgensohn [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, March 26, 2005 1:10 PM
> To: pgsql-hackers@postgresql.org
> Cc: John Hansen; Andrew Dunstan
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> --On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn 
> <[EMAIL PROTECTED]> wrote:
> 
> > Hi!
> >
> > I've put together a patch for using IBM's ICU package for collation.
> >
> > If your OS does not have full support for collation ur 
> > uppercase/lowercase in multibyte locales, this might be 
> useful. If you 
> > are using a multibyte character encoding in your database and want 
> > collation, i.e. order by, and also lower(), upper() and 
> initcap() to 
> > work properly, this patch will do just that.
> >
> > This patch is needed for FreeBSD, since this OS has no support for 
> > collation of for example unicode locales (that is, 
> wcscoll(3) does not 
> > do what you expect if you set LC_ALL=sv_SE.UTF-8, for 
> example). AFAIK 
> > the patch is *not* necessary for Linux, although IBM claims ICU 
> > collation to be about twice as fast as glibc for simple 
> western locales.
> >
> > It adds a configure switch, `--with-icu', which will set up 
> the code 
> > to use ICU instead of wchar_t and wcscoll.
> >
> > This has been tested only on FreeBSD-4.11 & 
> FreeBSD-5-stable, where it 
> > seems to run well. I've not had the time to do any comparative 
> > performance tests yet, but it seems it is at least not slower than 
> > using
> > LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.
> >
> > I'd be delighted if some more experienced postgresql hackers would 
> > review this stuff. The patch is pretty compact, so it's 
> fast reading 
> > :)  I'm planning to add this patch as an option (tagged 
> > "experimental") to FreeBSD's postgresql port. Any ideas 
> about whether 
> > this is a good idea or not?
> >
> > Any thoughts or ideas are welcome!
> >
> > Cheers,
> > Palle
> >
> > Patch at:
> > 
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-1
> > 4.d
> > iff>
> >
> > ICU at sourceforge: <http://icu.sf.net/>
> 
> 
> Hi!
> 
> There's a new patch to fix some reported problems.
> 
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
005-03-26.diff>
> 
> This version uses the DatabaseEncoding and sets the ICU 
> encoding at the same time. I had to create a conversion table 
> from PostgreSQL's own, somewhat odd and non-standard, names 
> of encodings, into the prefered IANA names. On or two of the 
> more odd ones might be slightly incorrect, hopefully not too 
> far off anyway?
> 
> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
> 
> - ORDER BY is case insensitive when using ICU. This might 
> break the SQL standard (?), but sure is nice :)

This would mean that indexes are also case insensitive right?
Which makes it a Bad Thing(tm).

> - When the database is initialized using the C locale, 
> upper() and lower() normally does not work at all for 
> non-ASCII characters even if the database's encoding is say 
> LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD, 
> and this is probably correct since the locale is still `C', I 
> believe?). The ICU patch changes nothing for the LATIN1 case, 
> since it does not act on single byte encodings, but for the 
> UNICODE representation, it works and does what I expect it 
> to, namely upper() and lower() neatly
> upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') 
> -> 'åäö'. 
> This is a good thing, although I'm surprised that upper/lower 
> is dragged along with the LC_COLLATE fixation at initdb. I 
> never run initdb in the C locale, but only now do I realize 
> how broken that really is if you need to store anything else 
> than English :-)

That is what I would have expected. However, it probably won't work for the 
more exotic cases, like turkish I, which depends on the locale.

> 
> I'd be delighted to get more feedback about this stuff.
> 
> Thanks,
> Palle
> 
> 
> 

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Patch for collation using ICU

2005-03-25 Thread John Hansen
Still doesn't work for me :(
UNICODE DB
C locale

set client_encoding = iso88591;
select upper('æ');
 upper
---

(1 row)

Trying to initdb with en_IN.utf8

/usr/lib/postgresql/bin/initdb -D /var/lib/postgres/data/ -E UNICODE 
--locale=en_IN.utf8
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale en_IN.utf8.

fixing permissions on existing directory /var/lib/postgres/data ... ok
creating directory /var/lib/postgres/data/global ... ok
creating directory /var/lib/postgres/data/pg_xlog ... ok
creating directory /var/lib/postgres/data/pg_xlog/archive_status ... ok
creating directory /var/lib/postgres/data/pg_clog ... ok
creating directory /var/lib/postgres/data/pg_subtrans ... ok
creating directory /var/lib/postgres/data/base ... ok
creating directory /var/lib/postgres/data/base/1 ... ok
creating directory /var/lib/postgres/data/pg_tblspc ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 1000
creating configuration files ... ok
creating template1 database in /var/lib/postgres/data/base/1 ... ok
initializing pg_shadow ... ok
enabling unlimited row size for system tables ... ok
initializing pg_depend ... ok
creating system views ... ok
loading pg_description ... ok
creating conversions ... ok
setting privileges on built-in objects ... ok
creating information schema ... child process exited with exit code 139
initdb: removing contents of data directory "/var/lib/postgres/data"
 
... John


> -Original Message-
> From: Palle Girgensohn [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, March 26, 2005 1:10 PM
> To: pgsql-hackers@postgresql.org
> Cc: John Hansen; Andrew Dunstan
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> --On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn 
> <[EMAIL PROTECTED]> wrote:
> 
> > Hi!
> >
> > I've put together a patch for using IBM's ICU package for collation.
> >
> > If your OS does not have full support for collation ur 
> > uppercase/lowercase in multibyte locales, this might be 
> useful. If you 
> > are using a multibyte character encoding in your database and want 
> > collation, i.e. order by, and also lower(), upper() and 
> initcap() to 
> > work properly, this patch will do just that.
> >
> > This patch is needed for FreeBSD, since this OS has no support for 
> > collation of for example unicode locales (that is, 
> wcscoll(3) does not 
> > do what you expect if you set LC_ALL=sv_SE.UTF-8, for 
> example). AFAIK 
> > the patch is *not* necessary for Linux, although IBM claims ICU 
> > collation to be about twice as fast as glibc for simple 
> western locales.
> >
> > It adds a configure switch, `--with-icu', which will set up 
> the code 
> > to use ICU instead of wchar_t and wcscoll.
> >
> > This has been tested only on FreeBSD-4.11 & 
> FreeBSD-5-stable, where it 
> > seems to run well. I've not had the time to do any comparative 
> > performance tests yet, but it seems it is at least not slower than 
> > using
> > LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.
> >
> > I'd be delighted if some more experienced postgresql hackers would 
> > review this stuff. The patch is pretty compact, so it's 
> fast reading 
> > :)  I'm planning to add this patch as an option (tagged 
> > "experimental") to FreeBSD's postgresql port. Any ideas 
> about whether 
> > this is a good idea or not?
> >
> > Any thoughts or ideas are welcome!
> >
> > Cheers,
> > Palle
> >
> > Patch at:
> > 
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-1
> > 4.d
> > iff>
> >
> > ICU at sourceforge: <http://icu.sf.net/>
> 
> 
> Hi!
> 
> There's a new patch to fix some reported problems.
> 
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
> 005-03-26.diff>
> 
> This version uses the DatabaseEncoding and sets the ICU 
> encoding at the same time. I had to create a conversion table 
> from PostgreSQL's own, somewhat odd and non-standard, names 
> of encodings, into the prefered IANA names. On or two of the 
> more odd ones might be slightly incorrect, hopefully not too 
> far off anyway?
> 
> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
> 
> - ORDER BY is case insensitive when using ICU. This might 
> break the SQL standard (?), but sure is nice :)
> 
> - When the database is initialized using the C locale, 
> upper() and lower() normal

Re: [HACKERS] Unicode problems on IRC

2005-04-09 Thread John Hansen
 

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Bruce Momjian
> Sent: Sunday, April 10, 2005 8:18 AM
> To: Christopher Kings-Lynne
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Unicode problems on IRC
> 
> Christopher Kings-Lynne wrote:
> > Hey guys,
> > 
> > The 'Unicode characters above 0x1' issue keeps rearing its ugly 
> > head in the IRC channel.  I propose that it be fixed, even 
> backported...
> > 
> > This is John Hansen's most recent patch to fix it:
> > 
> > http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
> > 
> > And from what I can tell it was committed, then reverted because it 
> > wasn't a "bug".  It was going to go in for 8.1.
> > 
> > We on the channel are starting to think that it is in fact a bug.  
> > There are are people with legitimately utf-8 encoded XML documents 
> > that they cannot store in PostgreSQL.  Apparently in the 
> distant past, 
> > Unicode was limited to 0x1, but then was extended.
> > 
> > Perhaps we can reopen this case...
> 
> Uh, I thought we fixed this another way, buy not using 
> Unicode-aware functions for upper/lower/initcap when the 
> locale is "C" or "POSIX". 
> That is backpatched to 8.0.X.  Does that not fix the problem reported?

No, as andrew said, what this patch does, is allow values > 0x and
at the same time validates the input to make sure it's valid utf8.


... John
> 
> -- 
>   Bruce Momjian|  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us   |  (610) 359-1001
>   +  If your life is a hard drive, |  13 Roberts Road
>   +  Christ can be your backup.|  Newtown Square, 
> Pennsylvania 19073
> 
> ---(end of 
> broadcast)---
> TIP 9: the planner will ignore your desire to choose an index 
> scan if your
>   joining column's datatypes do not match
> 
> 

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Unicode problems on IRC

2005-04-10 Thread John Hansen
>On 2005-04-10, Tom Lane  wrote:
>> Andrew - Supernews 
writes:
>>> I think you will find that this impression is actually false. Or
that at
>>> the very least, _correct_ verification of UTF-8 sequences will still
>>> catch essentially all cases of non-utf-8 input mislabelled as utf-8
>>> while allowing the full range of Unicode codepoints.
>>
>> Yeah?  Cool.  Does John's proposed patch do it "correctly"?
>>
>> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html
>
>It looks correct to me. The only thing I think that code will let
through
>incorrectly are encoded surrogates; those could be fixed by adding one
line:
>
>  switch (*source) {
>  /* no fall-through in this inner switch */
>  case 0xE0: if (a < 0xA0) return false; break;
>+ case 0xED: if (a > 0x9F) return false; break;
>  case 0xF0: if (a < 0x90) return false; break;
>  case 0xF4: if (a > 0x8F) return false; break;
>

That's right, dono how I missed that one, but looks correct to me, and
is in line with the code in ConvertUTF.c from unicode.org, on which I
based the patch, extended to support 6 byte utf8 characters.

>(Accepting encoded surrogates in utf-8 was always forbidden by most
>specifications that used utf-8, though the Unicode specs originally
were
>not absolute about it (but forbade generating them). Current Unicode
>specifications define those sequences as malformed. Surrogates are the
>code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
>characters 0x1 - 0x10 as two 16-bit values; UTF-8 requires that
>such characters are encoded directly rather than via surrogate pairs.)
>
>-- 
>Andrew, Supernews
>http://www.supernews.com - individual and corporate NNTP services

... John

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[HACKERS] Recursive types?

2005-04-11 Thread John Hansen
Consider the following:

create table foo ( id serial primary key, path text);
alter table foo add ref foo;

   Table "public.foo"
 Column |  Type   |  Modifiers
+-+-
 id | integer | not null default nextval('public.foo_id_seq'::text)
 path   | text|
 ref| foo |
Indexes:
"foo_pkey" PRIMARY KEY, btree (id)


insert into foo (path) VALUES ('/');
insert into foo (path,ref) VALUES ('/foo',ROW(1,'/',NULL));
insert into foo (path,ref) VALUES
('/foo/bar',ROW(2,'/foo',ROW(1,'/',NULL)));

select * from foo;
 id |   path   |ref
+--+---
  1 | /|
  2 | /foo | (1,/,)
  3 | /foo/bar | (2,/foo,"(1,/,)")
(3 rows)

However, it is not possible to create such a type using create type, as
there is no alter type, even tho alter table does the same thing.

Is this a bug or a random feature?

It is also not possible to dump and restore this using pg_dump.

... John

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Call to build-in operator from new operator

2005-04-18 Thread John Hansen
Use the DirecFunctionCall1, DirecFunctionCall2, etc. functions.

... John

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
> Sent: Monday, April 18, 2005 10:40 PM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] Call to build-in operator from new operator
> 
> Greetings,
> 
> I don't really know if this is the correct place to ask this 
> question, if not please direct me to the correct mailing list.
> 
> I'm trying to develop a new operator for PostGreSQL (actually 
> for TelegraphCQ, which is an extension of PSQL). Part of the 
> operator's procedure is the @-operator. So now my question 
> is: How do I call the on_pb function from inside my own function?.
> The on_pb of course takes the argument 'PG_FUNCTION_ARGS' 
> which is defined in fmgr.h as 'FunctionCallInfo fcinfo' which 
> is defined as pointer to 'struct FunctionCallInfoData', so my 
> question boils down to:
> 
> What do I put into this struct to call 'on_pb' with two 
> arguments from the call to my function?
> 
> 
> Further, is there a way to access data in tables in the 
> database other than those given as arguments to the function? And how?
> 
> Sincerely
> Kim Bille
> Department of Computer Science
> Aalborg University
> Denmark
> 
> --
> "Mind are like parachutes --- they only work when open"
> 
> ---(end of 
> broadcast)---
> TIP 6: Have you searched our list archives?
> 
>http://archives.postgresql.org
> 
> 

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Wierd performance issue with 8.1cvs

2005-04-24 Thread John Hansen
> I appreciate getting corrected by people I know in a limited 
> forum. I would not expect it when I do a mistakes here. Can't 
> say it ever has happend although there's often good grounds 
> for it so I have nothing to complain about. :-)

I think you meant to say 'I Can't say it ever has happened...'

:)

... John

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32

2005-04-24 Thread John Hansen
Look at the upper/lower I sent to the list, they should be able to
replace upper/lower for the utf8 encoding (and works independent of
locale)..

... John

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Bruce Momjian
> Sent: Sunday, April 24, 2005 10:35 PM
> To: Tatsuo Ishii
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32
> 
> 
> Where are we on this?  As far as I can tell, we never disabled UTF8 on
> Win32 in our code.  The only thing we did do was to disable 
> UTF8 in pginstaller.  See this FAQ item:
> 
>   
> http://pginstaller.projects.postgresql.org/faq/FAQ_windows.html#2.6
> 
> Is the current setup OK?  Should we allow UTF8 on Win32 for 
> languages that can use C locale, like Asian languages?
> 
> --
> -
> 
> Tatsuo Ishii wrote:
> > I do understand the problem, but don't undertstand the decision you 
> > guys made. The fact that UPPER/LOWER and some other 
> functions does not 
> > work in win32 is surely a problem for some languages, but not a 
> > problem for otheres. For example, Japanese (and probably Chinese and
> > Korean) does not have a concept upper/lower. So the fact 
> UPPER/LOWER 
> > does not work with UTF-8/win32 is not problem for Japanese (and for 
> > some other languages). Just using C locale with UTF-8 is enough in 
> > this case.
> > 
> > In summary, I think you guys are going to overkill the multibyte 
> > support functionality on UTF-8/win32 because of the fact that some 
> > langauges do not work.
> > 
> > Same thing can be said to EUC-JP, EUC-CN and EUC-KR and so 
> on as well.
> > 
> > I strongly object the policy to try to unconditionaly disable UTF-8 
> > support on win32.
> > --
> > Tatsuo Ishii
> > 
> > From: "Magnus Hagander" <[EMAIL PROTECTED]>
> > Subject: RE: [pgsql-hackers-win32] UNICODE/UTF-8 on win32
> > Date: Sat, 1 Jan 2005 14:48:04 +0100
> > Message-ID: 
> > <[EMAIL PROTECTED]>
> > 
> > > UNICODE/UTF-8 does not work on the win32 server. The 
> reason is that
> > > strcoll() and friends don't work with it. To support it 
> on win32, it 
> > > needs to be converted to UTF16 and use the wide-character 
> versions 
> > > of the fucntion. Which we do not do.
> > > (See
> > > 
> http://archives.postgresql.org/pgsql-hackers-win32/2004-11/msg00036.
> > > php
> > > and
> > > 
> http://archives.postgresql.org/pgsql-hackers-win32/2004-12/msg00106.
> > > php)
> > > 
> > > 
> > > I don't *think* we need to disable ito n the client. AFAIK, the 
> > > client interfaces don't use any of these functions, and I've seen 
> > > reports of people using that long before we had a native 
> win32 server.
> > > 
> > > 
> > > //Magnus
> > > 
> > > 
> > > >-Original Message-
> > > >From: Tatsuo Ishii [mailto:[EMAIL PROTECTED]
> > > >Sent: den 1 januari 2005 01:10
> > > >To: [EMAIL PROTECTED]
> > > >Cc: Magnus Hagander; [EMAIL PROTECTED]
> > > >Subject: Re: [pgsql-hackers-win32] UNICODE/UTF-8 on win32
> > > >
> > > >
> > > >Sorry, but I don't subscribe to pgsql-hackers-win32 list. What's 
> > > >the problem here?
> > > >--
> > > >Tatsuo Ishii
> > > >
> > > >> "Magnus Hagander" <[EMAIL PROTECTED]> writes:
> > > >> > We know it's broken and won't be fixed for 8.0.
> > > >> 
> > > >> > If we just #ifndef WIN32 the definitions in
> > > >utils/mb/encnames.c it won't
> > > >> > be possible to select that encoding, right? Will that have
> > > >any other
> > > >> > unwanted effects (such as breaking client encodings)? If
> > > >not, I suggest
> > > >> > this is done.
> > > >> 
> > > >> I believe the subscripts in those arrays have to match the 
> > > >> encoding enum type, so you can't just ifdef out 
> individual entries.
> > > >> 
> > > >> > (Or perhaps something can be done in 
> pg_valid_server_encoding?)
> > > >> 
> > > >> Making the valid_server_encoding function reject it might work.
> > > >> Tatsuo-san would know for sure.
> > > >> 
> > > >> Should we also reject it as a client encoding, or does 
> that work OK?
> > > >> 
> > > >>regards, tom lane
> > > >> 
> > > >
> > > 
> > 
> > ---(end of 
> > broadcast)---
> > TIP 1: subscribe and unsubscribe commands go to 
> > [EMAIL PROTECTED]
> > 
> 
> -- 
>   Bruce Momjian|  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us   |  (610) 359-1001
>   +  If your life is a hard drive, |  13 Roberts Road
>   +  Christ can be your backup.|  Newtown Square, 
> Pennsylvania 19073
> 
> ---(end of 
> broadcast)---
> TIP 9: the planner will ignore your desire to choose an index 
> scan if your
>   joining column's datatypes do not match
> 
> 

---(end of broadcast)---
TIP 2: you can get off

Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32

2005-04-24 Thread John Hansen
Ehmm,... No the upper/lower replacements I sent to -hackers

ICU was not me Tho for win32 you're better off writing wrapper
classes for the win32 native functions. 

> -Original Message-
> From: Bruce Momjian [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, April 24, 2005 10:50 PM
> To: John Hansen
> Cc: Tatsuo Ishii; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32
> 
> John Hansen wrote:
> > Look at the upper/lower I sent to the list, they should be able to 
> > replace upper/lower for the utf8 encoding (and works 
> independent 
> > of locale)..
> 
> You mean ICU?  Yes, it seems like a good approach for 8.1.
> 
> --
> -
> 
> 
> > 
> > ... John
> > 
> > > -Original Message-
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On Behalf Of Bruce 
> > > Momjian
> > > Sent: Sunday, April 24, 2005 10:35 PM
> > > To: Tatsuo Ishii
> > > Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> > > [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] [pgsql-hackers-win32] 
> UNICODE/UTF-8 on win32
> > > 
> > > 
> > > Where are we on this?  As far as I can tell, we never 
> disabled UTF8 
> > > on
> > > Win32 in our code.  The only thing we did do was to disable
> > > UTF8 in pginstaller.  See this FAQ item:
> > > 
> > >   
> > > 
> http://pginstaller.projects.postgresql.org/faq/FAQ_windows.html#2.6
> > > 
> > > Is the current setup OK?  Should we allow UTF8 on Win32 for 
> > > languages that can use C locale, like Asian languages?
> > > 
> > > --
> > > -
> > > 
> > > Tatsuo Ishii wrote:
> > > > I do understand the problem, but don't undertstand the decision 
> > > > you guys made. The fact that UPPER/LOWER and some other
> > > functions does not
> > > > work in win32 is surely a problem for some languages, but not a 
> > > > problem for otheres. For example, Japanese (and 
> probably Chinese 
> > > > and
> > > > Korean) does not have a concept upper/lower. So the fact
> > > UPPER/LOWER
> > > > does not work with UTF-8/win32 is not problem for Japanese (and 
> > > > for some other languages). Just using C locale with UTF-8 is 
> > > > enough in this case.
> > > > 
> > > > In summary, I think you guys are going to overkill the 
> multibyte 
> > > > support functionality on UTF-8/win32 because of the 
> fact that some 
> > > > langauges do not work.
> > > > 
> > > > Same thing can be said to EUC-JP, EUC-CN and EUC-KR and so
> > > on as well.
> > > > 
> > > > I strongly object the policy to try to unconditionaly disable 
> > > > UTF-8 support on win32.
> > > > --
> > > > Tatsuo Ishii
> > > > 
> > > > From: "Magnus Hagander" <[EMAIL PROTECTED]>
> > > > Subject: RE: [pgsql-hackers-win32] UNICODE/UTF-8 on win32
> > > > Date: Sat, 1 Jan 2005 14:48:04 +0100
> > > > Message-ID: 
> > > > <[EMAIL PROTECTED]>
> > > > 
> > > > > UNICODE/UTF-8 does not work on the win32 server. The
> > > reason is that
> > > > > strcoll() and friends don't work with it. To support it
> > > on win32, it
> > > > > needs to be converted to UTF16 and use the wide-character
> > > versions
> > > > > of the fucntion. Which we do not do.
> > > > > (See
> > > > > 
> > > 
> http://archives.postgresql.org/pgsql-hackers-win32/2004-11/msg00036.
> > > > > php
> > > > > and
> > > > > 
> > > 
> http://archives.postgresql.org/pgsql-hackers-win32/2004-12/msg00106.
> > > > > php)
> > > > > 
> > > > > 
> > > > > I don't *think* we need to disable ito n the client. 
> AFAIK, the 
> > > > > client interfaces don't use any of these functions, and I've 
> > > > > seen reports of people using that long before we had a native
> > > win32 server.
> > > > > 
> > > > > 
> > > > > //Magnus
> > > > > 
> > >

Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32

2005-04-24 Thread John Hansen
Right, they were meant as a starting point, but if you can point me to
how I can obtain the current locale, then I can fix them to cover the
remaining 15 special cases.

... John 

> -Original Message-
> From: Tom Lane [mailto:[EMAIL PROTECTED] 
> Sent: Monday, April 25, 2005 2:01 AM
> To: John Hansen
> Cc: Bruce Momjian; Tatsuo Ishii; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32 
> 
> "John Hansen" <[EMAIL PROTECTED]> writes:
> > Look at the upper/lower I sent to the list, they should be able to 
> > replace upper/lower for the utf8 encoding (and works 
> independent 
> > of locale)..
> 
> I was under the impression we couldn't use these, precisely 
> because they weren't locale-aware.  ("It works for most 
> people" isn't good enough.)
> 
> In any case, don't we need a solution that covers sorting 
> (strcoll) as well as upper/lower?
> 
>   regards, tom lane
> 
> 

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] A proper fix for the conversion-function problem

2005-05-03 Thread John Hansen
> Are there any encodings we care about that require embedded zero
bytes?

UTF-8 does!

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] A proper fix for the conversion-function problem

2005-05-03 Thread John Hansen
Errm.. UTF-16/32

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen
> Sent: Wednesday, May 04, 2005 1:22 PM
> To: Tom Lane; Tatsuo Ishii
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] A proper fix for the 
> conversion-function problem 
> 
> > Are there any encodings we care about that require embedded zero
> bytes?
> 
> UTF-8 does!
> 
> ---(end of 
> broadcast)---
> TIP 6: Have you searched our list archives?
> 
>http://archives.postgresql.org
> 
> 

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-06 Thread John Hansen

> Why do you need to add a mapping of encoding names from iana 
> to our names?
> 

The pg encoding names are not recognized by ICU, hence the mappings

Install ICU 3.2 on your system, and run uconv -l, that will give you a
list of valid ICU encoding names.

... John

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-06 Thread John Hansen
Btw,

Does it feel right to have pg depend on the bleeding edge version of
ICU?
On many distro's, even gentoo (known for being bleeding edge) 2.8 is
still the default.


2.8 and 3.2 are however incompatible, and supporting both, would bloat
the source somewhat.

... John


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Where'd you get the licence from?
None of that is in the licence I'm reading!

(http://www-306.ibm.com/software/globalization/icu/index.jsp)
(http://www-306.ibm.com/software/globalization/icu/license.jsp) 

... John

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Saturday, May 07, 2005 3:17 PM
> To: Bruce Momjian
> Cc: Palle Girgensohn; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU 
> 
> Bruce Momjian  writes:
> > Tom Lane wrote:
> >> Not until ICU is released under a BSD license ...
> 
> > Well, readline isn't BSD either, but we use it.  It is any 
> different?
> 
> Did you read the license?  Some of the more troubling bits:
> 
> : It is the understanding of INTERNATIONAL BUSINESS MACHINES 
> CORPORATION
> : that the purpose for which its publications are being reproduced is
> : accurate and true as stated in your attached request.
> 
> (er, which attached request would that be?)
> 
> : Permission to quote from or reprint IBM publications is 
> limited to the
> : purpose and quantities originally requested and must not be 
> construed as
> : a blanket license to use the material for other purposes or 
> to reprint
> : other IBM copyrighted material.
> 
> : IBM reserves the right to withdraw permission to reproduce 
> copyrighted
> : material whenever, in its discretion, it feels that the privilege of
> : reproducing its material is being used in a way detrimental to its
> : interest or the above instructions are not being followed 
> properly to
> : protect its copyright.
> 
> : IBM may have patents or pending patent applications covering subject
> : matter in this document. The furnishing of this document 
> does not give
> : you any license to these patents. You can send license inquiries, in
> : writing, to:
> 
> : For license inquiries regarding double-byte (DBCS) 
> information, contact
> : the IBM Intellectual Property Department in your country or send
> : inquiries, in writing, to:
> 
>   regards, tom lane
> 
> ---(end of 
> broadcast)---
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to 
> [EMAIL PROTECTED])
> 
> 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> 
> I use this patch in production on one FreeBSD 4.10 server at 
> the moment. 
> With the latest version, I've had no problems. Logging is 
> swithed on for 
> now, and it shows no signs of ICU complaining. I'd like more 
> reports on 
> Linux, though.

I currently use this on gentoo with ICU3.2 unmasked.

Works a dream, even with locale C and UNICODE database.

Small test:

createdb --encoding UNICODE --locale C test
psql test
set client_encoding=iso88591;
CREATE TABLE test (t text);
INSERT INTO test (t) VALUES ('æøå');
set client_encoding=unicode;
INSERT INTO test (t) SELECT upper(t) FROM test;
set client_encoding=iso88591;
SELECT * FROM test;
  t
-
 æøå
 ÆØÅ
(2 rows)

Just as I'd expect, as upper/lower/initcap are locale independent for these 
characters.


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Errm,... initdb --encoding UNICODE --locale C

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen
> Sent: Saturday, May 07, 2005 10:23 PM
> To: Palle Girgensohn; Bruce Momjian
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > 
> > I use this patch in production on one FreeBSD 4.10 server at the 
> > moment.
> > With the latest version, I've had no problems. Logging is 
> swithed on 
> > for now, and it shows no signs of ICU complaining. I'd like more 
> > reports on Linux, though.
> 
> I currently use this on gentoo with ICU3.2 unmasked.
> 
> Works a dream, even with locale C and UNICODE database.
> 
> Small test:
> 
> createdb --encoding UNICODE --locale C test psql test set 
> client_encoding=iso88591; CREATE TABLE test (t text); INSERT 
> INTO test (t) VALUES ('æøå'); set client_encoding=unicode; 
> INSERT INTO test (t) SELECT upper(t) FROM test; set 
> client_encoding=iso88591; SELECT * FROM test;
>   t
> -
>  æøå
>  ÆØÅ
> (2 rows)
> 
> Just as I'd expect, as upper/lower/initcap are locale 
> independent for these characters.
> 
> 
> ---(end of 
> broadcast)---
> TIP 5: Have you checked our extensive FAQ?
> 
>http://www.postgresql.org/docs/faq
> 
> 

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Bruce Momjian wrote:
> Palle Girgensohn wrote:
> > >
> > > Is this patch ready for application?
> > 
> > I don't think so, not quite. I have not had any positive 
> reports from 
> > linux users, this is only tested in a FreeBSD environment. 
> I'd say it 
> > needs some more testing.
> 
> OK.
> 
> > Also, apparently, ICU is installed by default in many linux 
> > distributions, and usually it is version 2.8. Some linux users have 
> > asked me if there are plans for a patch that works with ICU 2.8. 
> > That's probably a good idea. IBM and the ICU folks seem to consider 
> > 3.2 to be the stable version, older versions are hard to 
> find on their 
> > sites, but most linux distributers seem to consider it too bleeding 
> > edge, even gentoo. I don't know why they don't agree.
> 
> Good point.  Why would linux folks need ICU?  Doesn't their 
> OS support encodings natively?  I am particularly excited 
> about this for OSs that don't have such encodings, like UTF8 
> support for Win32.
> 
> Because ICU will not be used unless enabled by configure, it 
> seems we are fine with only supporting the newest version.  
> Do Linux users need to use ICU for any reason?

Yes, because on many linux platforms locale support is broken.
Also, ICU enables full unicode support, particularly in multi-language
situations where locale is C, and makes upper/lower/initcap work as
expected, except where it depends on locale information.

There are also many other useful things in ICU that could be
implemented. Transliteration, and break-iterators for example.
Break-iteration particularly interresting for converting a text to a
list of words. Another is it's builtin substring searches.

> 
> > > I do have a few questions:
> > >
> > > Why don't you use the lc_ctype_is_c() part of this test?
> > >
> > >   if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> > 
> > Um, well, I didn't think about that. :)  What would be the 
> locale in 
> > this case? c_C.UTF-8? ;)  Hmm, it is possible to have 
> CTYPE=C and use 
> > a wide encoding, indeed. Then the strings will be handled 
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
> 
> The additional test is more of an optmization, and it fixes a 
> problem with some OSs that have processing problems with UTF8 
> when the locale is supposed to be turned off, like in "C".  I 
> realize ICU might be fine with it but the optimization still 
> is an issue.

That the locale is supposed to be turned off, doesn't mean it shouldn't
use ICU.
ICU is more than just locales.

> > > Why is so much code added, for example, in lower()?  The existing 
> > > multibyte code is much smaller, and lots of code is added 
> in other 
> > > places too.
> > 
> > ICU uses UTF-16 internally, so all strings must be 
> converted from the 
> > database encoding to UTF-16. Since that means the strings 
> need to be 
> > copied, I took the same approach as in 
> varlena.c:varstr_cmp(), where 
> > small strings use the heap and only larger strings use a palloc. 
> > Comments in varstr_cmp about performance made me use that approach.
> 
> Oh, interesting.   I think you need to create new functions that
> factor out that common code so the patch is smaller and 
> easier to maintain.
> 
> > Also, in the latest patch, I also added checks and logging 
> for *every* 
> > status returned from ICU. I hope this will help debugging 
> on debian, 
> > where previous version didn't work. That excessive status 
> checking is 
> > hardly be necessary once the stuff is better tested.
> > 
> > I think the string copying and heap/palloc choices stands 
> for most of 
> > the code bloat, together with the excessive status checking 
> and logging.
> 
> OK, move that into some common functions and I think it will 
> be better.
> 
> > > Why do you need to add a mapping of encoding names from 
> iana to our 
> > > names?
> > 
> > This was already answered by John Hansen... There's an old 
> thread here 
> > about the choice of the name "UNICODE" to describe an 
> encoding, which 
> > it doesn't. There's half a dozen unicode based encodings... 
> UTF-8 is 
> > used by postgresql, that would have been a better name... Similarly 
> > for most other encodings, really. ICU expect a setlocale(3) string 
> > (i.e. IANA). PostgreSQL can't provide it, so a mapping 
> table is required.
> 
> We have depricated UNICODE in 8.1 in

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen 
> <[EMAIL PROTECTED]>
> wrote:
> 
> > Errm,... initdb --encoding UNICODE --locale C
> 
> You mean that ICU *shall* be used even for the C locale, and 
> not as Bruce suggested here:

Yes, that's exactly what I mean.

> 
> >> I do have a few questions:
> >>
> >> Why don't you use the lc_ctype_is_c() part of this test?
> >>
> >>if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> >
> > Um, well, I didn't think about that. :)  What would be the 
> locale in 
> > this case? c_C.UTF-8? ;)  Hmm, it is possible to have 
> CTYPE=C and use 
> > a wide encoding, indeed. Then the strings will be handled 
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
> 
> John disagrees here, and I'm obliged to agree. Using the C 
> locale, one will expect C collation, but upper/lower is 
> better off still using ICU. Hence, the above stuff is *not* a 
> bug. Do we agree?
> 
> /Palle
> 
> 
> >
> >> -Original Message-
> >> From: [EMAIL PROTECTED]
> >> [mailto:[EMAIL PROTECTED] On Behalf Of 
> John Hansen
> >> Sent: Saturday, May 07, 2005 10:23 PM
> >> To: Palle Girgensohn; Bruce Momjian
> >> Cc: pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] Patch for collation using ICU
> >>
> >> >
> >> > I use this patch in production on one FreeBSD 4.10 server at the 
> >> > moment.
> >> > With the latest version, I've had no problems. Logging is
> >> swithed on
> >> > for now, and it shows no signs of ICU complaining. I'd like more 
> >> > reports on Linux, though.
> >>
> >> I currently use this on gentoo with ICU3.2 unmasked.
> >>
> >> Works a dream, even with locale C and UNICODE database.
> >>
> >> Small test:
> >>
> >> createdb --encoding UNICODE --locale C test psql test set 
> >> client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO 
> >> test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO 
> >> test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; 
> >> SELECT * FROM test;
> >>   t
> >> -
> >>  æøå
> >>  ÆØÅ
> >> (2 rows)
> >>
> >> Just as I'd expect, as upper/lower/initcap are locale 
> independent for 
> >> these characters.
> >>
> >>
> >> ---(end of
> >> broadcast)---
> >> TIP 5: Have you checked our extensive FAQ?
> >>
> >>http://www.postgresql.org/docs/faq
> >>
> >>
> 
> 
> 
> 
> 
> 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Btw, I had been planning to propose replacing every single one of the built in 
charset conversion functions with calls to ICU (thus making pg _depend_ on 
ICU), as this would seem like a cleaner solution than for us to maintain our 
own conversion tables.

ICU also has a fair few conversions that we do not have at present.

Any thoughts?

... John

> -Original Message-
> From: John Hansen 
> Sent: Saturday, May 07, 2005 11:09 PM
> To: 'Palle Girgensohn'; 'Bruce Momjian'
> Cc: 'pgsql-hackers@postgresql.org'
> Subject: RE: [HACKERS] Patch for collation using ICU
> 
> > --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen 
> > <[EMAIL PROTECTED]>
> > wrote:
> > 
> > > Errm,... initdb --encoding UNICODE --locale C
> > 
> > You mean that ICU *shall* be used even for the C locale, and not as 
> > Bruce suggested here:
> 
> Yes, that's exactly what I mean.
> 
> > 
> > >> I do have a few questions:
> > >>
> > >> Why don't you use the lc_ctype_is_c() part of this test?
> > >>
> > >>  if (pg_database_encoding_max_length() > 1 && 
> !lc_ctype_is_c())
> > >
> > > Um, well, I didn't think about that. :)  What would be the
> > locale in
> > > this case? c_C.UTF-8? ;)  Hmm, it is possible to have
> > CTYPE=C and use
> > > a wide encoding, indeed. Then the strings will be handled
> > like byte-wide chars.
> > > Yeah, it's a bug. I'll fix it! Thanks.
> > 
> > John disagrees here, and I'm obliged to agree. Using the C 
> locale, one 
> > will expect C collation, but upper/lower is better off still using 
> > ICU. Hence, the above stuff is *not* a bug. Do we agree?
> > 
> > /Palle
> > 
> > 
> > >
> > >> -Original Message-
> > >> From: [EMAIL PROTECTED]
> > >> [mailto:[EMAIL PROTECTED] On Behalf Of
> > John Hansen
> > >> Sent: Saturday, May 07, 2005 10:23 PM
> > >> To: Palle Girgensohn; Bruce Momjian
> > >> Cc: pgsql-hackers@postgresql.org
> > >> Subject: Re: [HACKERS] Patch for collation using ICU
> > >>
> > >> >
> > >> > I use this patch in production on one FreeBSD 4.10 
> server at the 
> > >> > moment.
> > >> > With the latest version, I've had no problems. Logging is
> > >> swithed on
> > >> > for now, and it shows no signs of ICU complaining. I'd 
> like more 
> > >> > reports on Linux, though.
> > >>
> > >> I currently use this on gentoo with ICU3.2 unmasked.
> > >>
> > >> Works a dream, even with locale C and UNICODE database.
> > >>
> > >> Small test:
> > >>
> > >> createdb --encoding UNICODE --locale C test psql test set 
> > >> client_encoding=iso88591; CREATE TABLE test (t text); 
> INSERT INTO 
> > >> test (t) VALUES ('æøå'); set client_encoding=unicode; 
> INSERT INTO 
> > >> test (t) SELECT upper(t) FROM test; set 
> client_encoding=iso88591; 
> > >> SELECT * FROM test;
> > >>   t
> > >> -
> > >>  æøå
> > >>  ÆØÅ
> > >> (2 rows)
> > >>
> > >> Just as I'd expect, as upper/lower/initcap are locale
> > independent for
> > >> these characters.
> > >>
> > >>
> > >> ---(end of
> > >> broadcast)---
> > >> TIP 5: Have you checked our extensive FAQ?
> > >>
> > >>http://www.postgresql.org/docs/faq
> > >>
> > >>
> > 
> > 
> > 
> > 
> > 
> > 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Palle Girgensohn wrote:
> I'm aware of that. It might help for unicode, but there are a 
> bunch of 
> other encodings. IANA has decided that utf-8 has *no* 
> aliases, hence only 
> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is 
> fogiving, I don't remember/know, but I think we need the mappings, 
> unfortunately.
> 

Here is the list of encoding names and aliases the ICU accepts as of
3.2:
(it's a bit long...)

UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
UTF-16BE x-utf-16be ibm-1200 ibm-1201 ibm-5297 ibm-13488 ibm-17584
windows-1201 cp1200 cp1201 UTF16_BigEndian
UTF-16LE x-utf-16le ibm-1202 ibm-13490 ibm-17586 UTF16_LittleEndian
windows-1200
UTF-32 ISO-10646-UCS-4 csUCS4 ucs-4
UTF-32BE UTF32_BigEndian ibm-1232 ibm-1233
UTF-32LE UTF32_LittleEndian ibm-1234
UTF16_PlatformEndian
UTF16_OppositeEndian
UTF32_PlatformEndian
UTF32_OppositeEndian
UTF-7 windows-65000
IMAP-mailbox-name
SCSU
BOCU-1 csBOCU-1
CESU-8
ISO-8859-1 ibm-819 IBM819 cp819 latin1 8859_1 csISOLatin1 iso-ir-100
ISO_8859-1:1987 l1 819
US-ASCII ASCII ANSI_X3.4-1968 ANSI_X3.4-1986 ISO_646.irv:1991
iso_646.irv:1983 ISO646-US us csASCII iso-ir-6 cp367 ascii7 646
windows-20127
gb18030 ibm-1392 windows-54936
ibm-367_P100-1995 ibm-367 IBM367
ibm-912_P100-1995 ibm-912 iso-8859-2 ISO_8859-2:1987 latin2 csISOLatin2
iso-ir-101 l2 8859_2 cp912 912 windows-28592
ibm-913_P100-2000 ibm-913 iso-8859-3 ISO_8859-3:1988 latin3 csISOLatin3
iso-ir-109 l3 8859_3 cp913 913 windows-28593
ibm-914_P100-1995 ibm-914 iso-8859-4 latin4 csISOLatin4 iso-ir-110
ISO_8859-4:1988 l4 8859_4 cp914 914 windows-28594
ibm-915_P100-1995 ibm-915 iso-8859-5 cyrillic csISOLatinCyrillic
iso-ir-144 ISO_8859-5:1988 8859_5 cp915 915 windows-28595
ibm-1089_P100-1995 ibm-1089 iso-8859-6 arabic csISOLatinArabic
iso-ir-127 ISO_8859-6:1987 ECMA-114 ASMO-708 8859_6 cp1089 1089
windows-28596 ISO-8859-6-I ISO-8859-6-E
ibm-813_P100-1995 ibm-813 iso-8859-7 greek greek8 ELOT_928 ECMA-118
csISOLatinGreek iso-ir-126 ISO_8859-7:1987 8859_7 cp813 813
windows-28597
ibm-916_P100-1995 ibm-916 iso-8859-8 hebrew csISOLatinHebrew iso-ir-138
ISO_8859-8:1988 ISO-8859-8-I ISO-8859-8-E 8859_8 cp916 916 windows-28598
ibm-920_P100-1995 ibm-920 iso-8859-9 latin5 csISOLatin5 iso-ir-148
ISO_8859-9:1989 l5 8859_9 cp920 920 windows-28599 ECMA-128
ibm-921_P100-1995 ibm-921 iso-8859-13 8859_13 cp921 921
ibm-923_P100-1998 ibm-923 iso-8859-15 Latin-9 l9 8859_15 latin0
csisolatin0 csisolatin9 iso8859_15_fdis cp923 923 windows-28605
ibm-942_P12A-1999 ibm-942 ibm-932 cp932 shift_jis78 sjis78
ibm-942_VSUB_VPUA ibm-932_VSUB_VPUA
ibm-943_P15A-2003 ibm-943 Shift_JIS MS_Kanji csShiftJIS windows-31j
csWindows31J x-sjis x-ms-cp932 cp932 windows-932 cp943c IBM-943C ms932
pck sjis ibm-943_VSUB_VPUA
ibm-943_P130-1999 ibm-943 Shift_JIS cp943 943 ibm-943_VASCII_VSUB_VPUA
ibm-33722_P12A-1999 ibm-33722 ibm-5050 EUC-JP
Extended_UNIX_Code_Packed_Format_for_Japanese csEUCPkdFmtJapanese
X-EUC-JP eucjis windows-51932 ibm-33722_VPUA IBM-eucJP
ibm-33722_P120-1999 ibm-33722 ibm-5050 cp33722 33722
ibm-33722_VASCII_VPUA
ibm-954_P101-2000 ibm-954 EUC-JP
ibm-1373_P100-2002 ibm-1373 windows-950
windows-950-2000 Big5 csBig5 windows-950 x-big5
ibm-950_P110-1999 ibm-950 cp950 950
macos-2566-10.2 Big5-HKSCS big5hk HKSCS-BIG5
ibm-1375_P100-2003 ibm-1375 Big5-HKSCS
ibm-1386_P100-2002 ibm-1386 cp1386 windows-936 ibm-1386_VSUB_VPUA
windows-936-2000 GBK CP936 MS936 windows-936
ibm-1383_P110-1999 ibm-1383 GB2312 csGB2312 EUC-CN ibm-eucCN hp15CN
cp1383 1383 ibm-1383_VPUA
ibm-5478_P100-1995 ibm-5478 GB_2312-80 chinese iso-ir-58 csISO58GB231280
gb2312-1980 GB2312.1980-0
ibm-964_P110-1999 ibm-964 EUC-TW ibm-eucTW cns11643 cp964 964
ibm-964_VPUA
ibm-949_P110-1999 ibm-949 cp949 949 ibm-949_VASCII_VSUB_VPUA
ibm-949_P11A-1999 ibm-949 cp949c ibm-949_VSUB_VPUA
ibm-970_P110-1995 ibm-970 EUC-KR KS_C_5601-1987 windows-51949 csEUCKR
ibm-eucKR KSC_5601 5601 ibm-970_VPUA
ibm-971_P100-1995 ibm-971 ibm-971_VPUA
ibm-1363_P11B-1998 ibm-1363 KS_C_5601-1987 KS_C_5601-1989 KSC_5601
csKSC56011987 korean iso-ir-149 5601 cp1363 ksc windows-949
ibm-1363_VSUB_VPUA
ibm-1363_P110-1997 ibm-1363 ibm-1363_VASCII_VSUB_VPUA
windows-949-2000 windows-949 KS_C_5601-1987 KS_C_5601-1989 KSC_5601
csKSC56011987 korean iso-ir-149 ms949
ibm-1162_P100-1999 ibm-1162
ibm-874_P100-1995 ibm-874 ibm-9066 cp874 TIS-620 tis620.2533 eucTH
cp9066
windows-874-2000 TIS-620 windows-874 MS874
ibm-437_P100-1995 ibm-437 IBM437 cp437 437 csPC8CodePage437 windows-437
ibm-850_P100-1995 ibm-850 IBM850 cp850 850 csPC850Multilingual
windows-850
ibm-851_P100-1995 ibm-851 IBM851 cp851 851 csPC851
ibm-852_P100-1995 ibm-852 IBM852 cp852 852 csPCp852 windows-852
ibm-855_P100-1995 ibm-855 IBM855 cp855 855 csIBM855 csPCp855
ibm-856_P100-1995 ibm-856 cp856 856
ibm-857_P100-1995 ibm-857 IBM857 cp857 857 csIBM857 windows-857
ibm-858_P100-1997 ibm-858 IBM00858 CCSID00858 CP00858
PC-Multilingual-850+euro cp858
ibm-860_P100-1995 ibm-860 IBM860 cp860 860 csIBM8

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 

> -Original Message-
> From: Palle Girgensohn [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, May 07, 2005 11:30 PM
> To: John Hansen; Bruce Momjian
> Cc: pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] Patch for collation using ICU
> 
> 
> 
> --On lördag, maj 07, 2005 23.25.15 +1000 John Hansen 
> <[EMAIL PROTECTED]>
> wrote:
> 
> > Palle Girgensohn wrote:
> >> I'm aware of that. It might help for unicode, but there 
> are a bunch 
> >> of other encodings. IANA has decided that utf-8 has *no* aliases, 
> >> hence only
> >> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is 
> >> fogiving, I don't remember/know, but I think we need the mappings, 
> >> unfortunately.
> >>
> >
> > Here is the list of encoding names and aliases the ICU accepts as of
> > 3.2:
> > (it's a bit long...)
> >
> > UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
> 
> No UTF8 in there. I think that's good, charset aliases are a hassle.

Yup! :)

> 
> /Palle
> 
> 
> 

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 

> -Original Message-
> From: Palle Girgensohn [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, May 07, 2005 11:33 PM
> To: John Hansen; Bruce Momjian
> Cc: pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] Patch for collation using ICU
> 
> 
> 
> --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen 
> <[EMAIL PROTECTED]>
> wrote:
> 
> >>
> >> I use this patch in production on one FreeBSD 4.10 server at the 
> >> moment.
> >> With the latest version, I've had no problems. Logging is 
> swithed on 
> >> for now, and it shows no signs of ICU complaining. I'd like more 
> >> reports on Linux, though.
> >
> > I currently use this on gentoo with ICU3.2 unmasked.
> >
> > Works a dream, even with locale C and UNICODE database.
> 
> Ah, good to hear, John. I beleive your report about linux is 
> what's keeping this back. Did you also manage to get it 
> running on Debian?

Not without ICU3.2

> 
> /Palle
> 
> 

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> Did you try the latest patch? Maybe it will help, and if not, it will
> (hopefully) give a lot more informative error messages.

No, and I got rid of my debian boxes @ home.
The patch required a certain amount of modifications too, to even
compile with 2.8.

So I guess it's a valid question to ask: it it worth supporting 2.8?
It is of course an option to bundle icu 3.2 with pg!


... John

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 

> -Original Message-
> From: Andrew Dunstan [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, May 07, 2005 11:39 PM
> To: John Hansen
> Cc: Palle Girgensohn; Bruce Momjian; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> 
> 
> John Hansen wrote:
> 
> >Here is the list of encoding names and aliases the ICU accepts as of
> >3.2:
> >(it's a bit long...)
> >
> >UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
> >UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
> >
> >  
> >
> [snip]
> 
> Don't we use "unicode" as an alias for UTF-8 ?

Yes, unfortunately!

> 
> cheers
> 
> andrew
> 
> 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Bruce Momjian wrote:
> 
> There are two reasons for that optimization --- first, some 
> locale support is broken and Unicode encoding with a C locale 
> crashes (not an issue for ICU), and second, it is an 
> optimization for languages like Japanese that want to use 
> unicode, but don't need a locale because upper/lower means 
> nothing in those character sets.

No, upper/lower means nothing in those languages, so why would you need
to optimize upper/lower if they're not used??
And if they are, it's obviously because the text contains characters
from other languages (probably english) and as such they should behave
correctly.

Did I mention that for japanese and the like, ICU would also offer
transliteration...

> 
> So, the first issue doesn't apply for ICU, and the second 
> might not depending on what characters you are using in the 
> Unicode character set.
> 
> I guess I am little confused how ICU can do upper() when the 
> locale is C.  What is it using to determine A is upper for a? 
>  Am I confused?

Simple, UNICODE basically consist of a table of characters
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)

Excerpt:

0041;LATIN CAPITAL LETTER A;Lu;0;L;N0061;
...
0061;LATIN SMALL LETTER A;Ll;0;L;N;;;0041;;0041

From this you can see, that for 0041, which is capital letter A, there
is a mapping to it's lowercase counterpart, 0061
Likewise, there is a mapping for 0061 which says it's uppercase
counterpart is 0041.
There is also SpecialCasing.txt which covers those mappings that haven't
got a 1-1 mapping, such as the german SS.

These mappings are fixed, independent of locale, only a few cases from
specialcasing.txt depend on locale/context.



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> It seems 3.2 has much more support and bug fixes, I'm not 
> sure if we should really consider 2.8?

As I said, probably not worth the effort.

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Bruce Momjian wrote:
> Palle Girgensohn wrote:
> > 
> > --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen 
> > <[EMAIL PROTECTED]>
> > wrote:
> > 
> > > Btw, I had been planning to propose replacing every single one of 
> > > the built in charset conversion functions with calls to ICU (thus 
> > > making pg _depend_ on ICU), as this would seem like a cleaner 
> > > solution than for us to maintain our own conversion tables.
> > >
> > > ICU also has a fair few conversions that we do not have 
> at present.
> 
> That is a much larger issue, similar to our shipping our own 
> timezone database.  What does it buy us?
>   
>   o  Do we ship it in our tarball?
>   o  Is the license compatible?
>   o  Does it remove utils/mb conversions?
>   o  Does it allow us to index LIKE (next high char)?
>   o  Does it allow us to support multiple encodings in
>  a single database easier?
>   o  performance?
> 
> > I just had a similar though. And why use ICU only for 
> multibyte charsets? 
> > If I use LATIN1, I still expect upper('?') => SS, and I 
> don't get it... 
> > Same for the Turkish example.
> 
> We assume the native toupper() can handle single-byte 
> character encodings.  We use towupper() only for wide character sets.

That assumption is wrong,...

Encoding latin1
Locale <> de*

Select Upper('ß'); (lowercase german SS)
Should return SS, but returns ß

... John



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Tom Lane wrote:
> "John Hansen" <[EMAIL PROTECTED]> writes:
> > Where'd you get the licence from?
> 
> It was the first thing I came across in their docs:
> 
> http://icu.sourceforge.net/userguide/intro.html
> 
> Looking more closely, it may be that this license is only 
> intended to apply to the documentation and not the code ... 
> though "free" code with un-free documentation isn't real useful.
> 

Right, it seems to apply only to the resources found on sourceforge.

>   regards, tom lane
> 
> 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Tom Lane wrote:
> "John Hansen" <[EMAIL PROTECTED]> writes:
> > Btw, I had been planning to propose replacing every single 
> one of the 
> > built in charset conversion functions with calls to ICU 
> (thus making 
> > pg _depend_ on ICU),
> 
> I find that fairly unacceptable ... especially given the 
> licensing questions, but in any case.

The licencing seems pretty clear to me.
http://www-306.ibm.com/software/globalization/icu/license.jsp

> 
> It might be OK to rip out the existing conversion support and 
> say that *if* you want encoding conversion, you have to use 
> ICU.  But I don't want to be told you cannot build PG without 
> ICU period.

Right, that could be done, but I think the issue at heart is _are_ 
we going to use it at all, and if so, locale support would certainly 
benefit from going that way as well.

> 
> The 3.2 vs 2.8 business is disturbing also; specifically, I 
> don't think we get to require 3.2 on a platform where 2.8 is 
> installed.

There seems to be nothing in the ICU licence that would prevent us from
bundling it.
This would solve both the 3.2 vs 2.8 problems, and would remove the
'dependency'.

> People just aren't going to hold still for that, even 
> assuming that ICU supports installing both versions at once, 
> which isn't clear to me at the moment ...

There's no problems with having both installed.
I did that on debian to get the patch going.
Tho, bundling it seems cleaner to me.

> 
>   regards, tom lane
> 
> 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> I don't buy it. If current conversion tables does the right 
> thing, why we need to replace. Or if conversion tables are 
> not correct, why don't you fix it? I think the rule of 
> character conversion will not change frequently, especially 
> for LATIN languages. Thus maintaining cost is not too high.

I never said we need to, but if we're going to implement ICU,
then we might as well go all the way.

... John


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 10:09 AM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > Bruce Momjian wrote:
> > > 
> > > There are two reasons for that optimization --- first, 
> some locale 
> > > support is broken and Unicode encoding with a C locale 
> crashes (not 
> > > an issue for ICU), and second, it is an optimization for 
> languages 
> > > like Japanese that want to use unicode, but don't need a locale 
> > > because upper/lower means nothing in those character sets.
> > 
> > No, upper/lower means nothing in those languages, so why would you 
> > need to optimize upper/lower if they're not used??
> > And if they are, it's obviously because the text contains 
> characters 
> > from other languages (probably english) and as such they 
> should behave 
> > correctly.
> 
> Yes, Japanese (and probably Chinese and Korean) languages 
> include ASCII character. More precisely ASCII is part of Japanese
> encodings(LATIN1 is not, however). And we have no problem at 
> all with glibc/C locale. See below("unitest" is an UNICODE database).
> 
> unitest=# create table t1(t text);
> CREATE TABLE
> unitest=# \encoding EUC_JP
> unitest=# insert into t1 values('abcããã');
> INSERT 1842628 1
> unitest=# select upper(t) from t1;
>upper   
> ---
>  ABCããã
> (1 row)
> 
> So Japanese(including ASCII)/UNICODE behavior is perfectly 
> correct at this moment. 

Right, so you _never_ use accented ascii characters in Japanese? 
(like à for example, whose uppercase is Ã)

> So I strongly object removing that optimization.

I'm guessing this would call for a vote then, since if implementing ICU, then
I'd have to object to leaving it in.

Changing the bahaviour of ICU doesn't seem right. Changing the behaviour of pg, 
so that it works as it should when using unicode, seems the right solution to 
me.

> --
> Tatsuo Ishii
> 
> 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 12:01 PM
> To: [EMAIL PROTECTED]
> Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> 
> We have developed patches which relaxes the character 
> validation so that PostgreSQL accepts invalid characters. It 
> works like this:

That is just plain 100% wrong!!
Under no circumstances should there be invalid data in a database.
And if you're trying to make a database of invalid data, then at 
least encode it using a valid encoding.

In fact, I've proposed strengthening the validation routines for UTF-8.

... John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Madison Kelly wrote:
>Under most circumstances I would agree with you completely. In my 
> case though I have to decide between risking a loss of a 
> user's data or 
> attempt to store the file name in some manner that would 
> return the same 
> name used by the file system.
> 
>The user (or one of his/her users in the case of an admin) may be 
> completely unaware of the file name being an invalid unicode 
> name. The 
> file itself though may still be quite valid and contain information 
> worthy of backing up. I could notify the user/admin that the 
> name is not 
> valid but there is no way I could rely on the name being 
> changed. Given 
> the choices, I would prefer to attempt to store/use the file 
> name with 
> the invalid unicode character than simply ignore the file.
> 
>Is there a way to store the name in raw binary? If so, 
> would this not 
> be safe because to postgresql it should no longer matter what 
> data is or 
> represents, right? Maybe there is a third option I am not yet 
> concidering?

Set the client_encoding to ascii when storing that name, and again when
retrieving it.
Or, use a bytea column.

> 
> Madison

... John

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Alvaro Herrera wrote:
> Sent: Sunday, May 08, 2005 2:49 PM
> To: John Hansen
> Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; 
> [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > Tatsuo Ishii wrote:
> 
> > > So Japanese(including ASCII)/UNICODE behavior is 
> perfectly correct 
> > > at this moment.
> > 
> > Right, so you _never_ use accented ascii characters in Japanese? 
> > (like è for example, whose uppercase is È)
> 
> That isn't ASCII.  It's latin1 or some other ASCII extension.

Point taken...
But...

If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, 
not UTF-8 (unicode).
UTF-8 encoded databases are very useful for representing multiple languages in 
the same database,
but this usefulness vanishes if functions like upper/lower doesn't work 
correctly.

So optimizing for 3 languages breaks more than a hundred, that's doesn't seem 
fair!

> 
> --
> Alvaro Herrera (<[EMAIL PROTECTED]>) "La principal 
> característica humana es la tontería"
> (Augusto Monterroso)
> 
> 

... John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Patch for collation using ICU

2005-05-08 Thread John Hansen
Tatsuo Ishii
> Sent: Sunday, May 08, 2005 3:41 PM
> To: John Hansen
> Cc: [EMAIL PROTECTED]; pgman@candle.pha.pa.us; 
> [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > Alvaro Herrera wrote:
> > > Sent: Sunday, May 08, 2005 2:49 PM
> > > To: John Hansen
> > > Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> > > pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > 
> > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > > > Tatsuo Ishii wrote:
> > > 
> > > > > So Japanese(including ASCII)/UNICODE behavior is
> > > perfectly correct
> > > > > at this moment.
> > > > 
> > > > Right, so you _never_ use accented ascii characters in 
> Japanese? 
> > > > (like è for example, whose uppercase is È)
> > > 
> > > That isn't ASCII.  It's latin1 or some other ASCII extension.
> > 
> > Point taken...
> > But...
> > 
> > If you want EUC_JP (Japanese + ASCII) then use that as your 
> backend encoding, not UTF-8 (unicode).
> > UTF-8 encoded databases are very useful for representing multiple 
> > languages in the same database, but this usefulness 
> vanishes if functions like upper/lower doesn't work correctly.
> 
> I'm just curious if Germany/French/Spanish mixed text can be 
> sorted correctly. I think these languages need their own 
> locales even with UNICODE/ICU.

No, they will not sort correctly, for that you still need the locale.

> 
> > So optimizing for 3 languages breaks more than a hundred, 
> that's doesn't seem fair!

That is a compromise I'd be willing to agree on. :)
 
> Why don't you add a GUC variable or some such to control the 
> upper/lower behavior?
> --
> Tatsuo Ishii
> 
> 

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 3:31 PM
> To: John Hansen
> Cc: [EMAIL PROTECTED]; pgsql-general@postgresql.org; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> 
> > Tatsuo Ishii wrote:
> > > Sent: Sunday, May 08, 2005 12:01 PM
> > > To: [EMAIL PROTECTED]
> > > Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> > > 
> > > We have developed patches which relaxes the character 
> validation so 
> > > that PostgreSQL accepts invalid characters. It works like this:
> > 
> > That is just plain 100% wrong!!
> >
> > Under no circumstances should there be invalid data in a database.
> > And if you're trying to make a database of invalid data, 
> then at least 
> > encode it using a valid encoding.
> > 
> > In fact, I've proposed strengthening the validation 
> routines for UTF-8.
> 
> Actually I myself thought as you are before. Later I found 
> that it was not so good idea. People already have invalid 
> encoded data in their precious database and have very hard 
> time to migrate to newer version of PostgreSQL because of 
> encoding validation.
> 
> Think about this kind of situation:
> 
> There is a table t1(member_id integer primary key, 
> member_name text, address text, phone text, email text). I 
> have to reach each member by either adress, phone or email. 
> Unfortunately some of address field have wrong encoded data. 
> In this case I will use phone or email to reach them. 
> 
> Now I need to upgrade to newer PostgreSQL within 1 day. I 
> know I have to fix wrong encoded field but it will take more 
> than 1 day. So I would like to import the data first then fix 
> wrong encoded field on running database since I can reach 
> members by phone or email even with wrong encoded address field...

Actually would be very simple, create function isvalidutf8(text) in your
preferred language.
C source is available from unicode.org.
Create function converttoutf8(text) using whatever code is required to
transform the _wrong_ encoding (SQL_ASCII -> UTF8 for instance) to
utf-8.
Update table set field=converttoutf8(field) where !isvalidutf8(field);

Now sit back and relax while your invalid data is converted to utf-8.

When done, pg_dump the database, upgrade, and reload.

This should take less than a day.

> I saw this kind of situation in the real world and that's why 
> we developed the patches.
> --
> Tatsuo Ishii
> 
> 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-08 Thread John Hansen
> The source for ICU 3.2 is 9.8Mb in .tar.gz. PostgreSQL 8.0.2 is 13.2.
> That means the size of the distribution would almost *double* 
> if we bundled ICU.

Ermm,. Don't forget to remove the current charset conversions and locale
support before making your size estimation.

> 
> It's probably fine bundling it in the binary distributions 
> (at least we'd probably do it on win32, since not many ppl 
> will have it already there), but bundling the source seems a 
> bit excessive to me.
> 
> //Magnus
> 
> 

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-08 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 11:19 PM
> To: John Hansen
> Cc: [EMAIL PROTECTED]; pgman@candle.pha.pa.us; 
> [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > > > > > Tatsuo Ishii wrote:
> > > > > 
> > > > > > > So Japanese(including ASCII)/UNICODE behavior is
> > > > > perfectly correct
> > > > > > > at this moment.
> > > > > > 
> > > > > > Right, so you _never_ use accented ascii characters in
> > > Japanese? 
> > > > > > (like è for example, whose uppercase is È)
> > > > > 
> > > > > That isn't ASCII.  It's latin1 or some other ASCII extension.
> > > > 
> > > > Point taken...
> > > > But...
> > > > 
> > > > If you want EUC_JP (Japanese + ASCII) then use that as your
> > > backend encoding, not UTF-8 (unicode).
> > > > UTF-8 encoded databases are very useful for 
> representing multiple 
> > > > languages in the same database, but this usefulness
> > > vanishes if functions like upper/lower doesn't work correctly.
> > > 
> > > I'm just curious if Germany/French/Spanish mixed text can 
> be sorted 
> > > correctly. I think these languages need their own locales 
> even with 
> > > UNICODE/ICU.
> > 
> > No, they will not sort correctly, for that you still need 
> the locale.
> 
> I'm confused. I thought the ICU patches is intended for using 
> on broken locale platforms?

Initially yes, but why duplicate code?
What I meant was, that they will not sort correctly using the C locale.
Locale _name_ needs to be known to ICU for it to sort correctly.

> --
> Tatsuo Ishii
> 
> 

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Patch for collation using ICU

2005-05-08 Thread John Hansen
Tom Lane wrote:
> Sent: Monday, May 09, 2005 2:47 AM
> To: Palle Girgensohn
> Cc: Tatsuo Ishii; John Hansen; [EMAIL PROTECTED]; 
> pgman@candle.pha.pa.us; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU 
> 
> Palle Girgensohn <[EMAIL PROTECTED]> writes:
> >> I'm confused. I thought the ICU patches is intended for using on 
> >> broken locale platforms?
> 
> > It will sort correctly in *one* locale, using ICU. You still cannot 
> > mix different locales in the same database cluster, the collation 
> > locale is still fixed at initdb time, unfortunately.
> 
> I thought the point of using ICU was to be able to dig out 
> from under that restriction?  It's a bit of a large pill to 
> swallow if we will still have to throw it away someday to 
> become SQL spec compliant.

That is not a limitation of ICU but of postgresql.
I don't know what the specs say, but imagine something like:
SELECT foo FROM bar ORDER BY foo WITH LOCALE 'en_US', foobar WITH LOCALE
'jp_JP';

Which would be less difficult to implement using ICU.

> 
>   regards, tom lane
> 
> 

... John

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Patch for collation using ICU

2005-05-08 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 11:08 PM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > > I don't buy it. If current conversion tables does the 
> right thing, 
> > > why we need to replace. Or if conversion tables are not 
> correct, why 
> > > don't you fix it? I think the rule of character 
> conversion will not 
> > > change frequently, especially for LATIN languages. Thus 
> maintaining 
> > > cost is not too high.
> > 
> > I never said we need to, but if we're going to implement 
> ICU, then we 
> > might as well go all the way.
> 
> So you admit there's no benefit using ICU for replacing 
> existing conversions?
> 
> Besides ICU does not support all existing conversions, I 
> think ICU has serious flaw for using conversion. If I 
> understand correctly, ICU uses UNICODE internally to do the 
> conversion. For example, to implement
> SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then
> converts UNICODE to EUC_JP. Problem is these conversion is 
> not roud trip(conversion between SJIS/EUC_JP and UNICODE will 
> lose some information). Thus SJIS->EUC_JP->SJIS conversion 
> using ICU does not preserve original text.

Could you please send me a sample text as an attachment encoded in SJIS
where this would happen?

> --
> Tatsuo Ishii
> 
> 

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Patch for collation using ICU

2005-05-08 Thread John Hansen
 

> -Original Message-
> From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, May 08, 2005 11:08 PM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > > I don't buy it. If current conversion tables does the 
> right thing, 
> > > why we need to replace. Or if conversion tables are not 
> correct, why 
> > > don't you fix it? I think the rule of character 
> conversion will not 
> > > change frequently, especially for LATIN languages. Thus 
> maintaining 
> > > cost is not too high.
> > 
> > I never said we need to, but if we're going to implement 
> ICU, then we 
> > might as well go all the way.
> 
> So you admit there's no benefit using ICU for replacing 
> existing conversions?
> 
> Besides ICU does not support all existing conversions, I 
> think ICU has serious flaw for using conversion. If I 
> understand correctly, ICU uses UNICODE internally to do the 
> conversion. For example, to implement
> SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then
> converts UNICODE to EUC_JP. Problem is these conversion is 
> not roud trip(conversion between SJIS/EUC_JP and UNICODE will 
> lose some information). Thus SJIS->EUC_JP->SJIS conversion 
> using ICU does not preserve original text.

Just for the record, I fetched a web page encoded in sjis, and converted
it to euc-jp and back using uconv from ICU 3.2, and the result is the
original is identical to the transformed file.

 uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
 uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
 diff index.html index.html.sjis

... John

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Patch for collation using ICU

2005-05-09 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Tuesday, May 10, 2005 12:32 AM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > > -Original Message-
> > > From: Tatsuo Ishii [mailto:[EMAIL PROTECTED]
> > > Sent: Sunday, May 08, 2005 11:08 PM
> > > To: John Hansen
> > > Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> > > pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > 
> > > > > I don't buy it. If current conversion tables does the
> > > right thing,
> > > > > why we need to replace. Or if conversion tables are not
> > > correct, why
> > > > > don't you fix it? I think the rule of character
> > > conversion will not
> > > > > change frequently, especially for LATIN languages. Thus
> > > maintaining
> > > > > cost is not too high.
> > > > 
> > > > I never said we need to, but if we're going to implement
> > > ICU, then we
> > > > might as well go all the way.
> > > 
> > > So you admit there's no benefit using ICU for replacing existing 
> > > conversions?
> > > 
> > > Besides ICU does not support all existing conversions, I 
> think ICU 
> > > has serious flaw for using conversion. If I understand correctly, 
> > > ICU uses UNICODE internally to do the conversion. For example, to 
> > > implement
> > > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then
> > > converts UNICODE to EUC_JP. Problem is these conversion 
> is not roud 
> > > trip(conversion between SJIS/EUC_JP and UNICODE will lose some 
> > > information). Thus SJIS->EUC_JP->SJIS conversion using 
> ICU does not 
> > > preserve original text.
> > 
> > Just for the record, I fetched a web page encoded in sjis, and 
> > converted it to euc-jp and back using uconv from ICU 3.2, and the 
> > result is the original is identical to the transformed file.
> > 
> >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html  
> uconv -f 
> > EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc  diff 
> index.html 
> > index.html.sjis
> 
> Not all SJIS/EUC_JP characters have the problem. You might want to
> try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
> 
> BTW, I got this with ICU 3.2:
> 
> $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt 
> Conversion from Unicode to codepage failed at input byte 
> position 0. Unicode: 301c Error: Invalid character found
> 
> The contents of a.txt is 0xa1c1 which is a valid EUC_JP character.

That actually makes perfect sense, since according to unicode.org's
database:
301C ~ WAVE DASH
   This character was encoded to match JIS C 6226-1978 1-33 "wave
dash".
   The JIS standards and some industry practise disagree in mapping.
 - 3030 wavy dash
 - FF5E full width tilde

In PG FF5E is the mapping currently used. That is obviously wrong
(according to the standards), as that is only a 'similar character'.

Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis
doesn't define "WAVE DASH".
In all, I believe this behaviour to be correct according to the
standards.

There'd be nothing to stop us from defining alternative mappings for the
cases where we deviate from the standard, but the question is, should we
be non-standard?

> 
> This makes me nervous in using ICU...
> --
> Tatsuo Ishii
> 
> 

... John

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Patch for collation using ICU

2005-05-10 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Tuesday, May 10, 2005 5:45 PM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > Tatsuo Ishii wrote:
> > > Sent: Tuesday, May 10, 2005 12:32 AM
> > > To: John Hansen
> > > Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> > > pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > 
> > > > > -Original Message-
> > > > > From: Tatsuo Ishii [mailto:[EMAIL PROTECTED]
> > > > > Sent: Sunday, May 08, 2005 11:08 PM
> > > > > To: John Hansen
> > > > > Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> > > > > pgsql-hackers@postgresql.org
> > > > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > > > 
> > > > > > > I don't buy it. If current conversion tables does the
> > > > > right thing,
> > > > > > > why we need to replace. Or if conversion tables are not
> > > > > correct, why
> > > > > > > don't you fix it? I think the rule of character
> > > > > conversion will not
> > > > > > > change frequently, especially for LATIN languages. Thus
> > > > > maintaining
> > > > > > > cost is not too high.
> > > > > > 
> > > > > > I never said we need to, but if we're going to implement
> > > > > ICU, then we
> > > > > > might as well go all the way.
> > > > > 
> > > > > So you admit there's no benefit using ICU for 
> replacing existing 
> > > > > conversions?
> > > > > 
> > > > > Besides ICU does not support all existing conversions, I
> > > think ICU
> > > > > has serious flaw for using conversion. If I understand 
> > > > > correctly, ICU uses UNICODE internally to do the 
> conversion. For 
> > > > > example, to implement
> > > > > SJIS->EUC_JP conversion, ICU first converts SJIS to 
> UNICODE then
> > > > > converts UNICODE to EUC_JP. Problem is these conversion
> > > is not roud
> > > > > trip(conversion between SJIS/EUC_JP and UNICODE will 
> lose some 
> > > > > information). Thus SJIS->EUC_JP->SJIS conversion using
> > > ICU does not
> > > > > preserve original text.
> > > > 
> > > > Just for the record, I fetched a web page encoded in sjis, and 
> > > > converted it to euc-jp and back using uconv from ICU 
> 3.2, and the 
> > > > result is the original is identical to the transformed file.
> > > > 
> > > >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> > > uconv -f
> > > > EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc  diff
> > > index.html
> > > > index.html.sjis
> > > 
> > > Not all SJIS/EUC_JP characters have the problem. You might want to
> > > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
> > > 
> > > BTW, I got this with ICU 3.2:
> > > 
> > > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt 
> Conversion 
> > > from Unicode to codepage failed at input byte position 0. 
> Unicode: 
> > > 301c Error: Invalid character found
> > > 
> > > The contents of a.txt is 0xa1c1 which is a valid EUC_JP character.
> > 
> > That actually makes perfect sense, since according to unicode.org's
> > database:
> > 301C ~ WAVE DASH
> >This character was encoded to match JIS C 6226-1978 
> 1-33 "wave 
> > dash".
> >The JIS standards and some industry practise 
> disagree in mapping.
> >  - 3030 wavy dash
> >  - FF5E full width tilde
> > 
> > In PG FF5E is the mapping currently used. That is obviously wrong 
> > (according to the standards), as that is only a 'similar character'.
> > 
> > Unfortunately, there is no mapping from 301C to shift_jis, as 
> > shift_jis doesn't define "WAVE DASH".
> > In all, I believe this behaviour to be correct according to the 
> > standards.
> > 
> > There'd be nothing to stop us from defining alternative 
> mappings for 
> > the cases where we deviate from the standard, but the question is, 
> > should we be non-standard?
> 
> You missed the point. EUC_JP 0xa1c1 is a perfect valid data 
> and uconv -f EUC_JP -t Shift_JIS should convert it to 
> Shift_JIS 0x8160 regardless of the internal of uconv.

Studying ICU forther, I found that it works fine, provided you use the
_correct_ charset for the conversion..

a.txt contains 0x81 0x60
uconv -f ibm-943_P130-1999 -t EUC_JP a.txt -o b.txt
b.txt now contains 0xa1 0xc1
uconv -t ibm-943_P130-1999 -f EUC_JP b.txt -o a.txt
a.txt still contains 0x81 0x60

The mapping table you want is ibm-943_P130-1999
Similar, we'd need to find the right euc-jp (and plain jis) mapping,
assuming we want the one that strictly defines JIS X 0208 right?

I trust this to put your fears to rest...

> --
> Tatsuo Ishii
> 
> 

... John

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] SQL_ASCII vs. 7-bit ASCII encodings

2005-05-12 Thread John Hansen
> Personally, I'd like UTF8 to be the default encoding :)  This 
> is the 21st century :D

I concur.

... John

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Returning the name of a primary key

2005-05-17 Thread John Hansen
Tom, Juan,

Wouldn't this simple SQL do the trick?

CREATE OR REPLACE FUNCTION pk_column(text) RETURNS SETOF text
AS '
SELECT attname::text
FROM pg_class, pg_constraint, pg_attribute
WHERE pg_class.oid = conrelid
AND contype=''p''
AND attrelid = pg_class.oid
AND attnum = ANY (conkey)
AND relname=$1;
'
LANGUAGE sql VOLATILE STRICT; 

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Tuesday, May 17, 2005 4:49 AM
> To: Juan Pablo Espino
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Returning the name of a primary key 
> 
> Juan Pablo Espino <[EMAIL PROTECTED]> writes:
> > I need to write a function that retrieve the name of at least one 
> > table primary key, if it exists.  The only argument passed to the 
> > function is the table name.  I have thought something like this:
> 
> You need to be searching the list of indexes, not the 
> attributes per se.
> ATExecDropNotNull() might be a useful example.
> 
>   regards, tom lane
> 
> ---(end of 
> broadcast)---
> TIP 9: the planner will ignore your desire to choose an index 
> scan if your
>   joining column's datatypes do not match
> 
> 

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PATCHES] character type value is not padded with spaces

2005-05-23 Thread John Hansen
Ahemm,...

UNICODE DB:

create table t (a char(10));
set client_encoding = iso88591;
insert into t VALUES ('æøå');

select a, octet_length(a),length(a) from t;
 a  | octet_length | length
+--+
 æøå|   13 |  3
(1 row)

This is with 8.0.2.

Just FYI.

... John

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Tatsuo Ishii
> Sent: Tuesday, May 24, 2005 8:52 AM
> To: [EMAIL PROTECTED]
> Cc: pgsql-patches@postgresql.org; pgsql-hackers@postgresql.org
> Subject: Re: [PATCHES] character type value is not padded with spaces
> 
> Hackers,
> 
> The problem he found is not only existing in Japanese 
> characters but also in any multibyte encodings including 
> UTF-8. For me the patch looks good and I will commit it to 
> 7.3, 7.4, 8.0 stables and current if there's no objection.
> --
> Tatsuo Ishii
> 
> > Character type value including multibyte characters is not 
> padded with 
> > spaces. It reproduces at 7.3.x, 7.4.x and 8.0.x.
> > 
> >create table t (a char(10));
> >insert into t values ('X'); -- X is 2byte character.
> > 
> > I expect that 'X ' is inserted. But 'X' is inserted.
> > 
> >select a, octed_length(a) from t;
> > 
> >   a   | octet_length 
> >---+--
> > X |   10
> > 
> > If padded with spaces, octet_length(a) is 15. This problem 
> is caused 
> > that string length is calculated by byte length(VARSIZE) in 
> > exprTypmod().
> > 
> > I attache the patch for this problem.
> > 
> > Regards,
> > 
> > --
> > Yoshiyuki Asaba
> > [EMAIL PROTECTED]
> 
> ---(end of 
> broadcast)---
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to 
> [EMAIL PROTECTED])
> 
> 

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] [PATCHES] character type value is not padded with spaces

2005-05-23 Thread John Hansen
Ahhh... 

> -Original Message-
> From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, May 24, 2005 9:26 AM
> To: John Hansen
> Cc: [EMAIL PROTECTED]; pgsql-patches@postgresql.org; 
> pgsql-hackers@postgresql.org
> Subject: Re: [PATCHES] character type value is not padded with spaces
> 
> I think you need to test with 5 characters, not 3.
> --
> Tatsuo Ishii
> 
> > Ahemm,...
> > 
> > UNICODE DB:
> > 
> > create table t (a char(10));
> > set client_encoding = iso88591;
> > insert into t VALUES ('æøå');
> > 
> > select a, octet_length(a),length(a) from t;
> >  a  | octet_length | length
> > +--+
> >  æøå|   13 |  3
> > (1 row)
> > 
> > This is with 8.0.2.
> > 
> > Just FYI.
> > 
> > ... John
> > 
> > > -Original Message-
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On Behalf Of Tatsuo 
> > > Ishii
> > > Sent: Tuesday, May 24, 2005 8:52 AM
> > > To: [EMAIL PROTECTED]
> > > Cc: pgsql-patches@postgresql.org; pgsql-hackers@postgresql.org
> > > Subject: Re: [PATCHES] character type value is not padded with 
> > > spaces
> > > 
> > > Hackers,
> > > 
> > > The problem he found is not only existing in Japanese 
> characters but 
> > > also in any multibyte encodings including UTF-8. For me the patch 
> > > looks good and I will commit it to 7.3, 7.4, 8.0 stables 
> and current 
> > > if there's no objection.
> > > --
> > > Tatsuo Ishii
> > > 
> > > > Character type value including multibyte characters is not
> > > padded with
> > > > spaces. It reproduces at 7.3.x, 7.4.x and 8.0.x.
> > > > 
> > > >create table t (a char(10));
> > > >insert into t values ('X'); -- X is 2byte character.
> > > > 
> > > > I expect that 'X ' is inserted. But 'X' is inserted.
> > > > 
> > > >select a, octed_length(a) from t;
> > > > 
> > > >   a   | octet_length 
> > > >---+--
> > > > X |   10
> > > > 
> > > > If padded with spaces, octet_length(a) is 15. This problem
> > > is caused
> > > > that string length is calculated by byte length(VARSIZE) in 
> > > > exprTypmod().
> > > > 
> > > > I attache the patch for this problem.
> > > > 
> > > > Regards,
> > > > 
> > > > --
> > > > Yoshiyuki Asaba
> > > > [EMAIL PROTECTED]
> > > 
> > > ---(end of
> > > broadcast)---
> > > TIP 2: you can get off all lists at once with the 
> unregister command
> > > (send "unregister YourEmailAddressHere" to
> > > [EMAIL PROTECTED])
> > > 
> > > 
> > 
> 
> 

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] executing OS programs from pg

2005-06-03 Thread John Hansen
Look at peter eisentraut's procedural language PL/sh
It's on pgfoundry.

... John

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Saturday, June 04, 2005 5:16 AM
> To: Gevik babakhani
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] executing OS programs from pg 
> 
> "Gevik babakhani" <[EMAIL PROTECTED]> writes:
> > Does anyone know how to execute an OS command from pgsql. I 
> would like 
> > to create a trigger that op on firing would run/execute an 
> external program.
> 
> Use any of the "untrusted" PLs to execute system() or the like.
> 
> Whether this is a good idea or not is a different question 
> --- there are excellent reasons why it is a *bad* idea to 
> execute outside-the-database actions from within a trigger.  
> Mainly that the actions won't be undone if the transaction 
> later rolls back, and now your database state is inconsistent 
> with outside-the-database state.  See the list archives for 
> many past discussions of this point and safer ways to design 
> your application.
> 
> (BTW, this is hardly material for -hackers.)
> 
>   regards, tom lane
> 
> ---(end of 
> broadcast)---
> TIP 5: Have you checked our extensive FAQ?
> 
>http://www.postgresql.org/docs/faq
> 
> 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] Unicode characters above 0x10000 #2

2005-06-04 Thread John Hansen
Bruce,

Attached patch replaces the original, applied today against CVS HEAD.
Fixes the surrogates, and limits to 4 byte utf8 as per spec.

Also extends UtfToLocal to 4 byte characters (tho, it does not add any,
just enables the code to handle them. If my interpretation of this code
is wrong, please let me know, and correct it).

... John

> -Original Message-
> From: Bruce Momjian [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, June 05, 2005 11:23 AM
> To: pgman@candle.pha.pa.us
> Cc: John Hansen; pgsql-hackers@postgresql.org; PostgreSQL-patches
> Subject: Re: [PATCHES] Unicode characters above 0x1 #2
> 
> 
> Your patch has been added to the PostgreSQL unapplied patches list at:
> 
>   http://momjian.postgresql.org/cgi-bin/pgpatches
> 
> It will be applied as soon as one of the PostgreSQL 
> committers reviews and approves it.
> 
> --
> -
> 
> 
> pgman wrote:
> > 
> > I have backed out this patch.  It is unclear it is a bug fix.
> > 
> > It will be saved for 8.1.
> > 
> > 
> --
> > -
> > 
> > pgman wrote:
> > > 
> > > Patch applied.  Thanks.
> > > 
> > > 
> 
> > > ---
> > > 
> > > 
> > > John Hansen wrote:
> > > > 3 times lucky?
> > > > 
> > > > Last one broke utf8 G
> > > > 
> > > > This one works, Too tired, sorry for the inconvenience..
> > > > 
> > > > ... John
> > > 
> > > Content-Description: cvs.diff
> > > 
> > > [ Attachment, skipping... ]
> > > 
> > > > 
> > > > ---(end of 
> > > > broadcast)---
> > > > TIP 9: the planner will ignore your desire to choose an 
> index scan if your
> > > >   joining column's datatypes do not match
> > > 
> > > -- 
> > >   Bruce Momjian|  http://candle.pha.pa.us
> > >   pgman@candle.pha.pa.us   |  (610) 359-1001
> > >   +  If your life is a hard drive, |  13 Roberts Road
> > >   +  Christ can be your backup.|  Newtown Square, 
> Pennsylvania 19073
> > 
> > -- 
> >   Bruce Momjian|  http://candle.pha.pa.us
> >   pgman@candle.pha.pa.us   |  (610) 359-1001
> >   +  If your life is a hard drive, |  13 Roberts Road
> >   +  Christ can be your backup.|  Newtown Square, 
> Pennsylvania 19073
> 
> > ===
> > RCS file: /projects/cvsroot/pgsql/src/backend/utils/mb/wchar.c,v
> > retrieving revision 1.38
> > diff -c -r1.38 wchar.c
> > *** src/backend/utils/mb/wchar.c17 Sep 2004 21:59:57 
> - 1.38
> > --- src/backend/utils/mb/wchar.c21 Nov 2004 09:58:36 -
> > ***
> > *** 343,348 
> > --- 343,373 
> > return (pg_euc_dsplen(s));
> >   }
> >   
> > + bool isLegalUTF8(const UTF8 *source, int len) {
> > + UTF8 a;
> > + const UTF8 *srcptr = source+len;
> > + if(!source || (pg_utf_mblen(source) != len)) return false;
> > + switch (len) {
> > + default: return false;
> > + /* Everything else falls through when "true"... */
> > + case 6: if ((a = (*--srcptr)) < 0x80 || a > 
> 0xBF) return false;
> > + case 5: if ((a = (*--srcptr)) < 0x80 || a > 
> 0xBF) return false;
> > + case 4: if ((a = (*--srcptr)) < 0x80 || a > 
> 0xBF) return false;
> > + case 3: if ((a = (*--srcptr)) < 0x80 || a > 
> 0xBF) return false;
> > + case 2: if ((a = (*--srcptr)) > 0xBF) return false;
> > + switch (*source) {
> > + /* no fall-through in this inner switch */
> > + case 0xE0: if (a < 0xA0) return false; break;
> > + case 0xF0: if (a < 0x90) return false; break;
> > + case 0xF4: if (a > 0x8F) return false; break;
> > + default:  if (a < 0x80) return false;
> > + }
> > + case 1: if (*source >= 0x80 && *source < 
> 0xC2) return false;
> > + if (*source > 0xFD) return false;
> > + }
> > +

Re: [HACKERS] unicode upper/lower functions

2005-06-06 Thread John Hansen
Yes,

Thank you! :)

.. John 

> -Original Message-
> From: Bruce Momjian [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, June 07, 2005 10:07 AM
> To: John Hansen
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] unicode upper/lower functions
> 
> 
> I think we have decided to use the ICU library to implement 
> multiple locales.
> 
> ------
> -
> 
> John Hansen wrote:
> > Hi list,
> > 
> > 
> > Attached for your perusal, unicode versions of upper/lower, 
> that work 
> > independent of locale except for the following languages:
> > 
> > Turkish, Azeri, and Lithuanian.
> > There are 15 locale specific cases in total not covered.
> > 
> > 
> > --
> > John Hansen <[EMAIL PROTECTED]>
> > GeekNET
> 
> [ Attachment, skipping... ]
> 
> > 
> > ---(end of 
> > broadcast)---
> > TIP 7: don't forget to increase your free space map settings
> 
> -- 
>   Bruce Momjian|  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us   |  (610) 359-1001
>   +  If your life is a hard drive, |  13 Roberts Road
>   +  Christ can be your backup.|  Newtown Square, 
> Pennsylvania 19073
> 
> 

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] unicode upper/lower functions

2005-06-06 Thread John Hansen
... Except,.. It was never decided if the 'C' locale optimisations was
going to be removed if/when implementing ICU.

Tho I think the  conclusion was a postgresql.conf parameter to
enable/disable the optimisations.
Either way, this code is now obsolete.

... John 

> -Original Message-
> From: Bruce Momjian [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, June 07, 2005 10:07 AM
> To: John Hansen
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] unicode upper/lower functions
> 
> 
> I think we have decided to use the ICU library to implement 
> multiple locales.
> 
> ------
> -
> 
> John Hansen wrote:
> > Hi list,
> > 
> > 
> > Attached for your perusal, unicode versions of upper/lower, 
> that work 
> > independent of locale except for the following languages:
> > 
> > Turkish, Azeri, and Lithuanian.
> > There are 15 locale specific cases in total not covered.
> > 
> > 
> > --
> > John Hansen <[EMAIL PROTECTED]>
> > GeekNET
> 
> [ Attachment, skipping... ]
> 
> > 
> > ---(end of 
> > broadcast)---
> > TIP 7: don't forget to increase your free space map settings
> 
> -- 
>   Bruce Momjian|  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us   |  (610) 359-1001
>   +  If your life is a hard drive, |  13 Roberts Road
>   +  Christ can be your backup.|  Newtown Square, 
> Pennsylvania 19073
> 
> 

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] unicode upper/lower functions

2005-06-06 Thread John Hansen
Bruce Momjian wrote:
> John Hansen wrote:
> > ... Except,.. It was never decided if the 'C' locale 
> optimisations was 
> > going to be removed if/when implementing ICU.
> 
> Uh, why would we remove it?  Oh, meaning if the locale is C 
> we bypass locale lookups?  I think we will have to see what 
> performance we have with things.

Uhh, not quite: If locale is 'C' the current assumption is 7-bit ASCII
for upper/lower/initcap.
ICU is capable of properky doing upper/lower/initcap except for the
cases described in this (obsolete) patch.

> 
> > Tho I think the  conclusion was a postgresql.conf parameter to 
> > enable/disable the optimisations.
> > Either way, this code is now obsolete.
> 
> Thanks.
> 
> -- 
>   Bruce Momjian|  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us   |  (610) 359-1001
>   +  If your life is a hard drive, |  13 Roberts Road
>   +  Christ can be your backup.|  Newtown Square, 
> Pennsylvania 19073
> 
> 

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] PG_FREE_IF_COPY()

2005-06-14 Thread John Hansen
Given the following snippet:

HeapTupleHeader tuple;
Datum   temp;
bool  isnull;

tuple = PG_GETARG_HEAPTUPLEHEADER(0);
temp  = GetAttributeByName(tuple, "data", &isnull);


When using this for a btree operator functions, you need to
PG_FREE_IF_COPY(?,?);

Which of the above parameters need I free?

tuple, or temp, Or both?

... John

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] PG_FREE_IF_COPY()

2005-06-14 Thread John Hansen
Tom Lane [mailto:[EMAIL PROTECTED] wrote:
> "John Hansen" <[EMAIL PROTECTED]> writes:
> > Given the following snippet:
> > HeapTupleHeader tuple;
> > Datum   temp;
> > bool  isnull;
>   
> > tuple = PG_GETARG_HEAPTUPLEHEADER(0);
> > temp  = GetAttributeByName(tuple, "data", &isnull);
> 
> > When using this for a btree operator functions, you need to 
> > PG_FREE_IF_COPY(?,?);
> 
> > Which of the above parameters need I free?
> 
> It'd probably be wise to do a FREE_IF_COPY on the tuple.
> 
> GetAttributeByName, however, hasn't copied anything (if it's 
> a pass-by-ref type you'll just get a pointer into the tuple).
> If you do a copy further down that might be a reason to free 
> something, but this code alone isn't making a copy.

Cool, thanks...

> 
>   regards, tom lane
> 
> 

... John

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


[HACKERS] LGPL

2005-06-14 Thread John Hansen
Is there any reason why we would not be able to use LGPL code in PG?

... John

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] LGPL

2005-06-14 Thread John Hansen
What about GPL ?
I assume that's out of the question!

> -Original Message-
> From: Marc G. Fournier [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, June 15, 2005 11:59 AM
> To: John Hansen
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] LGPL
> 
> 
> We already do ... libreadline ...
> 
> On Wed, 15 Jun 2005, John Hansen wrote:
> 
> > Is there any reason why we would not be able to use LGPL code in PG?
> >
> > ... John
> >
> > ---(end of 
> > broadcast)---
> > TIP 3: if posting/reading through Usenet, please send an appropriate
> >  subscribe-nomail command to [EMAIL PROTECTED] 
> so that your
> >  message can get through to the mailing list cleanly
> >
> >
> >
> 
> 
> Marc G. Fournier   Hub.Org Networking Services 
> (http://www.hub.org)
> Email: [EMAIL PROTECTED]   Yahoo!: yscrappy 
>  ICQ: 7615664
> 
> 

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] LGPL

2005-06-14 Thread John Hansen
Ooooh

I got the impression that using GPL libraries was a Bad Thing(tm)

... John 

> -Original Message-
> From: Andrew Dunstan [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, June 15, 2005 12:15 PM
> To: Marc G. Fournier
> Cc: John Hansen; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] LGPL
> 
> Er, no. It's GPL, not LGPL software. My readline.h says:
> 
>The GNU Readline Library is free software; you can redistribute it
>and/or modify it under the terms of the GNU General Public License
>as published by the Free Software Foundation; either version 2, or
>(at your option) any later version.
> 
> 
> see this ancient thread: 
> http://archives.postgresql.org/pgsql-hackers/2000-12/msg01029.php
> 
> cheers
> 
> andrew
> 
> 
> Marc G. Fournier wrote:
> 
> >
> > We already do ... libreadline ...
> >
> > On Wed, 15 Jun 2005, John Hansen wrote:
> >
> >> Is there any reason why we would not be able to use LGPL 
> code in PG?
> >>
> >> ... John
> >>
> >> ---(end of 
> >> broadcast)---
> >> TIP 3: if posting/reading through Usenet, please send an 
> appropriate
> >>  subscribe-nomail command to [EMAIL PROTECTED] 
> so that your
> >>  message can get through to the mailing list cleanly
> >>
> >>
> >>
> >
> > 
> > Marc G. Fournier   Hub.Org Networking Services 
> > (http://www.hub.org)
> > Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 
> > 7615664
> >
> > ---(end of 
> > broadcast)---
> > TIP 4: Don't 'kill -9' the postmaster
> >
> 
> 

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] LGPL

2005-06-14 Thread John Hansen
So, what's the story with readline? 

> -Original Message-
> From: Bruce Momjian [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, June 15, 2005 12:11 PM
> To: John Hansen
> Cc: Marc G. Fournier; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] LGPL
> 
> John Hansen wrote:
> > What about GPL ?
> > I assume that's out of the question!
> 
> If we add some GPL code, the entire binary becomes GPL, and 
> that prevents closed-source commercial versions from being produced.
> 
> --
> -
> 
> 
> > 
> > > -Original Message-
> > > From: Marc G. Fournier [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, June 15, 2005 11:59 AM
> > > To: John Hansen
> > > Cc: pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] LGPL
> > > 
> > > 
> > > We already do ... libreadline ...
> > > 
> > > On Wed, 15 Jun 2005, John Hansen wrote:
> > > 
> > > > Is there any reason why we would not be able to use 
> LGPL code in PG?
> > > >
> > > > ... John
> > > >
> > > > ---(end of
> > > > broadcast)---
> > > > TIP 3: if posting/reading through Usenet, please send 
> an appropriate
> > > >  subscribe-nomail command to [EMAIL PROTECTED]
> > > so that your
> > > >  message can get through to the mailing list cleanly
> > > >
> > > >
> > > >
> > > 
> > > 
> > > Marc G. Fournier   Hub.Org Networking Services 
> > > (http://www.hub.org)
> > > Email: [EMAIL PROTECTED]   Yahoo!: yscrappy 
> > >  ICQ: 7615664
> > > 
> > > 
> > 
> > ---(end of 
> > broadcast)---
> > TIP 1: subscribe and unsubscribe commands go to 
> > [EMAIL PROTECTED]
> > 
> 
> -- 
>   Bruce Momjian|  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us   |  (610) 359-1001
>   +  If your life is a hard drive, |  13 Roberts Road
>   +  Christ can be your backup.|  Newtown Square, 
> Pennsylvania 19073
> 
> 

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] LGPL

2005-06-14 Thread John Hansen
> Agreed.
> 
> With libreadline, we are not taking their code or 
> distributing it, but merely linking to it if it exists.  Now, 
> some say that is enough to make us GPL, but many don't agree 
> with that interpretation.


Right,. That's actually exactly what I meant: using GPL/LGPL libraries
by linking to them.


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] LGPL

2005-06-14 Thread John Hansen
Tom Lane [mailto:[EMAIL PROTECTED] Wrote:
> "John Hansen" <[EMAIL PROTECTED]> writes:
> > Is there any reason why we would not be able to use LGPL code in PG?
> 
> Another point of view on this: it's OK to use LGPL code if 
> it's available on the local platform, so long as we don't 
> *require* it to be present.  It's even safer if the LGPL code 
> is merely one implementation of an API that has other 
> implementations under different licenses.
> For instance I have no fear at all of linking to glibc, and 
> little of linking to libreadline (the latter because we can 
> also use the BSD libedit).
> 
> If we could not build without libreadline then we would have 
> a very big problem.  And we certainly aren't going to 
> textually incorporate any new LGPL (or GPL) code into our 
> distribution.

Right,... Let me be more specific then,

What are your thoughts on using the glib
(http://developer.gnome.org/doc/API/2.2/glib/index.html) library for
some functionality in pg?

Additionally,. I came across this fine library
(http://home.gna.org/uri/uri.en.html) which I'd like to use as a base
for a new URI type, unfortunately it's GPL, so based on the above I'm
guessing using it as is, is out of the question?

> 
>   regards, tom lane
> 
> 

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] LGPL

2005-06-14 Thread John Hansen
Tom Lane [mailto:[EMAIL PROTECTED] Wrote:
> "John Hansen" <[EMAIL PROTECTED]> writes:
> > Right,... Let me be more specific then,
> 
> > What are your thoughts on using the glib
> > (http://developer.gnome.org/doc/API/2.2/glib/index.html) 
> library for 
> > some functionality in pg?
> 
> Right offhand that seems like a nonstarter.  Exactly how 
> would you use it in a way that didn't turn it into a required 
> component?  It looks to me like a collection of bits that are 
> pretty useful but also very low-level, and hence not easily separable.

K, that's what confused me as I got the impression it was ok to require
LGPL libraries but not GPL.

> 
> > Additionally,. I came across this fine library
> > (http://home.gna.org/uri/uri.en.html) which I'd like to use 
> as a base 
> > for a new URI type, unfortunately it's GPL, so based on the 
> above I'm 
> > guessing using it as is, is out of the question?
> 
> Sure, you can do whatever you like with that ... as long as 
> you're not expecting us to distribute the combined code as 
> part of Postgres.
> 
> It's worth reiterating here that GPL/LGPL code plus BSD code 
> is no problem whatever for local development and use.  It's 
> only if you want to redistribute the result that you have to 
> worry about what the licenses require.  Since Postgres is a 
> BSD-license project, *we* are not going to redistribute any 
> GPL or LGPL code, nor any code that fundamentally depends on 
> code that is so licensed.  But you can pretty much do what 
> you like in your own sandbox.  In particular, you could 
> develop a datatype that requires a GPL/LGPL library, and then 
> distribute that code by itself as GPL/LGPL, and neither the 
> GPL nor BSD camps would have any problem with that.  Just 
> don't expect us to put such code in a BSD distribution ...

That's what I was afraid of

> 
>   regards, tom lane
> 
> 

... John

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] LGPL

2005-06-15 Thread John Hansen
Josh,

> Both of these would be fine as add-ins to be distributed *separately*
through 
> pgFoundry or even the mirrors if they prove popular.   
> Bundling them in unified distribution binaries with PostgreSQL would
be a 
> significant problem. 
> 
> You see this in other projects all the time:  "Requriements: 
> __, which is GPL and can be downloaded from __ ." 
>  We've managed so far to avoid needing external libraries 
> which are not standard on most POSIX platforms, and it would 
> be nice to keep it that way instead of doing the "component 
> easter egg hunt" (which users of Linux multimedia apps are 
> familiar with).
> 
> This means that you're unlikely to be able to use glib unless 
> it becomes standard on POSIX platforms, and someone makes a 
> Windows port.  Out of curiosity, what did you want to use it *for*?

Ohh,. Just felt like a cleaner interface than ICU,

> As for a URI type, I don't see the problem with doing that as 
> a PostgreSQL add-in downloadable from PGFoundry.  Given the 
> variety of URI implementations, I'm not sure we'd want a 
> single URI type as standard anyway.  

That I don't know, Yet... However what I've come up with so far, has
proven quite useful.

Remember the all famous 'email' type?
This has the same functionality, if not better

create table email_addresses (email_address text,uri uri);
insert into email_addresses (email_address) VALUES
('[EMAIL PROTECTED]');
update email_addresses set uri = 'mailto:'||email_address::text; <--
could be a rule on insert!
select (uri).username,(uri).host from email_addresses ;
 username |  host
--+
 john | geeknet.com.au
(1 row)

And here's the really funky bit:

select email_address = '[EMAIL PROTECTED]' from email_addresses ;
 ?column?
--
 f
(1 row)

select uri = 'mailto:[EMAIL PROTECTED]'::text::uri from
email_addresses ;
 ?column?
--
 t
(1 row)

As it should, since email sent to the two email addresses would end up
in the same mailbox
The same applies to other URI formats of course. It appears this uri
library is fully spec compliant.

> According to the FSF's junior licensing maven, building in a 
> GPL data type or other plug-in would make *your instance* of 
> PostgreSQL GPL, but so does PL/R and PostGIS, so that's 
> nothing new.  It just needs to be distributed separately.
> 
> FYI, the reason the GPL "linking" issue is vague is that it 
> depends on local copyright law, which varies from country to 
> country and in the US from state to state.  This is 
> deliberate by the FSF because an agreement which depends on 
> local copyright law is stronger in court than one which sets its own 
> explicit terms.   If anyone has nuts-and-bolts questions 
> about GPL/LGPL 
> issues, I have some friends at the FSF and can get answers 
> from "the horse's mouth."
> 

Thanks for the explanation

> -- 
> Josh Berkus
> Aglio Database Solutions
> San Francisco
> 
> 

... John

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] max_fsm_pages >800k ... ?

2005-06-15 Thread John Hansen
INFO:  analyzing "pg_catalog.pg_depend"
INFO:  "pg_depend": 27 pages, 3866 rows sampled, 3866 estimated total
rows
INFO:  free space map: 423 relations, 88475 pages stored; 431200 total
pages needed
DETAIL:  Allocated FSM size: 4000 relations + 8 pages = 705 kB
shared memory. 

This, on a database with a moderate amount of rows (<10 Million)
So,.. I'd say fairly normal.

... John
> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Marc 
> G. Fournier
> Sent: Thursday, June 16, 2005 3:57 AM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] max_fsm_pages >800k ... ?
> 
> 
> Without more information to provide at this time, does the 
> following seem "unusual"?
> 
> INFO:  free space map: 252 relations, 411494 pages stored; 
> 738640 total pages needed
> DETAIL:  Allocated FSM size: 2000 relations + 40 pages = 
> 2463 kB shared memory.
> 
> I know there are alot of factors involved in the above, which 
> I'm looking into, but when I first saw the above, I just 
> about had a heartattack, only because i've never seen such 
> high #s for total pages needed ...
> 
> ... Is this something that others are seeing as relatively normal?
> 
> 
> 
> 
> 
> Marc G. Fournier   Hub.Org Networking Services 
> (http://www.hub.org)
> Email: [EMAIL PROTECTED]   Yahoo!: yscrappy 
>  ICQ: 7615664
> 
> ---(end of 
> broadcast)---
> TIP 8: explain analyze is your friend
> 
> 

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[HACKERS] Returning Composite Types from C functions

2005-06-18 Thread John Hansen
Hi all,

CREATE TYPE my_type AS (
a int,
b int,
c int,
d int,
e int
);

CREATE FUNCTION text_to_my_type(text)
RETURNS my_type
AS 'my_lib.so'
LANGUAGE 'C' IMMUTABLE STRICT;


CREATE CAST (text AS my_type) WITH FUNCTION text_to_my_type (text);

SELECT ('1:2:3:4:5'::text::my_type).*;

This results in the text_to_my_type(text) function being called no less
than 5 times. Once for each element.

Is this the desired behaviour, or a bug?


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Returning Composite Types from C functions

2005-06-18 Thread John Hansen
Michael Fuhr [mailto:[EMAIL PROTECTED] Wrote:
> Sent: Saturday, June 18, 2005 9:56 PM
> To: John Hansen
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Returning Composite Types from C functions
> 
> On Sat, Jun 18, 2005 at 09:18:34PM +1000, John Hansen wrote:
> > 
> > SELECT ('1:2:3:4:5'::text::my_type).*;
> > 
> > This results in the text_to_my_type(text) function being called no 
> > less than 5 times. Once for each element.
> > 
> > Is this the desired behaviour, or a bug?
> 
> It's a known behavior with functions that return composite types.
> Apparently it's not easy to fix:
> 
> http://archives.postgresql.org/pgsql-hackers/2005-04/msg00971.php

There is a workaround tho, so should be fixable:

SELECT (a.b).* FROM (SELECT ('1:2:3:4:5'::text::my_type) AS b) AS a;

Or am I missing something?

> 
> --
> Michael Fuhr
> http://www.fuhr.org/~mfuhr/
> 
> 

... John


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Returning Composite Types from C functions

2005-06-18 Thread John Hansen
Yes, it worked for me,... 

But my point is the workaround shouldn't be nescessary 

> -Original Message-
> From: Tom Lane [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, June 18, 2005 11:36 PM
> To: John Hansen
> Cc: Michael Fuhr; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Returning Composite Types from C functions 
> 
> "John Hansen" <[EMAIL PROTECTED]> writes:
> > There is a workaround tho, so should be fixable:
> > SELECT (a.b).* FROM (SELECT ('1:2:3:4:5'::text::my_type) AS 
> b) AS a; 
> > Or am I missing something?
> 
> Try it ;-)
> 
>   regards, tom lane
> 
> 

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] query plan ignoring check constraints

2005-06-20 Thread John Hansen
Someone Wrote:

> Should not check constraint act as the first filter? The index should 
> ideally be scanned only when the check constraint is passed by the
search 
> criteria but surprisingly it did not happen. The explain analyze
showed 
> cost for index scans of subtables that cannot contain rows matching
the 
> search criteria.

Obviously, indexes on columns with a check constraint, should be
qualified with the same check constraint.

test=# CREATE TABLE test (
   foo text check(foo IN ('YES','NO'))
);
CREATE TABLE
test=# CREATE INDEX text_foo_idx ON test (foo) WHERE foo IN('YES','NO');
CREATE INDEX
test=# INSERT INTO test VALUES ('YES');
INSERT 280188 1
test=# INSERT INTO test VALUES ('NO');
INSERT 280189 1
test=# INSERT INTO test VALUES ('no');
ERROR:  new row for relation "test" violates check constraint
"test_foo_check"
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE foo = 'YES';
 QUERY PLAN



 Index Scan using text_foo_idx on test  (cost=0.00..5.82 rows=7
width=32) (actual time=0.369..0.376 rows=1 loops=1)
   Index Cond: (foo = 'YES'::text)
 Total runtime: 0.490 ms
(3 rows)
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE foo = 'no';
   QUERY PLAN



 Seq Scan on test  (cost=0.00..25.38 rows=7 width=32) (actual
time=0.358..0.358 rows=0 loops=1)
   Filter: (foo = 'no'::text)
 Total runtime: 0.421 ms
(3 rows)
test=# 

... John

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] query plan ignoring check constraints

2005-06-21 Thread John Hansen
Bruno Wolff III [mailto:[EMAIL PROTECTED] Wrote

> I think the real problem is that check constraints on tables 
> aren't used by the optimizer. Given that, what you have below 
> is expected.
> There has been talk about that in the past, but I haven't 
> heard anything recently about someone considering implenting that.
> 
> For your problem consider not using a partial index. It isn't 
> going to save anything if it has a constraint matching that 
> of the table.


Ahh, I get it now,... 

  If a column has a CHECK (col IN (1,2,3)) and a query says .. WHERE col
= 4; then the planner should 
  know that the query will return 0 rows, right?

... John

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] query plan ignoring check constraints

2005-06-21 Thread John Hansen
Bruno Wolff III [mailto:[EMAIL PROTECTED] Wrote:

> You only want to use partial indexes when they don't cover 
> the whole table. They make sense to enforce uniqueness of a 
> column under some condition and when you can save significant 
> space (becuase the condition is only satisfied for a small 
> fraction of rows).


Yes, I know that,. 

I misunderstood the original post as a request for queries NOT to use
indexes where it doesn't match the table contents.

.. John

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] GiST rtree logic is not right

2005-06-22 Thread John Hansen
> I'll look at problem after GiST concurrency. Fixing 
> rtree_gist is bug a fix, not a new feature, so I'm not 
> limited by 1 July.

Wont fixing rtree(_gist) require initdb, since the behaviour of the
operators will change?

... John


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PATCHES] Function's LEAST, GREATEST and DECODE (Oracle vararg polymorphic functions)

2005-06-24 Thread John Hansen
I'd vote that these functions should follow the semantics of the <, and
> operators.

(NULL < x) is NULL;

... John 


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Friday, June 24, 2005 11:21 PM
> To: Pavel Stehule
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PATCHES] Function's LEAST, GREATEST 
> and DECODE (Oracle vararg polymorphic functions) 
> 
> [ moving to -hackers for a wider audience ]
> 
> Today's issue: should the GREATEST/LEAST functions be strict 
> (return null if any input is null) or not (return null only 
> if all inputs are null, else return the largest/smallest of 
> the non-null inputs)?
> 
> Pavel Stehule <[EMAIL PROTECTED]> writes:
> > On Thu, 23 Jun 2005, Tom Lane wrote:
> >> Pavel Stehule <[EMAIL PROTECTED]> writes:
> >> +  /* If any argument is null, then result 
> is null (for GREATEST 
> >> + and LEAST)*/
> >> 
> >> Are you sure about that?  The only reference I could find 
> says that 
> >> these functions are not strict in Oracle:
> >> 
> >> 
> http://download-east.oracle.com/otn_hosted_doc/rdb/pdf/sql_ref_v71_vo
> >> l1.pdf
> >> on page 2-185:
> >> 
> >>> The NULL keyword can appear in the list but is ignored. 
> However, not 
> >>> all value expressions can be specified as NULL. That is, 
> a non-NULL 
> >>> value expression must be in the list so that the data 
> type for the 
> >>> expression can be determined.
> >>> The GREATEST and LEAST functions can result in NULL only 
> if at run 
> >>> time all value expressions result in NULL.
> >> 
> >> The strict interpretation is mathematically cleaner, no doubt, but 
> >> offhand it seems less useful.
> >> 
> 
> > I know it, But when moustly PostgreSQL function is strict I 
> desided so 
> > greatest and least will be strict. There is two analogy:
> 
> > one, normal comparing which implicate strinct aggregate 
> function which 
> > ignore NULL.
> 
> > Tom I don't know, what is better. Maybe Oracle,
> 
> > because
> 
> > least(nullif(col2, +max), nullif(col2, +max)) isn't really 
> readable, 
> > but it's "precedens" for PostgreSQL. I selected more conservative 
> > solution, but my patches are only start points for 
> discussion (really) :).
> 
> > Please, if You think, so Oracle way is good, correct it.
> 
> I'm still favoring non-strict but it deserves more than two votes.
> Anybody else have an opinion?
> 
>   regards, tom lane
> 
> ---(end of 
> broadcast)---
> TIP 8: explain analyze is your friend
> 
> 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] contrib/rtree_gist into core system?

2005-06-26 Thread John Hansen
Tom Lane Wrote:

> ... but rtree has always 
> been marginal, and it's very hard to see where it can win over gist.

Simplicity!

Implementing rtree operators and support functions is FAR simpler than
implementing the GiST equivalents.

For example, suppose all you want to implement is the ~ operator for a
custom type, then technically all you need is 4 functions (well, 5
including the stub operators)

bool contains(type,type);
type intersect(type,type);
type union(type,type);
void size(type,*float);

And the 6 other operators simply defined as:
bool false(type) { return false; }

For GiST you still need 7 support functions + the operator function,
some of which aren't exactly simple to implement, the picksplit for
instance.

So I'd not recommend getting rid of rtree just yet. At least not until
someone has written an extensive howto on the subject of GiST
implementation.

... John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] contrib/rtree_gist into core system?

2005-06-27 Thread John Hansen
Tom Lane [mailto:[EMAIL PROTECTED] Wrote:

> There's no HOWTO for rtree either.  Again, my point is not 
> that one couldn't be written; it's that we would probably be 
> better off spending the effort on a HOWTO for gist.

No, but the _current_ implementation of the rtree operators are ver much
self explaining and need no howto.

Union(x,y) = x + y
Intersect(x,y) = the values that are present in both x and y, or
_overlapping_region_
Size(x) = the size of the area/length of the line, number of elements,
etc...

Now, how simple is that compared to gist?

I for one, is yet to produce a working example of something as simple as
indexing an array of 2 elements [x y] represented by a custom type as
'[x y]' in string format (returned by type_out) internally stored as a
char[2], so that I can fetch all rows where [x y] = ':y' (:y meaning 2nd
element in array, x: meaning first element in array.

I chose this as something simple to play with, having no practical
application for me, but to get an understanding of gist, For now,. I
have put it in the too hard basket.

I did however in about half a day implement rtree support for inet/cidr
(ipv4 only) as you might recall.

Kind Regards,

John


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] How to get the Oid of the Primary key relation ??

2004-05-27 Thread John Hansen
On Fri, 2004-05-28 at 06:07, Ramanujam H S Iyengar wrote:
> Can some one help me in the question !!!
> 
> Given the Oid of a relation .. how to get the Oid of the Primary key index 
> on that relation ?
> 
Attached, a couple of function I wrote that might do what you're after.
> 
> Thanks in adv,
> 
> -Ramu
> 
> _
> Post Classifieds on MSN classifieds. http://go.msnserver.com/IN/44045.asp
> Buy and Sell on MSN Classifieds.
> 
> 
> ---(end of broadcast)---
> TIP 4: Don't 'kill -9' the postmaster

Regards,

John
--
-- PostgreSQL database dump
--

SET SESSION AUTHORIZATION 'postgres';

--
-- select pk_column('tablename');
-- returns the name of the  primary key column
--
CREATE FUNCTION pk_column(text) RETURNS text
AS '
select attname::text
	from pg_class, pg_constraint, pg_attribute
	where pg_class.oid = conrelid
	and contype=''p'' 
	and attrelid = pg_class.oid 
	and attnum in (array_to_string(conkey,'',''))
	and relname=$1;
'
LANGUAGE sql STABLE;


--
-- select last_insert_id('tablename','columnname');
-- returns currval of the sequence assigned as default to columnname
--
CREATE FUNCTION last_insert_id(text, text) RETURNS bigint
AS '
select currval(
(select 
split_part(adsrc,,2) as seq
from pg_class
join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
join pg_attrdef 
on (pg_attrdef.adnum = pg_attribute.attnum 
and pg_attrdef.adrelid = pg_attribute.attrelid)
where pg_class.relname = $1
and pg_attribute.attname = $2)
);
'
LANGUAGE sql;


--
-- select last_insert_id('tablename');
-- returns currval of the sequence assigned as default to the primary key column
--
CREATE FUNCTION last_insert_id(text) RETURNS bigint
AS '
select last_insert_id($1,pk_column($1));
'
LANGUAGE sql;

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] tablespaces and DB administration

2004-05-31 Thread John Hansen
On Fri, 2004-05-28 at 08:15, [EMAIL PROTECTED] wrote:
> > [EMAIL PROTECTED] wrote:
> >
> >>You are absolutely wrong on all accounts here. A RAID5 system is slower
> >>than a single spindle as it is only as fast as the slowest disk in the
> >>stripe and the overhead of the RAID.
> >>
> > Huh, what kind of controller do you use... Sounds like some "value" IDE
> > one. I'd never suggest IDE raid5 for DBMS purposes anyway.
> 
> Actually, my RAID system, currently on my test system, is fully UWLVD SCSI
> with fast spindles.
> 
> Here is a logical factual question for you to answer: how can a set of
> disks, lets say 7, 6 data drives with one parity, deliver results faster
> than the slowest drive in the stripe?
> 
> If you say predictive and intelligent caching, yea, maybe, but *all* disks
> today have caching, but the initial request still has to wait for the
> longest seek time across all spindles and the slowest spindle position.
> I've been dealing with RAID systems for almost a decade now, and they are
> not a magic bullet.
> 
> RAID systems are always slower than their compnent disks. This is the
> drawback to using them and a fundimental limitation. A single disk will
> average 1/2 spindle seek, assuming its initial head placement is random,
> and average 1/2 spindle revolution to track, assuming no out of order
> sector access. A RAID system has to wait for the slowest disk, thus while
> a single disk can average 1/2 seek and rotation, two disks will not. So,
> your raid disk access will ALWAYS be slower or as slow as a single disk
> access not including the additional RAID processing.
> 

Some high end SCSI drives comes with an option for using an external
source for spindle syncronization. These drives will thus not have to
wait for rotation, as head positions are aligned.

> The advantage to a RAID is that a number of smaller disks can look like a
> big disk with some redundency. The advantage to a RAID controller is that
> the RAID processing and parity generation overhead is done on an external
> device. Using a RAID controller that presents a SCSI LUN is great because
> you don't need to trust third party drivers. All in all, RAID is a good
> idea, but it isn't faster.
> 
> As for IDE RAID, IDE RAID is an awesome idea. SCSI disks are just too
> expensive. Infortrend has a cool IDE to SCSI or Fibre RAID system that
> rocks.
> 
Addonics has these too, I've been using them with great results.
> 
> 
> 
> ---(end of broadcast)---
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] ACLs versus ALTER OWNER

2004-06-02 Thread John Hansen
On Wed, 2004-06-02 at 18:44, Christopher Kings-Lynne wrote:
> > REVOKE ALL ON TABLE foo FROM GRANTOR [USER] alice;
> > 
> > The super user must really be a *super* user.
> 
> I think we need a proper 'effective user' facility.
> 
> At the moment, there's breakage if a super user creates a language, then 
> drops their superuser privs, then the dump cannot be restored.
> 
> All other failure cases also exist.  eg if a gumby user creates a table 
> in a schema, then has his permission to create tables in that schema 
> revoked.  The dump will be broken.
> 
> The solution seems to me that we need to have an 'effective_user' SET 
> option so that the superuser doing the restore can still create tables 
> owned by the gumby, even though the gumby does not have privileges to do 
>   so.

If I remember correctly, we already have this option.
ALTER table OWNER to newowner;
Perhaps pg_dump should just include; ALTER relation OWNER to
originalowner; at the end of the dump, instead of connecting as the
owner to restore it.

> 
> Chris
> 
> 
> ---(end of broadcast)---
> TIP 3: if posting/reading through Usenet, please send an appropriate
>   subscribe-nomail command to [EMAIL PROTECTED] so that your
>   message can get through to the mailing list cleanly

Regards,

John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Cannot initdb in cvs tip

2004-06-19 Thread John Hansen
On Sun, 2004-06-20 at 08:04, Dave Page wrote:
>  
> > -Original Message-
> > From: Tom Lane [mailto:[EMAIL PROTECTED] 
> > Sent: 19 June 2004 00:22
> > To: Dave Page
> > Cc: PostgreSQL-development
> > Subject: Re: [HACKERS] Cannot initdb in cvs tip 
> > 
> > "Dave Page" <[EMAIL PROTECTED]> writes:
> > > I'm getting the following error when trying to initdb with CVS tip.
> > 
> > > creating template1 database in 
> > C:/msys/1.0/local/pgsql/data/base/1 ...
> > > ERROR:  could not open segment 1 of relation 1663/1/1255 
> > (target block
> > > 26189776): No such file or directory
> > 
> > The target block number is obviously broken :-(.  But maybe 
> > you have a build consistency problem --- did you try a make 
> > distclean and full rebuild?
> 
> OK, that cured that one - thanks.
> 
> > > although it says it's clearing the contents of the directory, in 
> > > actual fact it leaves the directory structure in place, thus a 
> > > subsequent initdb will not run without a manual clearup.
> > 
> > Hm.  The rmtree() function in initdb.c is responsible for 
> > this, and I see it has WIN32-specific behavior, which is 
> > evidently wrong.
> > Can you recommend a fix?
> 
> The current solution does an "rmdir /q /s $PGDATA" if the datadir was
> created, and "del /q /s $PGDATA" if the directory already existed. The
> second case  will not work, as del will not remove directories. AFAICS,
> there is no easy way to do this using system() as rmdir won't accept
> wildcards, so we can't do "del $PGDATA/* && rmdir $PGDATA/*".
> 
> It seems to me that the simple answer is to put Andrew's recursive
> unlink code back in (as he suggested), which Bruce removed as rm etc.
> were being used in commands/dbcommands.c (which should work fine under
> Windows). Patch below
> 

you could of course rmdir /s /q $PGDATA && mkdir $PGDATA if the purpose
is to leave the directory intact if it already existed prior to install.

Regards,

John
> Regards, Dave
> 
> *** initdb.c.orig Sat Jun 19 22:15:28 2004
> --- initdb.c  Sat Jun 19 23:02:10 2004
> ***
> *** 132,137 
> --- 132,144 
>   static void *xmalloc(size_t size);
>   static char *xstrdup(const char *s);
>   static bool rmtree(char *path, bool rmtopdir);
> + 
> + #ifdef WIN32
> + static int  init_unlink(const char *);
> + #else
> + #define init_unlink(x) unlink( (x) )
> + #endif   /* WIN32 */
> + 
>   static char **replace_token(char **lines, char *token, char
> *replacement);
>   static char **readfile(char *path);
>   static void writefile(char *path, char **lines);
> ***
> *** 245,264 
>   static bool
>   rmtree(char *path, bool rmtopdir)
>   {
> ! charbuf[MAXPGPATH + 64];
>   
> ! #ifndef WIN32
> ! /* doesn't handle .* files, but we don't make any... */
> ! snprintf(buf, sizeof(buf), "rm -rf \"%s\"%s", path,
> !  rmtopdir ? "" : "/*");
> ! #else
> ! snprintf(buf, sizeof(buf), "%s /s /q \"%s\"",
> !  rmtopdir ? "rmdir" : "del", path);
> ! #endif
>   
> ! return !system(buf);
>   }
>   
>   
>   /*
>* make a copy of the array of lines, with token replaced by
> replacement
> --- 252,349 
>   static bool
>   rmtree(char *path, bool rmtopdir)
>   {
> ! charfilepath[MAXPGPATH];
> ! DIR*dir;
> ! struct dirent *file;
> ! char  **filenames;
> ! char  **filename;
> ! int numnames = 0;
> ! struct stat statbuf;
>   
> ! /*
> !  * we copy all the names out of the directory before we start
> modifying
> !  * it.
> !  *
> !  */
> ! 
> ! dir = opendir(path);
> ! if (dir == NULL)
> ! return false;
>   
> ! while ((file = readdir(dir)) != NULL)
> ! {
> ! if (strcmp(file->d_name, ".") != 0 &&
> strcmp(file->d_name, "..") != 0)
> ! numnames++;
> ! }
> ! 
> ! rewinddir(dir);
> ! 
> ! filenames = xmalloc((numnames + 2) * sizeof(char *));
> ! numnames = 0;
> ! 
> ! while ((file = readdir(dir)) != NULL)
> ! {
> ! if (strcmp(file->d_name, ".") != 0 &&
> strcmp(file->d_name, "..") != 0)
> ! filenames[numnames++] = xstrdup(file->d_name);
> ! }
> ! 
> ! filenames[numnames] = NULL;
> ! 
> ! closedir(dir);
> ! 
> ! /* now we have the names we can start removing things */
> ! 
> ! for (filename = filenames; *filename; filename++)
> ! {
> ! snprintf(filepath, MAXPGPATH, "%s/%s", path, *filename);
> ! 
> ! if (stat(filepath, &statbuf) != 0)
> ! return false;
> ! 
> ! if (S_ISDIR(statbuf.st_mode))
> ! {
> ! /* call ourselves recursively for a directory */
> ! if (!rmtree(filepath, true))
> ! return false;
> ! }
> ! else
> ! {
> ! if (init_unlink(filepath) !=

Re: [HACKERS] email browser?

2004-06-20 Thread John Hansen
On Fri, 2004-06-18 at 22:47, Santo Quartarone wrote:
> What's the safest email browser?
> 
> 

telnet mailserver.domain.tld 110

> 
> 
> ---(end of broadcast)---
> TIP 3: if posting/reading through Usenet, please send an appropriate
>   subscribe-nomail command to [EMAIL PROTECTED] so that your
>   message can get through to the mailing list cleanly

... John


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[HACKERS] Relay Access Denied

2004-07-05 Thread John Hansen
Seems mail.postgresql.org refuses mail for [EMAIL PROTECTED]
at present?

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Relay Access Denied

2004-07-05 Thread John Hansen
Disregard,

Was sent to postgresql.com

...
 

-Original Message-
From: John Hansen 
Sent: Monday, July 05, 2004 11:21 PM
To: [EMAIL PROTECTED]
Subject: Relay Access Denied

Seems mail.postgresql.org refuses mail for [EMAIL PROTECTED]
at present?

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


[HACKERS] UNICODE characters above 0x10000

2004-08-05 Thread John Hansen
I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.


Regards,

John


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
Attached, as promised, small patch removing the limitation, adding
correct utf8 validation.

Regards,

John

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of John Hansen
Sent: Friday, August 06, 2004 2:20 PM
To: 'Hackers'
Subject: [HACKERS] UNICODE characters above 0x1

I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.


Regards,

John


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html




wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
My apologies for not reading the code properly.

Attached patch using pg_utf_mblen() instead of an indexed table.
It now also do bounds checks.

Regards,

John Hansen

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 4:37 AM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x1 

"John Hansen" <[EMAIL PROTECTED]> writes:
> Attached, as promised, small patch removing the limitation, adding 
> correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string.  Also, doesn't pg_mblen already know the
length rules for UTF8?  Why are you duplicating that knowledge?

regards, tom lane




wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
Possibly, since I got it wrong once more
About to give up, but attached, Updated patch.


Regards,

John Hansen

-Original Message-
From: Oliver Elphick [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 3:56 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x1

On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
> Now it's entirely possible that the underlying support is a few bricks

> shy of a load --- for instance I see that pg_utf_mblen thinks there 
> are no UTF8 codes longer than 3 bytes whereas your code goes to 4.  
> I'm not an expert on this stuff, so I don't know what the UTF8 spec 
> actually says.  But I do think you are fixing the code at the wrong
level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode.  How many
of our supported platforms don't have these?  If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

-- 
Oliver Elphick  [EMAIL PROTECTED]
Isle of Wight  http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
 
 "Be still before the LORD and wait patiently for him;
  do not fret when men succeed in their ways, when they
  carry out their wicked schemes." 
Psalms 37:7 





wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
Ahh, but that's not the case. You cannot just delete the check, since
not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
appear in a byte sequence for instance.
UTF8 is more that two bytes btw, up to 6 bytes are used to represent an
UTF8 character.
The 5 and 6 byte characters are currently not in use tho.

I didn't actually notice the difference in UTF8 width between my
original patch and my last, so attached, updated patch.

Regards,

John Hansen

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 3:07 PM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x1 

"John Hansen" <[EMAIL PROTECTED]> writes:
> My apologies for not reading the code properly.

> Attached patch using pg_utf_mblen() instead of an indexed table.
> It now also do bounds checks.

I think you missed my point.  If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable).  What's really at stake here is whether anything else
breaks if we do that.  What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says.  But I do think you are fixing the code at the wrong level.

regards, tom lane




wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
This should do it.

Regards,

John Hansen 

-Original Message-
From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 5:02 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x1 

On Sat, 7 Aug 2004, Tom Lane wrote:

> question at hand is whether we can support 32-bit characters or not 
> --- and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple testing, 
make sure regexps work and then treat anything else that might not work, as bugs to be 
fixed later on when found.

The alternative is to inspect all code paths that involve strings, not fun at all :-)

My previous mail talked about utf-8 translation. Not all characters possible to form 
using utf-8 are assigned by the unicode org. However, the part that interprets the 
unicode strings are in the os so different os'es can give different results. So I 
think pg should just accept even 6 byte utf-8 sequences even if some characters are 
not currently assigned.

--
/Dennis Björklund





wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Yes, but the specification allows for 6byte sequences, or 32bit
characters.
As dennis pointed out, just because they're not used, doesn't mean we
should not allow them to be stored, since there might me someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

Regards,

John Hansen

-Original Message-
From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 8:09 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; John Hansen; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x1 

> Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> > ... This also means that the start byte can never start with 7 or 8 
> > ones, that is illegal and should be tested for and rejected. So the 
> > longest utf-8 sequence is 6 bytes (and the longest character needs 4

> > bytes (or 31 bits)).
> 
> Tatsuo would know more about this than me, but it looks from here like

> our coding was originally designed to support only 16-bit-wide 
> internal characters (ie, 16-bit pg_wchar datatype width).  I believe 
> that the regex library limitation here is gone, and that as far as 
> that library is concerned we could assume a 32-bit internal character 
> width.  The question at hand is whether we can support 32-bit 
> characters or not --- and if not, what's the next bug to fix?

pg_wchar has been already 32-bit datatype.  However I doubt there's
actually a need for 32-but width character sets. Even Unicode only uese
up 0x0010, so 24-bit should be enough...
--
Tatsuo Ishii



---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
4 actually,
10 needs four bytes:

0xxx 10xx 10xx 10xx
10 = 1010  

Fill in the blanks, starting from the bottom, you get:
 1010 1011 1011

Regards,

John Hansen 

-Original Message-
From: Christopher Kings-Lynne [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 8:47 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x1

> Now it's entirely possible that the underlying support is a few bricks

> shy of a load --- for instance I see that pg_utf_mblen thinks there 
> are no UTF8 codes longer than 3 bytes whereas your code goes to 4.  
> I'm not an expert on this stuff, so I don't know what the UTF8 spec 
> actually says.  But I do think you are fixing the code at the wrong
level.

Surely there are UTF-8 codes that are at least 3 bytes.  I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to like 4 bytes for some asian code points?

Chris




---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Well, maybe we'd be better off, compiling a list of (in?)valid ranges
from the full unicode database 
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
http://www.unicode.org/Public/UNIDATA/Unihan.txt)
and with every release of pg, update the detection logic so only valid
characters are allowed?

Regards,

John Hansen

-Original Message-
From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 8:46 PM
To: John Hansen
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x1 

> Yes, but the specification allows for 6byte sequences, or 32bit 
> characters.

UTF-8 is just an encoding specification, not character set
specification. Unicode only has 17 256x256 planes in its specification.

> As dennis pointed out, just because they're not used, doesn't mean we 
> should not allow them to be stored, since there might me someone using

> the high ranges for a private character set, which could very well be 
> included in the specification some day.

We should expand it to 64-bit since some day the specification might be
changed then:-)

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that once Unicode adovocates said that the
merit of Unicode was it only requires 16-bit width. Now they say they
need surrogate pairs and 32-bit width chars...

Anyway my point is if current specification of Unicode only allows
24-bit range, why we need to allow usage against the specification?
--
Tatsuo Ishii



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Yea,. I know

10 - 10 : 2 separate planes iirc

... John 

-Original Message-
From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 9:06 PM
To: John Hansen
Cc: Tatsuo Ishii; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x1 

On Sat, 7 Aug 2004, John Hansen wrote:

> should not allow them to be stored, since there might me someone using 
> the high ranges for a private character set, which could very well be 
> included in the specification some day.

There are areas reserved for private character sets.

--
/Dennis Björklund




---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Dennis Bjorklund
> Sent: Saturday, August 07, 2004 10:48 PM
> To: Takehiko Abe
> Cc: [EMAIL PROTECTED]
> Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x1
> 
> On Sat, 7 Aug 2004, Takehiko Abe wrote:
> 
> It looked like you sent the last mail only to me and not the 
> list. I assume it was a misstake and I send the reply to both.
> 
> > > Is there a specific reason you want to restrict it to 24 bits?
> > 
> > ISO 10646 is said to have removed its private use codepoints outside 
> > of the Unicode 0 - 10 range to ensure the compatibility with Unicode.
> > 
> > see Section C.2 and C.3 of Unicode 4.0 Appendix C 
> "Relationship to ISO
> > 10646": <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>.
> 
> The one and only reason for allowing 31 bit is that it's 
> defined by iso 10646. In practice there is probably no one 
> that uses the upper part of
> 10646 so not supporting it will most likely not hurt anyone.
>   
>   
> I'm happy either way so I will put my voice on letting PG use 
> unicode (not ISO 10646) and restrict it to 24 bits. By the 
> time someone wants (if ever) iso 10646 we probably have 
> support for different charsets and can easily handle both at 
> the same time.
> 

Point taken. 
Since we're supporting UTF8, and not ISO 10646.

Now, is it really 24 bits tho? 
Afaict, it's really 21 (0 - 10 or 0 - xxx1  )

This would require that we suport 4 byte sequences
(0100 1000 1011 1011 = 10)

> --
> /Dennis Björklund
> 
> 
> ---(end of 
> broadcast)---
> TIP 7: don't forget to increase your free space map settings
> 
> 


Regards,

John Hansen

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -Original Message-
> From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, August 07, 2004 11:23 PM
> To: John Hansen
> Cc: Takehiko Abe; [EMAIL PROTECTED]
> Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x1
> 
> On Sat, 7 Aug 2004, John Hansen wrote:
> 
> > Now, is it really 24 bits tho? 
> > Afaict, it's really 21 (0 - 10 or 0 - xxx1  
> )
> 
> Yes, up to 0x10 should be enough.
> 
> The 24 is not really important, this is all about what utf-8 
> strings to accept as input. The strings are stored as utf-8 
> strings and when processed inside pg it uses wchar_t that is 
> 32 bit (on some systems at least). By restricting the utf-8 
> input to unicode we can in the future store each character as 
> 3 bytes if we want.

Which brings us back to something like the attached...

> 
> --
> /Dennis Björklund
> 
> 
> 

Regards,

John Hansen


wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -Original Message-
> From: Tom Lane [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, August 08, 2004 2:43 AM
> To: Dennis Bjorklund
> Cc: Tatsuo Ishii; John Hansen; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]
> Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x1 
> 
> Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> > On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
> >> Anyway my point is if current specification of Unicode only allows 
> >> 24-bit range, why we need to allow usage against the specification?
> 
> > Is there a specific reason you want to restrict it to 24 bits?
> 
> I see several places that have to allocate space on the basis 
> of the maximum encoded character length possible in the 
> current encoding (look for uses of 
> pg_database_encoding_max_length).  Probably the only one 
> that's really significant for performance is text_substr(), 
> but that's enough to be an argument against setting maxmblen 
> higher than we have to.
> 
> It looks to me like supporting 4-byte UTF-8 characters would 
> be enough to handle the existing range of Unicode codepoints, 
> and that is probably as much as we want to do.
> 
> If I understood what I was reading, this would take several things:
> * Remove the "special UTF-8 check" in pg_verifymbstr;

I strongly disagree, this would mean one could store any sequence of
characters in the db, as long as the bytes are above 0x80. This would
not be valid utf8, and leave the data in an inconsistent state.
Setting the client encoding to unicode, implies that this is what we're
going to feed the database, and should guarantee, that what comes out of
a select is valid utf8. We can make sure of that, by doing the check
before it's inserted.

> * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte
case;

pg_utf_mblen should handle any case according to the specification.
Currently, it will return 3, even for 4,5, and 6 byte sequences. Those
places where pg_utf_mblen is called, we should check to make sure, that
the length is between 1 and 4 inclusive, and that the sequence is valid.
This is what I made the patch for.

> * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

That I have no problem with.

> Are there any other places that would have to change?  Would 
> this break anything?  The testing aspect is what's bothering 
> me at the moment.
> 
>   regards, tom lane
> 
> 

Just my $0.02 worth,

Kind Regards,

John Hansen

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -Original Message-
> From: Oliver Elphick [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, August 08, 2004 7:43 AM
> To: Tom Lane
> Cc: John Hansen; Hackers; Patches
> Subject: Re: [HACKERS] UNICODE characters above 0x1
> 
> On Sat, 2004-08-07 at 07:10, Tom Lane wrote:
> > Oliver Elphick <[EMAIL PROTECTED]> writes:
> > > glibc provides various routines (mb...) for handling Unicode.  How

> > > many of our supported platforms don't have these?
> > 
> > Every one that doesn't use glibc.  Don't bother proposing a
glibc-only 
> > solution (and that's from someone who works for a glibc-only
company; 
> > you don't even want to think about the push-back you'll get from
other 
> > quarters).
> 
> No. that's not what I was proposing.  My suggestion was to 
> use these routines if they are sufficiently widely 
> implemented, and our own routines where standard ones are not 
> available.
> 
> The man page for mblen says
> "CONFORMING TO
>ISO/ANSI C, UNIX98"
> 
> Is glibc really the only C library to conform?
> 
> If using the mb... routines isn't feasible, IBM's ICU library
> (http://oss.software.ibm.com/icu/) is available under the X 
> licence, which is compatible with BSD as far as I can see.  
> Besides character conversion, ICU can also do collation in 
> various locales and encodings. 
> My point is, we shouldn't be writing a new set of routines to 
> do half a job if there are already libraries available to do 
> all of it.
> 

This sounds like a brilliant move, if anything.

> -- 
> Oliver Elphick  
> [EMAIL PROTECTED]
> Isle of Wight  
> http://www.lfix.co.uk/oliver
> GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F 
> A543 10EA
>  
>      "Be still before the LORD and wait patiently for him;
>   do not fret when men succeed in their ways, when they
>   carry out their wicked schemes." 
> Psalms 37:7 
> 
> 
> 

Kind Regards,

John Hansen


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> Well, this is still working at the wrong level.  The code 
> that's in pg_verifymbstr is mainly intended to enforce the 
> *system wide* assumption that multibyte characters must have 
> the high bit set in every byte.  (We do not support encodings 
> without this property in the backend, because it breaks code 
> that looks for ASCII characters ... such as the main 
> parser/lexer ...)  It's not really intended to check that the 
> multibyte character is actually legal in its encoding.
> 

Ok, point taken.

> The "special UTF-8 check" was never more than a very 
> quick-n-dirty hack that was in the wrong place to start with. 
>  We ought to be getting rid of it not institutionalizing it.  
> If you want an exact encoding-specific check on the 
> legitimacy of a multibyte sequence, I think the right way to 
> do it is to add another function pointer to pg_wchar_table 
> entries to let each encoding have its own check routine.  
> Perhaps this could be defined so as to avoid a separate call 
> to pg_mblen inside the loop, and thereby not add any new 
> overhead.  I'm thinking about an API something like
> 
>   int validate_mbchar(const unsigned char *str, int len)
> 
> with result +N if a valid character N bytes long is present 
> at *str, and -N if an invalid character is present at *str 
> and it would be appropriate to display N bytes in the complaint.
> (N must be <= len in either case.)  This would reduce the 
> main loop of pg_verifymbstr to a call of this function and an 
> error-case-handling block.
> 

Sounds like a plan...

>   regards, tom lane
> 
> 

Regards,

John Hansen

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


  1   2   >