Re: [GENERAL] is there a way to deliver an array over column from a query window?

2013-04-26 Thread Rafał Pietrak

W dniu 04/26/2013 12:25 AM, Merlin Moncure pisze:

On Thu, Apr 25, 2013 at 1:30 PM, Rafał Pietrak  wrote:

[--]


No, I don't (the manual page at
http://www.postgresql.org/docs/9.1/static/tutorial-window.html has just
one occurrence of the word "distinct", and it's not in the above context). And I
cannot reproduce the above:

# select array_agg(distinct v order by v desc) from (select
generate_series(1,3) v union all select generate_series(1,3)) q;
ERROR:  syntax error at or near "order"
LINE 1: select array_agg(distinct v order by v desc) from (select ge...


Did I miss something??

This feature was added w/9.0.   This means you are on 8.4.  Time to upgrade...


Yes, true. I've been putting that off until Debian Wheezy gets released as
stable.


But what about the Postgres documentation? I haven't found that feature
in the 9.1 docs (link above). Where can I read about it?


-R



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Checking for changes in other tables

2013-04-26 Thread CR Lender
I have two tables with countries and persons living in those countries:

create table countries (
    code    char(2) not null primary key,
eu  boolean not null
);

insert into countries values
('AR', false),
('BE', true),
('CH', false),
('DE', true);

create table persons (
    name    text    not null primary key,
country char(2) not null references countries(code)
);

insert into persons (name, country) values
('Arthur',  'AR'),
('Betty',   'BE'),
('Charlie', 'CH'),
('Diane',   'DE');

Enter a third table for loans that can only be made between persons
living in EU countries:

create table eu_loans (
donor   textnot null references persons(name),
recipient   textnot null references persons(name),
primary key (donor, recipient)
);

insert into eu_loans (donor, recipient) values
('Diane', 'Betty');

I can add a trigger on eu_loans to check if Diane and Betty both live in
the EU. The problem is how to prevent one of them from moving to a
non-EU country (if they do, the loan has to be cancelled first). They
are however allowed to move to other EU countries.

At the moment, this is checked by the application, but not enforced by
the database. I could add more triggers to the persons table (and
another one on countries), but that doesn't "feel" right... countries
and persons are base data and shouldn't need to "know" about other
tables using their records. Ideally, eu_loans would have a check
constraint to verify that its contents remain valid.

Is there any way to ensure that all donors and recipients in eu_loans
are in the EU, without altering the countries and persons tables?

Thanks for any suggestions.

crl


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Checking for changes in other tables

2013-04-26 Thread Misa Simic
Two triggers?

One on eu_loans... and one on persons (if there is a valid eu_loan, the person can't move...)
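
Roughly, the second of those might look like this (a sketch only, using the
tables from the quoted message below; the trigger and function names are
made up):

create function persons_eu_guard() returns trigger language plpgsql as $$
begin
    if not (select eu from countries where code = new.country)
       and exists (select 1 from eu_loans
                   where donor = new.name or recipient = new.name) then
        raise exception 'person % has an EU loan and cannot move to %',
                        new.name, new.country;
    end if;
    return new;
end;
$$;

create trigger persons_eu_guard
    before update of country on persons
    for each row execute procedure persons_eu_guard();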


2013/4/26 CR Lender 

> I have two tables with countries and persons living in those countries:
>
> create table countries (
> code    char(2) not null primary key,
> eu  boolean not null
> );
>
> insert into countries values
> ('AR', false),
> ('BE', true),
> ('CH', false),
> ('DE', true);
>
> create table persons (
> name    text    not null primary key,
> country char(2) not null references countries(code)
> );
>
> insert into persons (name, country) values
> ('Arthur',  'AR'),
> ('Betty',   'BE'),
> ('Charlie', 'CH'),
> ('Diane',   'DE');
>
> Enter a third table for loans that can only be made between persons
> living in EU countries:
>
> create table eu_loans (
> donor   textnot null references persons(name),
> recipient   textnot null references persons(name),
> primary key (donor, recipient)
> );
>
> insert into eu_loans (donor, recipient) values
> ('Diane', 'Betty');
>
> I can add a trigger on eu_loans to check if Diane and Betty both live in
> the EU. The problem is how to prevent one of them from moving to a
> non-EU country (if they do, the loan has to be cancelled first). They
> are however allowed to move to other EU countries.
>
> At the moment, this is checked by the application, but not enforced by
> the database. I could add more triggers to the persons table (and
> another one on countries), but that doesn't "feel" right... countries
> and persons are base data and shouldn't need to "know" about other
> tables using their records. Ideally, eu_loans would have a check
> constraint to verify that its contents remain valid.
>
> Is there any way to ensure that all donors and recipients in eu_loans
> are in the EU, without altering the countries and persons tables?
>
> Thanks for any suggestions.
>
> crl
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Checking for changes in other tables

2013-04-26 Thread D'Arcy J.M. Cain
On Fri, 26 Apr 2013 11:01:28 +0200
CR Lender  wrote:
> I have two tables with countries and persons living in those
> countries:
> 
> create table countries (
> code    char(2) not null primary key,

Isn't this redundant?  Primary keys are always NOT NULL.

> Enter a third table for loans that can only be made between persons
> living in EU countries:
> 
> create table eu_loans (
> donor   textnot null references persons(name),
> recipient   textnot null references persons(name),
> primary key (donor, recipient)
> );

Side question - are you really limiting them to one loan each?  Can't a
donor have two active loans with the same recipient? 

> I can add a trigger on eu_loans to check if Diane and Betty both live
> in the EU. The problem is how to prevent one of them from moving to a
> non-EU country (if they do, the loan has to be cancelled first). They
> are however allowed to move to other EU countries.

Wouldn't two constraints, one for each of donor and recipient, do the
job?  Moving a person out of the EU would have the same effect as
deleting them.  The constraint would prevent it.

-- 
D'Arcy J.M. Cain  |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 788 2246 (DoD#0082)(eNTP)   |  what's for dinner.
IM: da...@vex.net, VOIP: sip:da...@vex.net


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Checking for changes in other tables

2013-04-26 Thread CR Lender
On 2013-04-26 12:17, D'Arcy J.M. Cain wrote:
> On Fri, 26 Apr 2013 11:01:28 +0200
> CR Lender  wrote:

>> create table countries (
>> code    char(2) not null primary key,
> 
> Isn't this redundant?  Primary keys are always NOT NULL.

Yes, I forgot to remove the NOT NULL when I adjusted the example.

> Side question - are you really limiting them to one loan each?  Can't a
> donor have two active loans with the same recipient? 

This is just a very reduced example structure (I wouldn't make a
person's first name the primary key, either :-). The actual case doesn't
even involve persons or loans, but it's far too complex to be used as an
example. It took weeks for me to understand how everything in that
database was (supposed to be) connected.

>> I can add a trigger on eu_loans to check if Diane and Betty both live
>> in the EU. The problem is how to prevent one of them from moving to a
>> non-EU country (if they do, the loan has to be cancelled first). They
>> are however allowed to move to other EU countries.
> 
> Wouldn't two constraints, one for each of donor and recipient, do the
> job?  Moving a person out of the EU would have the same effect as
> deleting them.  The constraint would prevent it.

I'm not sure I'm following... what would such a constraint look like?

Thanks,
crl


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] is there a way to deliver an array over column from a query window?

2013-04-26 Thread Merlin Moncure
On Fri, Apr 26, 2013 at 3:15 AM, Rafał Pietrak  wrote:
> W dniu 04/26/2013 12:25 AM, Merlin Moncure pisze:
>
>> On Thu, Apr 25, 2013 at 1:30 PM, Rafał Pietrak 
>> wrote:
>
> [--]
>
>>>
>>> No, I don't (manual:
>>> http://www.postgresql.org/docs/9.1/static/tutorial-window.html, have just
>>> one word "distinct" on that page, and it's not in the above context). And
>>> I
>>> cannot duplicate the above:
>>> 
>>> # select array_agg(distinct v order by v desc) from (select
>>> generate_series(1,3) v union all select generate_series(1,3)) q;
>>> ERROR:  syntax error at or near "order"
>>> LINE 1: select array_agg(distinct v order by v desc) from (select ge...
>>> 
>>>
>>> Did I miss something??
>>
>> This feature was added w/9.0.   This means you are on 8.4.  Time to
>> upgrade...
>
>
> Yes, true. I kind of postpone that until debian-wheeze gets released as
> stable.
>
> But what about the postgres documentation. I haven't found that feature in
> postgres-v9.1 docs (link above). Where can I read about it?

see here:
http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-AGGREGATES

(you were looking at the window function specific docs, not the
general aggregate documentation).

merlin


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Checking for changes in other tables

2013-04-26 Thread Richard Poole
On Fri, Apr 26, 2013 at 11:01:28AM +0200, CR Lender wrote:

> Is there any way to ensure that all donors and recipients in eu_loans
> are in the EU, without altering the countries and persons tables?

One way to do this would be to add countries to the eu_loans table so
it looks like this:

create table eu_loans (
    donor text not null,
    donor_country char(2) not null,
    recipient text not null,
    recipient_country char(2) not null,
    primary key (donor, recipient),
    foreign key (donor, donor_country)
        references persons (name, country) on update cascade,
    foreign key (recipient, recipient_country)
        references persons (name, country) on update cascade
);

then create an sql function to tell you whether a country is in the eu:

create function country_in_eu (char(2)) returns bool as $$
select count(*) > 0 from countries where code = $1 and eu = 't'
$$ language 'sql';

and add two constraints to eu_loans:

alter table eu_loans
    add constraint donor_in_eu check (country_in_eu(donor_country));
alter table eu_loans
    add constraint recipient_in_eu check (country_in_eu(recipient_country));

This will give an error if someone moves outside the EU (but not if a
country leaves the EU).

It may or may not seem elegant depending on your thinking but it does
have the effect you're looking for. Of course you could set things up
so that you could do an insert to eu_loans specifying just the donor
and recipient names and the system would populate the country fields
for you by looking up in persons, throwing an error if appropriate.
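
For example, such a trigger might look roughly like this (a sketch only,
assuming the extended eu_loans definition above; the trigger and function
names are made up):

create function eu_loans_fill_countries() returns trigger language plpgsql as $$
begin
    select country into new.donor_country
      from persons where name = new.donor;
    select country into new.recipient_country
      from persons where name = new.recipient;
    -- a missing person leaves the column NULL and fails the NOT NULL constraint
    return new;
end;
$$;

create trigger eu_loans_fill_countries
    before insert on eu_loans
    for each row execute procedure eu_loans_fill_countries();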

Richard


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] UPDATE using 3 medium-sized tables causes runaway hash table and fills disk

2013-04-26 Thread Rowan Collins

Hi All,

I've come upon some very strange behaviour with an UPDATE query which 
causes Postgres to consume all the disk space on the server for no 
apparent reason.


Basically, I'm trying to run an UPDATE involving three medium-sized 
tables (~60k rows each), which generates a query plan with three Hash 
Joins. But when I try to execute it, Postgres appears to go into some 
kind of loop, gradually filling up the disk partition. After a long wait 
it responds with "ERROR: could not write to hash-join temporary file: No 
space left on device; SQL state: 53100"; the partition in question has 
9.5GB free at the beginning of the query - that's a lot of hash file!


If I run "ANALYZE temp_fares_mappings;" - the table which is being 
Updated, and is the outermost in the query plan - the problem goes away 
*even though the Query Plan hasn't changed*.


I'm not entirely sure how to simplify the query and still reproduce the 
issue, but I've produced an anonymized data sample and SQL query at 
http://rwec.co.uk/x/break_postgres.zip In this case, "Analyze" *does* 
change the query plan (as well as fixing the issue), but I can 
consistently reproduce the disk-filling behaviour using this sample on 
Postgres 9.0 and 9.2.


Note that it doesn't appear to matter if one or all of the tables are 
permanent, as I have been using various combinations for debugging, and 
always get the same behaviour. Trying to write a corresponding SELECT 
statement doesn't seem to generate the same issue, at least with the 
queries I've tried.


- The plan for the real query is here: http://explain.depesz.com/s/WTP
- Sample data and SQL query: http://rwec.co.uk/x/break_postgres.zip
- Query plan for sample data, without running Analyze, is here: 
http://explain.depesz.com/s/qsH
- Postgres version strings: "PostgreSQL 9.0.5 on x86_64-pc-linux-gnu, 
compiled by GCC x86_64-pc-linux-gnu-gcc (Gentoo 4.5.2 p1.1, pie-0.4.5) 
4.5.2, 64-bit" and "PostgreSQL 9.2.1 on x86_64-pc-linux-gnu, compiled by 
x86_64-pc-linux-gnu-gcc (Gentoo Hardened 4.5.4 p1.0, pie-0.4.7) 4.5.4, 
64-bit"


Does anyone have any idea what's going on here, and whether it is in 
fact a bug? It certainly feels like one...

Thanks for your time.
--
Rowan Collins
[IMSoP]


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] is there a way to deliver an array over column from a query window?

2013-04-26 Thread Rafał Pietrak

W dniu 04/26/2013 12:25 AM, Merlin Moncure pisze:
[--]


select array_agg(v order by v desc) from generate_series(1,3) v;

also, 'distinct'
select array_agg(distinct v order by v desc) from (select
generate_series(1,3) v union all select generate_series(1,3)) q;


[]

This feature was added w/9.0.   This means you are on 8.4.  Time to upgrade...


I tried it, and it looks like I can have just one column "processed" by
the construct "distinct ... order ..." written inside array_agg(). So it
appears that its sole purpose is to yield an array with distinct
values, should there be duplicates in the source (table or query).


Having distinct values is fine, but I actually need the resulting
array to contain values in a sort order that is defined by the values
in another column, like:


array_agg(distinct v order by v)     -- works in postgres, but actually I need:
array_agg(distinct v order by v,x)   -- which doesn't (ERROR:
expressions must appear in argument list), and the suggestion implied by
that error is not possible here, because:
array_agg(distinct v,x order by v,x) -- is actually a two-argument
array_agg() function call.


Elaborating on the second example above, what I need is: feed
array_agg() only distinct (deduplicated) v-values, but feed them
sorted according to the x-values.
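
In other words, the effect I'm after is something like this (just a
sketch, with a made-up source table holding (v, x) pairs):

select array_agg(v order by x)
from (
    select distinct on (v) v, x
    from contacts          -- made-up table
    order by v, x          -- keeps, for each v, the row with the smallest x
) s;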


I'd appreciate other ideas (yesterday's plan to avoid arrays until late
in the query failed). Any suggestions welcome. (Or maybe I'm just plain
wrong about the above "distinct" construct - in which case I'd appreciate
a pointer to some documentation.)


thnx,

-R


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Table containing only valid table names

2013-04-26 Thread Michael Graham

Hi all,

I'm trying to create a table that contains only valid table names.  I'm 
currently using regclass for the field type, which works fine when you 
are adding to the table (i.e. you can't add invalid table names to the 
table).  But it is still possible to leave the table in an invalid state 
after doing a drop table.


So for example:

=> CREATE TABLE table_list (tablename regclass);
CREATE TABLE
=> INSERT INTO table_list VALUES ('foo');
ERROR:  relation "foo" does not exist
LINE 1: INSERT INTO table_list VALUES ('foo');
   ^
=> CREATE TABLE foo (a int);
CREATE TABLE
=> INSERT INTO table_list VALUES ('foo');
INSERT 0 1
=> DROP TABLE foo;
DROP TABLE
=> SELECT * FROM table_list;
 tablename
---
 122860
(1 row)

Does anyone have any ideas that could be used to stop this from 
happening?  I'm not really in a position to have different users for 
the modification of table_list and the drops, so I don't think I can 
use different roles.


I'm pretty sure I can't do what I need as postgres doesn't support 
triggers on DDL but maybe I'm wrong.


Cheers,
--
Michael Graham



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Checking for changes in other tables

2013-04-26 Thread Richard Huxton

On 26/04/13 10:01, CR Lender wrote:

I can add a trigger on eu_loans to check if Diane and Betty both live in
the EU. The problem is how to prevent one of them from moving to a
non-EU country (if they do, the loan has to be cancelled first). They
are however allowed to move to other EU countries.

At the moment, this is checked by the application, but not enforced by
the database. I could add more triggers to the persons table (and
another one on countries), but that doesn't "feel" right... countries
and persons are base data and shouldn't need to "know" about other
tables using their records.


I think this is more a problem of terminology than of your current 
triggers. Triggers aren't really "part" of a table, but they are 
observing it, so it's a sensible place to list them when viewing a 
table definition in psql. There's no reason the trigger function even 
has to be in the same schema as the targeted table.


How would it feel if the syntax was more like the following?

CREATE TRIGGER ... OBSERVING UPDATES ON persons ...

or even

PUBLISH UPDATE,INSERT,DELETE ON persons AS person_changes;
SUBSCRIBE TO person_changes CALLING PROCEDURE ...;

A different "feel", but no difference in behaviour.

--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table containing only valid table names

2013-04-26 Thread Richard Huxton

On 26/04/13 16:09, Michael Graham wrote:

I'm pretty sure I can't do what I need as postgres doesn't support
triggers on DDL but maybe I'm wrong.


If you're still in development and not live, it'll be worth checking out 9.3

http://www.postgresql.org/docs/devel/static/event-triggers.html
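
A minimal sketch of what that might look like on 9.3, assuming the sql_drop
event and the pg_event_trigger_dropped_objects() function described there,
and the table_list table from the original message:

create function table_list_drop_guard() returns event_trigger
language plpgsql as $$
declare
    obj record;
begin
    for obj in select * from pg_event_trigger_dropped_objects() loop
        -- remove the stale entry; alternatively RAISE EXCEPTION to forbid the DROP
        delete from table_list where tablename::oid = obj.objid;
    end loop;
end;
$$;

create event trigger table_list_drop_guard
    on sql_drop execute procedure table_list_drop_guard();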


--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] apt.postgresql.org broken dependency?

2013-04-26 Thread Richard Huxton

On 25/04/13 18:01, Martín Marqués wrote:

Just tried upgrading and added the apt.postgresql.org repo to my
Debian server (on testing now), and I got some packages like barman
held back because some dependencies couldn't be satisfied.


The following packages have unmet dependencies:
  barman : Depends: python (< 2.7) but 2.7.3-4 is to be installed
           Depends: python-argcomplete but it is not going to be installed



Since when is 2.7.3 not larger than 2.7?


Is that not complaining that it *wants* a version of python < 2.7, and 
you have a larger one?


--
  Richard Huxton
  Archonet Ltd


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] is there a way to deliver an array over column from a query window?

2013-04-26 Thread Tom Lane
=?ISO-8859-2?Q?Rafa=B3_Pietrak?=  writes:
> array_agg(distinct v order by v) -- works in postgres, but actually I need:
> array_agg(distinct v order by v,x) -- which doesn't. (ERROR: 
> expressions must appear in argument list),

Why do you think you need that?  AFAICS, the extra order-by column could
not in any way affect the result of the operation.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table containing only valid table names

2013-04-26 Thread salah jubeh
Hello,

You can get the list of table names from pg_class, e.g.:

SELECT c.relname AS "Name"
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r') AND n.nspname NOT IN ('pg_toast', 'pg_catalog', 
'information_schema');

So, if you want to store all table names you can just create a view around the  
above query.

Also, the following is not a perfect solution, but I think it will work: you 
can join your table with pg_class to filter out tables which have been dropped.
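
For example (just a sketch):

select t.tablename
from table_list t
join pg_catalog.pg_class c on c.oid = t.tablename::oid
where c.relkind = 'r';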


Regards 




 From: Richard Huxton 
To: Michael Graham  
Cc: "pgsql-general@postgresql.org"  
Sent: Friday, April 26, 2013 5:21 PM
Subject: Re: [GENERAL] Table containing only valid table names
 

On 26/04/13 16:09, Michael Graham wrote:
> I'm pretty sure I can't do what I need as postgres doesn't support
> triggers on DDL but maybe I'm wrong.

If you're still in development and not live, it'll be worth checking out 9.3

http://www.postgresql.org/docs/devel/static/event-triggers.html


-- 
   Richard Huxton
   Archonet Ltd


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[GENERAL] NOTIFY channel

2013-04-26 Thread Perry Smith
Hi,

I'm curious what the design assumptions of the LISTEN / NOTIFY are.  For 
example, would it work to have a thousand LISTEN channels open at the same time 
or would that be abusing the assumptions that were made?  In short, I'm trying 
to understand if I need a few central LISTEN / NOTIFY channels that then 
distribute the events out or not.

Thank you for your time,
Perry



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] NOTIFY channel

2013-04-26 Thread Merlin Moncure
On Fri, Apr 26, 2013 at 11:30 AM, Perry Smith  wrote:
> Hi,
>
> I'm curious what the design assumptions of the LISTEN / NOTIFY are.  For 
> example, would it work to have a thousand LISTEN channels open at the same 
> time or would that be abusing the assumptions that were made?  In short, I'm 
> trying to understand if I need a few central LISTEN / NOTIFY channels that 
> then distribute the events out or not.

Couple quick points about notify:
1) In somewhat recent postgres (9.0) notify implementation was
significantly optimized and expanded to include a payload.

2) notify does not work with pgbouncer (yet), although I have in the
past had a private pgbouncer with functioning notify.

3) notifies are not delivered until end of transaction, which is
sometimes useful, sometimes annoying

4) delivery of notifications is 'best effort', but that effort is
pretty much always successful.

5) listeners in the range of 1000s are ok.  but don't be afraid to
think about using the payload and/or putting specific instructions into a
table that the client listener acts on after receiving a notify.
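
A sketch of the kind of table-driven pattern point 5 describes (table and
channel names are made up):

create table work_queue (
    id      bigserial primary key,
    payload text not null,
    done    boolean not null default false
);

-- producer:
insert into work_queue (payload) values ('rebuild report 42');
select pg_notify('work_items', 'new work');

-- consumer, after LISTEN work_items, on each notification received:
update work_queue set done = true
where not done
returning id, payload;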

merlin


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] is there a way to deliver an array over column from a query window?

2013-04-26 Thread Rafał Pietrak

W dniu 04/26/2013 05:25 PM, Tom Lane pisze:

=?ISO-8859-2?Q?Rafa=B3_Pietrak?=  writes:

array_agg(distinct v order by v) -- works in postgres, but actually I need:
array_agg(distinct v order by v,x) -- which doesn't. (ERROR:
expressions must appear in argument list),

Why do you think you need that?  AFAICS, the extra order-by column could
not in any way affect the result of the operation.


In my particular case (not in general, since I assume we all agree 
that people do sort things coming out of a query for one purpose or 
another), the situation is this:

1. the information I retrieve (the V) is a telephone number.
2. my database keeps numerous contact details (e.g. telephone 
numbers, email, etc.) for the "entities" registered here - e.g. 
people/companies leave contact information of various relevance: 
my-private, my-office, my-lawyer, etc.
3. when I need to get in touch with somebody, I need to choose the 
number that is "most relevant" - one person leaves a "my-private" phone 
and a "my-lawyer" phone; the other leaves "my-office" and "my-lawyer".
4. in the above example I'd like to pick "my-private" for the first 
person, and "my-office" for the other. I wouldn't like to rely on 
the randomness provided by the database query plan.
5. so I have "the other" column (the X, e.g. "my-something") that I'd 
like to sort the array elements by, and pick just the first element of 
the array.


BTW: I've just gotten rid of the array and cooked up a plain table join 
with "distinct on ()", which gives just what I needed. My initial plan of 
using an array was to reduce the intermediate row-sets as much and as 
early as possible. Yet, in this case, plain old RDB joins proved to 
be better (maybe not faster - a big multi-table join is formed along the 
query - but conceptually cleaner, which works for me; the database isn't 
terribly big).


So I have my problem solved, although I haven't figured out a way to 
control the sort order of the array_agg() result - which might otherwise 
be useful.


thnx,

-R



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] NOTIFY channel

2013-04-26 Thread Tom Lane
Perry Smith  writes:
> I'm curious what the design assumptions of the LISTEN / NOTIFY are.  For 
> example, would it work to have a thousand LISTEN channels open at the same 
> time or would that be abusing the assumptions that were made?  In short, I'm 
> trying to understand if I need a few central LISTEN / NOTIFY channels that 
> then distribute the events out or not.

In the current implementation (since 9.0 IIRC), I don't believe the
total number of channels in use affects the transport mechanism at all.
However, the number that a particular backend is listening on would
affect how fast it can process the messages --- we just run through the
local LISTEN-names list serially for a match to see if an incoming
message is interesting or not.  So I'd try to avoid having individual
sessions listening to very large numbers of channels.  OTOH, the serial
list scan might still be faster than anything you could easily do in SQL
or plpgsql to sort through messages you were trying to route manually.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find current row number relative to window frame

2013-04-26 Thread Tom Lane
Art Ruszkowski  writes:
> I have a user defined aggregate function and in Sfunc I need to
> reference current row number relative to window frame. Ideallly I would like 
> to have following construct:

> select my_aggregate(x,current_row_number_relative_to window) over
> (order by y rows between n preceding and current row) as ... from ...

Couldn't you make the aggregate track the current row number as part of
its state?
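
For instance, a sketch with made-up names (this one weights each input by
its position within the frame, just to show the row counter in the state):

create type agg_state as (rownum int, acc numeric);

create function my_sfunc(agg_state, numeric) returns agg_state
language sql as $$
    select (($1).rownum + 1,
            ($1).acc + $2 * (($1).rownum + 1))::agg_state
$$;

create function my_ffunc(agg_state) returns numeric
language sql as $$ select ($1).acc $$;

create aggregate my_aggregate(numeric) (
    sfunc     = my_sfunc,
    stype     = agg_state,
    finalfunc = my_ffunc,
    initcond  = '(0,0)'
);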

While it's a bit annoying to have to do that, I don't see any other
solution that doesn't involve converting the aggregate into a
full-fledged window function, which you probably don't want to do.
(Even then, nodeWindowAgg.c doesn't seem to expose a function for
row number within frame, only row number within partition.)

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Where to set search_path

2013-04-26 Thread Jeff Janes
I have an application which uses the public schema for all of its
relations.

I want to change the name of the schema (in preparation to merging the
database into that of another application, which also uses the public
schema for its tables).

I could qualify all of the tables and sequence with the schema names in the
source code SQL, but that is a lot of tedious work and I don't think it
even counts as an improvement.  I think I'd rather use search_path in the
application being migrated, and only use fully qualified names for the
(small) part of the code that will explicitly need to bridge both systems.

Once I rename the schema ("alter schema public rename to new_schema;") I
can set the search path either on the db side, or on the app side.  That
is, by "alter role web_user set search_path TO new_schema", or by changing
the centralized subroutine used by the application to get a database
handle, something like this:

sub getdbh {
  require DBI;
  my $dbh = DBI->connect(..., {AutoCommit=>1, RaiseError=>1,
PrintError=>0});
  $dbh->do("set search_path to new_schema");
  return $dbh;
};

Is there a reason to choose one of these options over the other?  Or is it
purely a matter of taste?

I'm leaning towards the latter method, because it seems the future
application maintainer is more likely to benefit from the clue about the
search_path than the future DBA (assuming those roles get split).

Thanks,

Jeff


Re: [GENERAL] NOTIFY channel

2013-04-26 Thread Misa Simic
"
2) notify does not work with pgbouncer (yet), although I have in the
past had a private pgbouncer with functioning notify.
"

Is there a plan to make it work?

Actually - stupid question - probably not important... But good to know...

Listen connection doesn't  need to work via pgbouncer...

Thanks,

Misa

On Friday, April 26, 2013, Merlin Moncure wrote:

> On Fri, Apr 26, 2013 at 11:30 AM, Perry Smith 
> >
> wrote:
> > Hi,
> >
> > I'm curious what the design assumptions of the LISTEN / NOTIFY are.  For
> example, would it work to have a thousand LISTEN channels open at the same
> time or would that be abusing the assumptions that were made?  In short,
> I'm trying to understand if I need a few central LISTEN / NOTIFY channels
> that then distribute the events out or not.
>
> Couple quick points about notify:
> 1) In somewhat recent postgres (9.0) notify implementation was
> significantly optimized and expanded to include a payload.
>
> 2) notify does not work with pgbouncer (yet), although I have in the
> past had a private pgbouncer with functioning notify.
>
> 3) notifies are not delivered until end of transaction, which is
> sometimes useful, sometimes annoying
>
> 4) delivery of notifications is 'best effort', but that effort is
> pretty much always successful.
>
> 5) listener's in the range of 1000's is ok.  but don't be afraid to
> think about using payload and/or putting specific instructions into a
> table that the client listener acts on after receiving notify.
>
> merlin
>
>
> --
> Sent via pgsql-general mailing list 
> (pgsql-general@postgresql.org
> )
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] is there a way to deliver an array over column from a query window?

2013-04-26 Thread Misa Simic
SELECT DISTINCT a, b, c, array_agg(d) OVER (PARTITION BY c)
FROM (
    SELECT a, b, c, d
    FROM testy
    WHERE e <> 'email' AND c = '1035049'
    ORDER BY a, b, c, e
) t

Doesn't that give you the desired result?


On Friday, April 26, 2013, Rafał Pietrak wrote:

> W dniu 04/26/2013 05:25 PM, Tom Lane pisze:
>
>> =?ISO-8859-2?Q?Rafa=B3_Pietrak?=  writes:
>>
>>> array_agg(distinct v order by v) -- works in postgres, but actually I
>>> need:
>>> array_agg(distinct v order by v,x) -- which doesn't. (ERROR:
>>> expressions must appear in argument list),
>>>
>> Why do you think you need that?  AFAICS, the extra order-by column could
>> not in any way affect the result of the operation.
>>
>
> In my particular case (e.g. not in general, since I assume, we all agree,
> that people do sort things comming out of the query for one purpose or
> another), is that:
> 1. the information i retrieve (the V), is a telephone number.
> 2. my database does keep numerous contact information (e.g. telephone
> numbers, email, etc) for "entities" registered here - e.g people/companies
> leave contact information of various relevance: my-private, my-office,
> my-lawyer, etc.
> 3. when I need to get in touch with somebody, I need to choose the number
> that is "most relevant" - one person leaves "my-private" phone, and
> "my-lawyer"  phone; the other leaves "my-office", and "my-lawyer".
> 4. in the above example I'd like to peek: "my-private" for the first
> person, and "my-office" for the other. I wouldn't like to relay on
> randomness provided by the database query plan.
> 5. so I have "the other" column (the X, e.g "my-something"), that I'd like
> to sort the array elements by. And peek just the first element of the array.
>
> BTW: I've just rid off the array, and cooked a plain table join with
> "distinct on ()", which gives just what I needed. My initial plan of using
> array was to reduce the intermediate row-sets as much as possible as early
> as possible. Yet, in this case, plain old RDB joins proved to be better
> (may be not faster - a big multitable join is formed along the query, but
> conceptually cleaner, which works for me, the database isn't terribly big).
>
> So I have my problem solved, although I haven't figured out a way to have
> controll over the sort order of array_agg() result - which might be
> otherwise usefull.
>
> thnx,
>
> -R
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Where to set search_path

2013-04-26 Thread David Johnston
Jeff Janes wrote
> I've have an application which uses the public schema for all of its
> relations.
> 
> I want to change the name of the schema (in preparation to merging the
> database into that of another application, which also uses the public
> schema for its tables).
> 
> I could qualify all of the tables and sequence with the schema names in
> the
> source code SQL, but that is a lot of tedious work and I don't think it
> even counts as an improvement.  I think I'd rather use search_path in the
> application being migrated, and only use fully qualified names for the
> (small) part of the code that will explicitly need to bridge both systems.
> 
> Once I rename the schema ("alter schema public rename to new_schema;") I
> can set the search path either on the db side, or on the app side.  That
> is, by "alter role web_user set search_path TO new_schema", or by changing
> the centralized subroutine used by the application to get a database
> handle, something like this:
> 
> sub getdbh {
>   require DBI;
>   my $dbh = DBI->connect(..., {AutoCommit=>1, RaiseError=>1,
> PrintError=>0});
>   $dbh->do("set search_path to new_schema");
>   return $dbh;
> };
> 
> Is there a reason to choose one of these options over the other?  Or is it
> purely a matter of taste?
> 
> I'm leaning towards the latter method, because it seems the future
> application maintainer is more likely to benefit from the clue about the
> search_path than the future DBA (assuming those roles get split).
> 
> Thanks,
> 
> Jeff

I don't really have any support for my thinking, but I'd much rather put this
kind of setup on the database user and not the application.  At worst you
MAY want to validate what the current search_path is at the client level. 
My thinking would be along the lines of: most users should not care what
schemas they are using and so should be able to simply log on and have things
work.  The users who do care about bridging between the two worlds should
have the search_path default to something invalid and in doing so force
those users to be explicit regarding which schemas they intend to work with.
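
For example (a sketch; the role names are made up, and an empty search_path
is one way to force schema-qualified names):

alter role web_user    set search_path to new_schema;
alter role bridge_user set search_path to '';  -- forces schema-qualified names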

My $0.02

David J




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Where-to-set-search-path-tp5753452p5753462.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] is there a way to deliver an array over column from a query window?

2013-04-26 Thread Rafał Pietrak

W dniu 04/26/2013 09:54 PM, Misa Simic pisze:

SELECT DISTINCT a, b, c, array_agg(d)  OVER (PARTITION BY c )  FROM

(

SELECT a, b, c, d FROM  testy where e <> 'email' and c='1035049' 
 ORDER BY  a, b, c, e


) t

Doesnt give u desired result?


Hmm... actually, it looks like it does. I wouldn't have thought that the 
sort order is maintained from the subquery, but if it is, this is just it.


It looks like I've simply overdone the solution.

-R





On Friday, April 26, 2013, Rafał Pietrak wrote:

W dniu 04/26/2013 05:25 PM, Tom Lane pisze:

=?ISO-8859-2?Q?Rafa=B3_Pietrak?=  writes:

array_agg(distinct v order by v) -- works in postgres, but
actually I need:
array_agg(distinct v order by v,x) -- which doesn't. (ERROR:
expressions must appear in argument list),

Why do you think you need that?  AFAICS, the extra order-by
column could
not in any way affect the result of the operation.


In my particular case (e.g. not in general, since I assume, we all
agree, that people do sort things comming out of the query for one
purpose or another), is that:
1. the information i retrieve (the V), is a telephone number.
2. my database does keep numerous contact information (e.g.
telephone numbers, email, etc) for "entities" registered here -
e.g people/companies leave contact information of various
relevance: my-private, my-office, my-lawyer, etc.
3. when I need to get in touch with somebody, I need to choose the
number that is "most relevant" - one person leaves "my-private"
phone, and "my-lawyer"  phone; the other leaves "my-office", and
"my-lawyer".
4. in the above example I'd like to peek: "my-private" for the
first person, and "my-office" for the other. I wouldn't like to
relay on randomness provided by the database query plan.
5. so I have "the other" column (the X, e.g "my-something"), that
I'd like to sort the array elements by. And peek just the first
element of the array.

BTW: I've just rid off the array, and cooked a plain table join
with "distinct on ()", which gives just what I needed. My initial
plan of using array was to reduce the intermediate row-sets as
much as possible as early as possible. Yet, in this case, plain
old RDB joins proved to be better (may be not faster - a big
multitable join is formed along the query, but conceptually
cleaner, which works for me, the database isn't terribly big).

So I have my problem solved, although I haven't figured out a way
to have controll over the sort order of array_agg() result - which
might be otherwise usefull.

thnx,

-R



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)

To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general





[GENERAL] Optimizing bulk update performance

2013-04-26 Thread Yang Zhang
It currently takes up to 24h for us to run a large set of UPDATE
statements on a database, which are of the form:

UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
id = constid

(We're just overwriting fields of objects identified by ID.)

The tables have handfuls of indices each and no foreign key constraints.
No COMMIT is made till the end.

It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
baseline we should reasonably target.

Short of producing a custom program that somehow reconstructs a dataset
for Postgresql to re-import, is there anything we can do to bring the
bulk UPDATE performance closer to that of the import?  (This is an area
that we believe log-structured merge trees handle well, but we're
wondering if there's anything we can do within Postgresql.)

Some ideas:

- dropping all non-ID indices and rebuilding afterward?
- increasing checkpoint_segments, but does this actually help sustained
  long-term throughput?
- using the techniques mentioned here?  (Load new data as table, then
  "merge in" old data where ID is not found in new data)
  


Basically there's a bunch of things to try and we're not sure what the
most effective are or if we're overlooking other things.  We'll be
spending the next few days experimenting, but we thought we'd ask here
as well.

Thanks.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Gavin Flower

On 27/04/13 12:14, Yang Zhang wrote:

It currently takes up to 24h for us to run a large set of UPDATE
statements on a database, which are of the form:

 UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
 id = constid

(We're just overwriting fields of objects identified by ID.)

The tables have handfuls of indices each and no foreign key constraints.
No COMMIT is made till the end.

It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
baseline we should reasonably target.

Short of producing a custom program that somehow reconstructs a dataset
for Postgresql to re-import, is there anything we can do to bring the
bulk UPDATE performance closer to that of the import?  (This is an area
that we believe log-structured merge trees handle well, but we're
wondering if there's anything we can do within Postgresql.)

Some ideas:

- dropping all non-ID indices and rebuilding afterward?
- increasing checkpoint_segments, but does this actually help sustained
   long-term throughput?
- using the techniques mentioned here?  (Load new data as table, then
   "merge in" old data where ID is not found in new data)
   


Basically there's a bunch of things to try and we're not sure what the
most effective are or if we're overlooking other things.  We'll be
spending the next few days experimenting, but we thought we'd ask here
as well.

Thanks.


People will need to know your version of Postgres and which operating 
system etc., plus details of CPU, RAM, and disks... as well as what 
changes you have made to postgresql.conf...


I would be inclined to DROP all indexes and reCREATE them later.

Updating a row might lead to a new row being added in a new disk page, so 
I suspect that updates will hit every index associated with the table 
(with the possible exception of partial indexes).


Running too many updates in one transaction may mean that Postgres may 
need to use disk work files.


Depending on RAM etc., it may pay to increase some variables that affect 
how Postgres uses RAM; some of these are per session.
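
A sketch of the drop/recreate approach (table and index names are made up):

drop index mytable_field1_idx;
drop index mytable_field2_idx;
-- ... run the bulk UPDATE (possibly in batches) ...
create index mytable_field1_idx on mytable (field1);
create index mytable_field2_idx on mytable (field2);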



Cheers,
Gavin


Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Yang Zhang
We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
m1.xlarge instances, which have:

15 GiB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
64-bit platform

(Yes, we're moving to EBS Optimized instances + Provisioned IOPS
volumes, but preliminary benchmarks suggest this won't get us as much of
a boost as refactoring the way we're executing these bulk updates in our
application possibly would.)

On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
 wrote:
> On 27/04/13 12:14, Yang Zhang wrote:
>
> It currently takes up to 24h for us to run a large set of UPDATE
> statements on a database, which are of the form:
>
> UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
> id = constid
>
> (We're just overwriting fields of objects identified by ID.)
>
> The tables have handfuls of indices each and no foreign key constraints.
> No COMMIT is made till the end.
>
> It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
> baseline we should reasonably target.
>
> Short of producing a custom program that somehow reconstructs a dataset
> for Postgresql to re-import, is there anything we can do to bring the
> bulk UPDATE performance closer to that of the import?  (This is an area
> that we believe log-structured merge trees handle well, but we're
> wondering if there's anything we can do within Postgresql.)
>
> Some ideas:
>
> - dropping all non-ID indices and rebuilding afterward?
> - increasing checkpoint_segments, but does this actually help sustained
>   long-term throughput?
> - using the techniques mentioned here?  (Load new data as table, then
>   "merge in" old data where ID is not found in new data)
>
> 
>
> Basically there's a bunch of things to try and we're not sure what the
> most effective are or if we're overlooking other things.  We'll be
> spending the next few days experimenting, but we thought we'd ask here
> as well.
>
> Thanks.
>
>
> People will need to know your version of Postgres & which Operating System
> etc. plus details of CPU RAM, and Disks... AS well as what changes you have
> made to postgresql.conf...
>
> I would be inclined to DROP all indexes and reCREATE them later.
>
> Updating a row might lead to new row being added in a new disk page, so I
> suspect that updates will hit every index associated with the table with the
> (possible exception of partial indexes).
>
> Running too many updates in one transaction, may mean that Postgres may need
> to use disk work files.
>
> Depending on RAM etc, it may pay to increase some variables tat affect how
> Postgres uses RAM, some of these are per session.
>
>
> Cheers,
> Gavin



--
Yang Zhang
http://yz.mit.edu/


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Basic question on recovery and disk snapshotting

2013-04-26 Thread Yang Zhang
We're running on EBS volumes on EC2.  We're interested in leveraging
EBS snapshotting for backups.  However, does this mean we'd need to
ensure our pg_xlog is on the same EBS volume as our data?

(I believe) the usual reasoning for separating pg_xlog onto a separate
volume is for performance.  However, if they are on different volumes,
the snapshots may be out of sync.

Thanks.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Gavin Flower
Please do not top-post; the convention on these lists is to add replies at 
the end, apart from comments interspersed to make use of appropriate 
context!


On 27/04/13 13:35, Yang Zhang wrote:

We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
m1.xlarge instances, which have:

15 GiB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
64-bit platform

(Yes, we're moving to EBS Optimized instances + Provisioned IOPS
volumes, but prelim. benchmarks suggest this won't get us enough of a
boost as much as possibly refactoring the way we're executing these
bulk updates in our application.)

On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
 wrote:

On 27/04/13 12:14, Yang Zhang wrote:

It currently takes up to 24h for us to run a large set of UPDATE
statements on a database, which are of the form:

 UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
 id = constid

(We're just overwriting fields of objects identified by ID.)

The tables have handfuls of indices each and no foreign key constraints.
No COMMIT is made till the end.

It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
baseline we should reasonably target.

Short of producing a custom program that somehow reconstructs a dataset
for Postgresql to re-import, is there anything we can do to bring the
bulk UPDATE performance closer to that of the import?  (This is an area
that we believe log-structured merge trees handle well, but we're
wondering if there's anything we can do within Postgresql.)

Some ideas:

- dropping all non-ID indices and rebuilding afterward?
- increasing checkpoint_segments, but does this actually help sustained
   long-term throughput?
- using the techniques mentioned here?  (Load new data as table, then
   "merge in" old data where ID is not found in new data)



Basically there's a bunch of things to try and we're not sure what the
most effective are or if we're overlooking other things.  We'll be
spending the next few days experimenting, but we thought we'd ask here
as well.

Thanks.


People will need to know your version of Postgres & which Operating System
etc. plus details of CPU RAM, and Disks... AS well as what changes you have
made to postgresql.conf...

I would be inclined to DROP all indexes and reCREATE them later.

Updating a row might lead to new row being added in a new disk page, so I
suspect that updates will hit every index associated with the table with the
(possible exception of partial indexes).

Running too many updates in one transaction, may mean that Postgres may need
to use disk work files.

Depending on RAM etc, it may pay to increase some variables tat affect how
Postgres uses RAM, some of these are per session.


Cheers,
Gavin



--
Yang Zhang
http://yz.mit.edu/




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Yang Zhang
On Fri, Apr 26, 2013 at 7:01 PM, Gavin Flower
 wrote:
> Please do not top post, the convention in these list are to add stuff at the
> end, apart from comments interspersed to make use of appropriate context!

Noted, thanks.

Anyway, any performance hints are greatly appreciated.

>
>
> On 27/04/13 13:35, Yang Zhang wrote:
>>
>> We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
>> m1.xlarge instances, which have:
>>
>> 15 GiB memory
>> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
>> 64-bit platform
>>
>> (Yes, we're moving to EBS Optimized instances + Provisioned IOPS
>> volumes, but prelim. benchmarks suggest this won't get us enough of a
>> boost as much as possibly refactoring the way we're executing these
>> bulk updates in our application.)
>>
>> On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
>>  wrote:
>>>
>>> On 27/04/13 12:14, Yang Zhang wrote:
>>>
>>> It currently takes up to 24h for us to run a large set of UPDATE
>>> statements on a database, which are of the form:
>>>
>>>  UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
>>>  id = constid
>>>
>>> (We're just overwriting fields of objects identified by ID.)
>>>
>>> The tables have handfuls of indices each and no foreign key constraints.
>>> No COMMIT is made till the end.
>>>
>>> It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
>>> baseline we should reasonably target.
>>>
>>> Short of producing a custom program that somehow reconstructs a dataset
>>> for Postgresql to re-import, is there anything we can do to bring the
>>> bulk UPDATE performance closer to that of the import?  (This is an area
>>> that we believe log-structured merge trees handle well, but we're
>>> wondering if there's anything we can do within Postgresql.)
>>>
>>> Some ideas:
>>>
>>> - dropping all non-ID indices and rebuilding afterward?
>>> - increasing checkpoint_segments, but does this actually help sustained
>>>long-term throughput?
>>> - using the techniques mentioned here?  (Load new data as table, then
>>>"merge in" old data where ID is not found in new data)
>>>
>>>
>>> 
>>>
>>> Basically there's a bunch of things to try and we're not sure what the
>>> most effective are or if we're overlooking other things.  We'll be
>>> spending the next few days experimenting, but we thought we'd ask here
>>> as well.
>>>
>>> Thanks.
>>>
>>>
>>> People will need to know your version of Postgres & which Operating
>>> System
>>> etc. plus details of CPU RAM, and Disks... AS well as what changes you
>>> have
>>> made to postgresql.conf...
>>>
>>> I would be inclined to DROP all indexes and reCREATE them later.
>>>
>>> Updating a row might lead to new row being added in a new disk page, so I
>>> suspect that updates will hit every index associated with the table with
>>> the
>>> (possible exception of partial indexes).
>>>
>>> Running too many updates in one transaction, may mean that Postgres may
>>> need
>>> to use disk work files.
>>>
>>> Depending on RAM etc, it may pay to increase some variables tat affect
>>> how
>>> Postgres uses RAM, some of these are per session.
>>>
>>>
>>> Cheers,
>>> Gavin
>>
>>
>>
>> --
>> Yang Zhang
>> http://yz.mit.edu/
>
>



--
Yang Zhang
http://yz.mit.edu/


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Tom Lane
Yang Zhang  writes:
> It currently takes up to 24h for us to run a large set of UPDATE
> statements on a database, which are of the form:

> UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
> id = constid

> (We're just overwriting fields of objects identified by ID.)

Forgive the obvious question, but you do have an index on "id", right?
Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?

> The tables have handfuls of indices each and no foreign key constraints.

How much is a "handful"?

> It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
> baseline we should reasonably target.

Well, maybe.  You didn't say what percentage of the DB you're updating.

But the thing that comes to mind here is that you're probably incurring
a network round trip for each row, and maybe a query-planning round as
well, so you really can't expect that this is going to be anywhere near
as efficient as a bulk load operation.  You could presumably get rid of
the planner overhead by using a prepared statement.  Cutting the network
overhead is going to require a bit more ingenuity --- could you move
some logic into a stored procedure, perhaps, so that one command from
the client is sufficient to update multiple rows?
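
For example, a single statement can carry many rows' worth of constants
(a sketch, with made-up table and column names):

update mytable t
set    field1 = v.field1,
       field2 = v.field2
from (values
        (1, 'new a', 10),
        (2, 'new b', 20),
        (3, 'new c', 30)
     ) as v(id, field1, field2)
where  t.id = v.id;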

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general