[PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Ramon Bastiaans
Hi all,
I am doing research for a project of mine where I need to store several 
billion values for a monitoring and historical tracking system for a big 
computer system. My current estimate is that I have to store (somehow) 
around 1 billion values each month (possibly more).

I was wondering if anyone has experience with this kind of data volume 
in a postgres sql database and how it affects database design and 
optimization.

What would be important issues when setting up a database this big, and 
is it at all doable? Or would it be insane to think about storing up 
to 5-10 billion rows in a postgres database?

The database's performance is important. There would be no use in 
storing the data if a query will take ages. Queries should be quite fast 
if possible.

I would really like to hear people's thoughts/suggestions or "go see a 
shrink, you must be mad" statements ;)

Kind regards,
Ramon Bastiaans



Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Jeff
On Mar 1, 2005, at 4:34 AM, Ramon Bastiaans wrote:
> What would be important issues when setting up a database this big, 
> and is it at all doable? Or would it be insane to think about 
> storing up to 5-10 billion rows in a postgres database?

Buy a bunch of disks.
And then go out and buy more disks.
When you are done with that - go buy some more disks.
Then buy some ram.
Then buy more disks.
You want the fastest IO possible.
I'd also recommend the opteron route since you can also put heaping 
gobs of RAM in there as well.

> The database's performance is important. There would be no use in 
> storing the data if a query will take ages. Queries should be quite 
> fast if possible.

And make sure you tune your queries.
--
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Markus Schaber
Hi, Ramon,

Ramon Bastiaans wrote:

> The database's performance is important. There would be no use in
> storing the data if a query will take ages. Queries should be quite fast
> if possible.

Which kind of query do you want to run?

Queries that involve only a few rows should stay quite fast when you set
up the right indices.

However, queries that involve sequential scans over your table (like
average computation) will take years. Get fast I/O for this. Or,
better, use a multidimensional data warehouse engine. Those can
precalculate needed aggregate functions and reports. But they need loads
of storage (because of very redundant data storage), and I don't know of
any open source or cheap software for this.
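
As a rough illustration of the precalculation idea (table and column names
here are hypothetical, not from the thread), a plain-SQL sketch of an hourly
roll-up might look like this:

-- Hypothetical raw table plus a precomputed hourly summary.
CREATE TABLE samples (
    sensor_id   integer     NOT NULL,
    sample_time timestamptz NOT NULL,
    value       real        NOT NULL
);

CREATE TABLE samples_hourly (
    sensor_id integer          NOT NULL,
    hour      timestamptz      NOT NULL,
    n         bigint           NOT NULL,
    sum_value double precision NOT NULL,
    min_value real             NOT NULL,
    max_value real             NOT NULL,
    PRIMARY KEY (sensor_id, hour)
);

-- Periodically (e.g. from cron) roll up the previous hour:
INSERT INTO samples_hourly
SELECT sensor_id, date_trunc('hour', sample_time),
       count(*), sum(value), min(value), max(value)
FROM samples
WHERE sample_time >= date_trunc('hour', now()) - interval '1 hour'
  AND sample_time <  date_trunc('hour', now())
GROUP BY sensor_id, date_trunc('hour', sample_time);

Reports and averages then read the small summary table instead of scanning
billions of raw rows; a dedicated warehouse engine does this kind of thing
on a much larger and more general scale.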

Markus
-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread John Arbash Meinel
Ramon Bastiaans wrote:

> Hi all,
> I am doing research for a project of mine where I need to store
> several billion values for a monitoring and historical tracking system
> for a big computer system. My current estimate is that I have to store
> (somehow) around 1 billion values each month (possibly more).

If you have that 1 billion perfectly distributed over all hours of the
day, then you need 1e9/30/24/3600 = 385 transactions per second.
I'm pretty sure that is possible with postgres, you just need pretty
beefy hardware. And like Jeff said, lots of disks for lots of IO.
Something like a quad opteron with 16GB of ram and around 14-20 very fast
disks, raid10 not raid5, etc. To improve query performance, you can do some
load balancing by adding replica machines using Slony.
Or if you can do batch processing, you could split up the work into a
few update machines, which then do bulk updates on the master database.
This lets you get more machines into the job, since you can't share a
database across multiple machines.
> I was wondering if anyone has experience with this kind of data volume
> in a postgres sql database and how it affects database design and
> optimization.

Well, one of the biggest things is whether you can get bulk updates, or
whether clients can handle data being slightly out of date, so you can use
caching. Can you segregate your data into separate tables as much as
possible? Are your clients okay if aggregate information takes a little
while to update?
One trick is to use semi-lazy materialized views to get your updates to
be fast.
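
A "semi-lazy materialized view" in this context is usually just an ordinary
summary table plus a refresh routine that clients call only when the summary
is older than they can tolerate (PostgreSQL of that era has no built-in
materialized views). A minimal sketch, with all names hypothetical:

-- Raw table stands in for whatever the monitoring system collects;
-- requires the plpgsql language to be installed.
CREATE TABLE samples (
    sensor_id   integer     NOT NULL,
    sample_time timestamptz NOT NULL,
    value       real        NOT NULL
);

CREATE TABLE matview_daily_totals (
    day         date PRIMARY KEY,
    value_count bigint           NOT NULL,
    value_sum   double precision NOT NULL
);

CREATE TABLE matview_state (last_refresh timestamptz NOT NULL);
INSERT INTO matview_state VALUES (now());

CREATE OR REPLACE FUNCTION refresh_daily_totals() RETURNS void AS $$
BEGIN
    -- A real refresh would be incremental; rebuilding everything keeps
    -- the sketch short.
    DELETE FROM matview_daily_totals;
    INSERT INTO matview_daily_totals
        SELECT sample_time::date, count(*), sum(value)
        FROM samples
        GROUP BY sample_time::date;
    UPDATE matview_state SET last_refresh = now();
END;
$$ LANGUAGE plpgsql;

The "semi-lazy" part is simply that readers check matview_state.last_refresh
and only invoke refresh_daily_totals() when the summary has gone stale.
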
> What would be important issues when setting up a database this big,
> and is it at all doable? Or would it be insane to think about
> storing up to 5-10 billion rows in a postgres database?

I think if you can design the db properly, it is doable. But if you
have clients saying "I need up-to-the-second information on 1 billion
rows", you're never going to get it.
> The database's performance is important. There would be no use in
> storing the data if a query will take ages. Queries should be quite
> fast if possible.

Again, it depends on the queries being done.
There are some nice tricks you can use, like doing month-by-month
partitioning (if you are getting 1 billion inserts a month, you might want
week-by-week partitioning), and then with a date column index and a union
all view you should be able to get pretty good insert speed and still keep
*recent* queries fast. Going through 1 billion rows is always going to be
expensive.
> I would really like to hear people's thoughts/suggestions or "go see a
> shrink, you must be mad" statements ;)
>
> Kind regards,
> Ramon Bastiaans

I think it would be possible, but there are a lot of design issues with
a system like this. You can't go into it thinking that you can design a
multi billion row database the same way you would design a million row db.
John
=:->




Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Vig, Sandor (G/FI-2)
385 transactions/sec?

fsync = false

risky but fast.



Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread John Arbash Meinel
Vig, Sandor (G/FI-2) wrote:
> 385 transactions/sec?
> fsync = false
> risky but fast.

I think with a dedicated RAID10 for pg_xlog (or possibly a battery-backed
ramdisk), and then a good number of disks in a bulk RAID10, or possibly a
good partitioning of the db across multiple raids, you could probably get
a good enough tps.
But you're right, fsync=false could certainly give you the performance,
though a power outage means potential *real* corruption. Not just
missing transactions, but duplicated rows, all sorts of ugliness.
John
=:->




Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Andras Kadinger
What do your "values" consist of?

Would it be possible to group several hundred or thousand of them into a
single row somehow that still makes it possible for your queries to get at 
them efficiently? 

What kind of queries will you want to run against the data?

For example, if you have a measurement of some process value each
millisecond, it might be a good performance tradeoff to pack a whole
second of measurements into a single row, if your data processing only
needs to access the values sequentially. This single step immediately
reduces your row and transaction count to 1/1000th.
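
A minimal sketch of that packing idea, using an array column (all names are
hypothetical; a bytea blob per second would work as well):

-- One row per (channel, second); the ~1000 millisecond samples for that
-- second are packed into an array.
CREATE TABLE samples_packed (
    channel_id    integer     NOT NULL,
    sample_second timestamptz NOT NULL,   -- truncated to the second
    readings      real[]      NOT NULL,   -- element i = millisecond i
    PRIMARY KEY (channel_id, sample_second)
);

-- One insert now covers a whole second of measurements:
INSERT INTO samples_packed (channel_id, sample_second, readings)
VALUES (42, '2005-03-01 12:00:00+00', ARRAY[0.91, 0.92, 0.93]::real[]);

-- Sequential processing unpacks the array client-side; individual elements
-- are still reachable by index:
SELECT readings[500]
FROM samples_packed
WHERE channel_id = 42 AND sample_second = '2005-03-01 12:00:00+00';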

Please tell us more.





Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Alan Stange
Isn't that 385 rows/second?  Presumably one can insert more than one
row in a transaction.

-- Alan
Vig, Sandor (G/FI-2) wrote:

> 385 transactions/sec?
>
> fsync = false
> risky but fast.


Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Markus Schaber
Hi, John,

John Arbash Meinel wrote:

>> I am doing research for a project of mine where I need to store
>> several billion values for a monitoring and historical tracking system
>> for a big computer system. My current estimate is that I have to store
>> (somehow) around 1 billion values each month (possibly more).
>>
> If you have that 1 billion perfectly distributed over all hours of the
> day, then you need 1e9/30/24/3600 = 385 transactions per second.

I hope that he does not use one transaction per inserted row.

In our in-house tests, we got a speedup factor of up to some hundred
when bundling rows on insertions. The fastest speed was with using
bunches of some thousand rows per transaction, and running about 5
processes in parallel.
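
A minimal sketch of what that batching looks like on the wire (table and
file names are hypothetical):

-- Hypothetical table; the point is the transaction boundaries, not the schema.
CREATE TABLE samples (
    sensor_id   integer     NOT NULL,
    sample_time timestamptz NOT NULL,
    value       real        NOT NULL
);

-- One transaction per batch of a few thousand rows, instead of one per row:
BEGIN;
INSERT INTO samples VALUES (1, '2005-03-01 12:00:00.000+00', 0.91);
INSERT INTO samples VALUES (1, '2005-03-01 12:00:00.001+00', 0.92);
-- ... a few thousand more ...
COMMIT;

-- Faster still: load each batch with COPY from a prepared file
-- (tab-delimited text by default).
COPY samples FROM '/tmp/samples_batch.txt';

Several such loader processes running in parallel, as Markus describes, is
what gets the aggregate insert rate up.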

Mind the usual performance tips: use a small but fast-writing RAID for
the transaction log (no RAID-5 or RAID-6 variants), possibly a mirror
of two disk-backed SSDs. Use different disks for the actual data
(here, LVM2 with growing volumes could be very handy). Have enough RAM.
Use a fast file system.

BTW, as you read about the difficulties that you'll face with this
enormous amount of data: don't think that your task will be much easier
or cheaper using any other DBMS, whether commercial or open source. For
all of them, you'll need "big iron" hardware and a skilled team of
admins to set up and maintain the database.

Markus

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] Inheritence versus delete from

2005-03-01 Thread Sven Willenberger
On Tue, 2005-03-01 at 09:48 -0600, John Arbash Meinel wrote:
> Sven Willenberger wrote:
> 
> >Trying to determine the best overall approach for the following
> >scenario:
> >
> >Each month our primary table accumulates some 30 million rows (which
> >could very well hit 60+ million rows per month by year's end). Basically
> >there will end up being a lot of historical data with little value
> >beyond archival.
> >
> >The question arises then as the best approach of which I have enumerated
> >three:
> >
> 
> I just thought of another possibility. You could create each table
> month-by-month, and then use a view to combine them, and possibly a rule
> to keep things clean.
> 
> So you would do something like:
> 
> I will assume you already have the data in one big table to show the
> easiest way to create the small tables.
> 
> create table tblname_2005_01 as select * from orig_tbl where day >=
> '2005-01-01' and day < '2005-02-01';
> create table tblname_2005_02 as select * from orig_tbl where day >=
> '2005-02-01' and day < '2005-03-01';
> create table tblname_2005_03 as select * from orig_tbl where day >=
> '2005-03-01' and day < '2005-04-01';
> -- create appropriate indicies, rules, constraints on these tables
> 
> Then you create a view which includes all of these tables.
> 
> create or replace view tblname as
> select * from tblname_2005_01
> union all select * from tblname_2005_02
> union all select * from tblname_2005_03
> ;
> 
> Then create insert and update rules which decide which table gets the new
> data.
> 
> create rule up_tblname as on update to tblname do instead
>     update tblname_2005_03 set
>         col1 = NEW.col1,
>         col2 = NEW.col2,
>         ...
>     where id = NEW.id;
> -- This assumes that you have a unique id on your tables. This is just
> -- whatever your primary key is, so it should be a decent assumption.
> 
> create rule ins_tblname as on insert to tblname do instead
>     insert into tblname_2005_03 (col1, col2, ...)
>     values (new.col1, new.col2, ...);
> 
> Now the downside of this method, is that every month you need to create
> a new table, and then update the views and the rules. The update rules
> are pretty straightforward, though.
> 
> The nice thing is that it keeps your data partitioned, and you don't
> ever have a large select/delete step. You probably will want a small one
> each month to keep the data exactly aligned by month. You don't really
> have to have exact alignments, but as humans, we tend to like that stuff. :)
> 
> Probably this is more overhead than you would like to do. Especially if
> you know that you can get away with method 2 (keep 1 big table, and just
> remove old rows out of it every month.)
> 
> But this method means that all of your data stays live, but queries with
> appropriate restrictions should stay fast. You also have the ability
> (with v8.0) to move the individual tables onto separate disks.
> 
> One more time, though, if you can get away with removing old data and
> just archiving it, do so. But if you want to keep the data live, there
> are a couple of alternatives.
> 

Actually that was the thought behind my using inheritance; when querying
the parent table, it basically does a union all; also, I think it would be
quicker to insert directly into the child table (simply by modifying my
query once a month) rather than incurring the overhead of the rule.

Since the children tables are individual tables, all the benefits you
cite above still hold. 

Thanks for the input on this ... will have to try a couple things to see
which is most manageable.

Sven




Re: [PERFORM] Inheritence versus delete from

2005-03-01 Thread John Arbash Meinel
Sven Willenberger wrote:

You're right, child tables do act like that. I just recall that, at least
at one point, postgres didn't handle indexes on child tables very
well. That's just something someone else ran into, so he could have been
doing something wrong.
I agree, if child tables end up doing a union all, then it is much
easier to maintain. A select against the master table should
automatically get all of the child tables.
It might just be that you need to create a new index on the child table
whenever you create it, and then postgres can use that new index to do
the filtering.
John
=:->
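
A minimal sketch of the inheritance layout being discussed (table, column,
and index names here are hypothetical):

-- Parent table; queries go against this one.
CREATE TABLE measurements (
    id    bigint NOT NULL,
    day   date   NOT NULL,
    value real   NOT NULL
);

-- One child per month, each created with its own index when the month starts.
CREATE TABLE measurements_2005_03 () INHERITS (measurements);
CREATE INDEX measurements_2005_03_day_idx ON measurements_2005_03 (day);

-- Bulk loads go straight into the current child...
INSERT INTO measurements_2005_03 VALUES (1, '2005-03-01', 0.91);

-- ...while a select against the parent automatically includes all children:
SELECT count(*) FROM measurements WHERE day >= '2005-03-01';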




Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread John Arbash Meinel
Markus Schaber wrote:
> I hope that he does not use one transaction per inserted row.
> In our in-house tests, we got a speedup factor of up to some hundred
> when bundling rows on insertions. The fastest speed was with using
> bunches of some thousand rows per transaction, and running about 5
> processes in parallel.

You're right. I guess it just depends on how the data comes in, and what
you can do at the client ends. That is kind of where I was saying put a
machine in front which gathers up the information, and then does a batch
update. If your client can do this directly, then you have the same
advantage.

John
=:->




[PERFORM] Confusion about locales and 'like' indexes

2005-03-01 Thread Dan Harris
Greetings,
I have been beating myself up today trying to optimize indices for a 
query that uses LIKE.  In my research I have read that the locale 
setting may affect PostgreSQL's choice of seq scan vs index scan.  I am 
running Fedora Core 2 and it appears when I run "locale" that it is set 
to 'en_US.UTF-8'.

Did I fall into a "gotcha" trap here about C vs non-C locales?  I'm not 
much of a C programmer so I have no idea what all this touches and 
everything has been left as default during PG compilation as well as 
Fedora install.   I can pg_dump and initdb again with --locale=C if 
this will allow my LIKE queries to use indexes, but I just wanted to 
know if there was some other place I needed to change locales in the 
system? e.g. postgresql.conf or env vars?  Or, would the initdb and 
reload alone fix it?

I'm running 8.0.1 if that matters.
Thanks


Re: [PERFORM] Inheritence versus delete from

2005-03-01 Thread Gaetano Mendola
Sven Willenberger wrote:
> Trying to determine the best overall approach for the following
> scenario:
> 
> Each month our primary table accumulates some 30 million rows (which
> could very well hit 60+ million rows per month by year's end). Basically
> there will end up being a lot of historical data with little value
> beyond archival.
> 
> The question arises then as the best approach of which I have enumerated
> three:
> 
> 1) Just allow the records to accumulate and maintain constant vacuuming,
> etc allowing for the fact that most queries will only be from a recent
> subset of data and should be mostly cached.
> 
> 2) Each month:
> SELECT * INTO 3monthsago_dynamically_named_table FROM bigtable WHERE
> targetdate < $3monthsago;
> DELETE FROM bigtable where targetdate < $3monthsago;
> VACUUM ANALYZE bigtable;
> pg_dump 3monthsago_dynamically_named_table for archiving;


In my experience, copying and deleting 60+ million rows in a single
transaction is not feasible, at least on my 1 GB RAM, 2-way CPU box.



Regards
Gaetano Mendola




Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Josh Berkus
Ramon,

> What would be important issues when setting up a database this big, and
> is it at all doable? Or would it be insane to think about storing up
> to 5-10 billion rows in a postgres database.

What's your budget?  You're not going to do this on a Dell 2650.  Do you
have the kind of budget necessary to purchase/build a good SAN, a
quad-opteron machine, etc.?  Or at least hire some tuning help?


-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Confusion about locales and 'like' indexes

2005-03-01 Thread Tom Lane
Dan Harris <[EMAIL PROTECTED]> writes:
> query that uses LIKE.  In my research I have read that the locale 
> setting may affect PostgreSQL's choice of seq scan vs index scan.

Non-C-locale indexes can't support LIKE because the sort ordering
isn't necessarily right.

> I am running Fedora Core 2 and it appears when I run "locale" that it
> is set to 'en.US-UTF-8'.

This is not a definitive indication of the environment the database
sees, though.  Try "show lc_collate".

> I can pg_dump and initdb again with --locale=C if 
> this will allow my LIKE queries to use indexes, but I just wanted to 
> know if there was some other place I needed to change locales in the 
> system? e.g. postgresql.conf or env vars?  Or, would the initdb and 
> reload alone fix it?

That would do it.  Alternatively you can create special-purpose indexes
with one of the xxx_pattern_ops operator classes to support LIKE.

regards, tom lane
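
A minimal sketch of both suggestions (the table and column names are
hypothetical; the operator class is varchar_pattern_ops for varchar columns
and text_pattern_ops for text):

-- Check which collation the cluster was initdb'd with:
SHOW lc_collate;

-- Special-purpose index that supports LIKE even in a non-C locale:
CREATE INDEX idx_customer_name_pattern
    ON customer (name varchar_pattern_ops);

-- A left-anchored pattern can now use the index:
SELECT * FROM customer WHERE name LIKE 'Smi%';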



Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Alan Stange
Ramon Bastiaans wrote:
It just dawned on me that we're doing something that, while not the 
same, might be relevant.  One of our tables has ~85M rows in it 
according to the output from an "explain select * from table".  I don't 
plan on trying a select count(*) any time soon :)  We add and remove 
about 25M rows a day to/from this table, which would be about 750M 
rows/month total.  Given our current usage of the database, it could 
handle a larger row/day rate without too much trouble.  (The problem 
isn't adding rows but deleting rows.)

    Column    |   Type   | Modifiers
--------------+----------+-----------
 timeseriesid | bigint   |
 bindata      | bytea    |
 binsize      | integer  |
 rateid       | smallint |
 ownerid      | smallint |
Indexes:
    "idx_timeseries" btree (timeseriesid)

In this case, each bytea entry is typically about 2KB of data, so the
total table size is about 150GB, plus some index overhead.

A second table has ~100M rows according to explain select *.  Again it
has about 30M rows added and removed per day.

   Column   |         Type          | Modifiers
------------+-----------------------+-----------
 uniqid     | bigint                |
 type       | character varying(50) |
 memberid   | bigint                |
 tag        | character varying(50) |
 membertype | character varying(50) |
 ownerid    | smallint              |
Indexes:
    "composite_memberid" btree (memberid)
    "composite_uniqid" btree (uniqid)
There are some additional tables that have a few million rows / day of 
activity, so call it 60M rows/day added and removed.  We run a vacuum 
every day.

The box is a dual Opteron 248 from Sun, running Linux 2.6 with 8GB of
memory.  We use reiserfs.  We started with XFS but had several instances
of file system corruption.  Obviously, no RAID 5.  The xlog is on a
2-drive mirror and the rest is on a separate mirrored volume.  The drives
are fibre channel, but that was a mistake as the driver from IBM wasn't
very good.

So, while we don't have a billion rows we do have ~200M total rows in 
all the tables and we're certainly running the daily row count that 
you'd need to obtain.   But scaling this sort of thing up can be tricky 
and your mileage may vary.

In a prior career I ran a "data intensive computing center" and helped 
do some design work for a high energy physics experiment:  petabytes of 
data, big tape robots, etc., the usual Big Science toys.   You might 
take a look at ROOT and some of the activity from those folks if you 
don't need transactions and all the features of a general database like 
postgresql.

-- Alan


Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Thomas Ganss
Ramon Bastiaans wrote:

> My current estimate is that I have to store (somehow)
> around 1 billion values each month (possibly more).
You should post the actual number or power of ten,
since "billion" is not always interpreted the same way...
rgds
thomas


[PERFORM] What is the postgres sql command for last_user_id ???

2005-03-01 Thread amrit
I would like to know whether there is a command by which the server will give
the record ID back to the client when the client inserts data and the server
generates an autoincrement ID for that record.
For example, if many clients put money data into the server, each record from
each client gets its own record ID through the server's autoincrement process
[x+1], and I don't want to lock the record since that would slow the system
down. Each client then wants to know which ID the server gave its record, in
order to select that record and print the receipt [bill].
I know that in mysql there is a function "last_insert_id()" which does what I
describe above. Does anybody know the equivalent? Please give me the details.

Amrit,Thailand
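
For reference, a minimal sketch of the usual PostgreSQL answer (the table
name is hypothetical): with a serial column, currval() on the column's
sequence returns the ID just generated in the current session, so it is
safe even with many clients inserting concurrently.

CREATE TABLE receipt (
    id     serial PRIMARY KEY,
    amount numeric(12,2) NOT NULL
);

INSERT INTO receipt (amount) VALUES (199.50);

-- Returns the id assigned by the INSERT above, per session:
SELECT currval('receipt_id_seq');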





[PERFORM] Help with tuning this query

2005-03-01 Thread Ken Egervari



I've tried to use Dan Tow's tuning method and created all the right indexes
from his diagramming method, but the query still performs quite slowly, both
inside the application and inside pgAdmin III.  Can anyone be kind enough to
help me tune it so that it performs better in postgres?  I don't think it's
using the right indexes, or maybe postgres needs special treatment.

I've converted the query below to SQL from a Hibernate query, so the syntax
is probably not perfect but its semantics are exactly the same.  I did this
by looking at the source code, but I can't run it to get the exact SQL since
I don't have the database on my home machine.
 
select s.*
from shipment s
    inner join carrier_code cc on s.carrier_code_id = cc.id
    inner join carrier c on cc.carrier_id = c.id
    inner join carrier_to_person ctp on ctp.carrier_id = c.id
    inner join person p on p.id = ctp.person_id
    inner join shipment_status cs on s.current_status_id = cs.id
    inner join release_code rc on cs.release_code_id = rc.id
    left join shipment_status ss on ss.shipment_id = s.id
where p.id = :personId
    and s.is_purged = false
    and rc.number = '9'
    and cs is not null
    and cs.date >= current_date - 31
order by cs.date desc
Just assume I have no indexes for the moment, because while some of the
indexes I made do make it faster, it's still around 250 milliseconds, and
under heavy load the query performs very badly (6-7 seconds).
 
For your information:
 
shipment contains 40,000 rows
shipment_status contains 80,000 rows
release_code contains 8 rows
person contains 300 rows
carrier contains 60 rows
carrier_code contains 70 rows
 
The filter ratios are:
 
rc.number = '9' (0.125)
cs.date >= current_date - 31 (.10)
p.id = ? (0.003)
s.is_purged = false (.98)
 
I really hope someone can help since I'm pretty 
much stuck.
 
Best regards and many thanks,
Ken
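
A hedged starting point for a question like this (not an answer from the
thread, and the index names below are made up): run EXPLAIN ANALYZE on the
exact SQL to compare the planner's row estimates with reality, and try
indexes that match the most selective predicates, keeping only the ones the
plan actually uses.

-- Show the real plan and where the time goes (simplified to the joins that
-- carry the selective filters):
EXPLAIN ANALYZE
select s.*
from shipment s
    inner join shipment_status cs on s.current_status_id = cs.id
    inner join release_code rc on cs.release_code_id = rc.id
where s.is_purged = false
  and rc.number = '9'
  and cs.date >= current_date - 31;

-- Candidate indexes matching the selective filters:
create index shipment_status_date_idx on shipment_status (date);
create index shipment_current_status_idx on shipment (current_status_id);

-- Make sure the planner has fresh statistics:
analyze shipment;
analyze shipment_status;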