Thank you all very much for your thorough replies. To be honest, many of the things you mention I had not heard of before. It seems that clustering the table on an index is the way to go; I now have to find out whether most of our queries would benefit from it. I do not directly control the installation parameters of the database, since I do not have root access to that server, but I may be able to find another one where I can control these parameters. The configuration parameters I should look to increase are the ones related to shared memory, then? I know it is about the worst combination, such a large database with an inexperienced DBA like myself :)
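If I have understood the advice correctly, the plan would look roughly like the sketch below. This is only my reading of your replies, so please correct me if these are the wrong knobs; the memory values are placeholders (assuming a machine with around 8GB of RAM), and atm_moment_idx is simply my existing index on moment:

    # postgresql.conf -- values illustrative for a ~8GB machine
    shared_buffers = 2GB          # the default is far too small for this dataset
    effective_cache_size = 6GB    # hint to the planner about the OS file cache
    work_mem = 64MB               # per-sort / per-hash memory for large scans
    maintenance_work_mem = 1GB    # speeds up CREATE INDEX and CLUSTER

    -- rewrite the table in index order, then refresh the statistics
    CLUSTER atmospherepoint USING atm_moment_idx;
    ANALYZE atmospherepoint;

One thing I did find out already: raising shared_buffers past the kernel's SHMMAX limit needs root to change the kernel setting, which is probably why I will need that other server.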
r.

On Mon, May 25, 2009 at 12:17 PM, Alban Hertroys <dal...@solfertje.student.utwente.nl> wrote:

> On May 25, 2009, at 10:58 AM, Ramiro Diaz Trepat wrote:
>
>> The table with the atmosphere pixels currently has about 140MM records,
>> and the one with the values about 1000MM records. They should both grow
>> to about twice this size.
>
> Did you tune postgres to use the available resources? By default it comes
> with rather modest settings so it will run on low-spec hardware without
> trouble. For a dataset this size modest hardware is obviously out of the
> question, and so the default config will likely not be suitable.
>
>> A simple query with no joins like this:
>>
>> select trajectory from atmospherepoint where moment='1979-05-02 11:45:00'
>> and height >= 12000 and height <= 14000 and longitude >= 180 and
>> longitude <= 190 and latitude >= 0 and latitude <= 10;
>
> I think the BETWEEN operator would make that query a bit easier on the
> eyes, like so:
>
> select trajectory from atmospherepoint where moment='1979-05-02 11:45:00'
> and height between 12000 and 14000 and longitude between 180 and 190 and
> latitude between 0 and 10;
>
>> is taking almost 4 minutes on a decent multi-core server. Also, the
>> moment equality test should yield no more than 2MM rows, so I thought
>> that should make things a lot faster.
>>
>> The explain returns the suspicious "Seq Scan" up front:
>>
>> Seq Scan on atmospherepoint  (cost=0.00..5461788.08 rows=134 width=8)
>>   Filter: ((moment = '1979-05-02 11:45:00'::timestamp without time zone)
>> AND (height >= 12000::double precision) AND (height <= 14000::double
>> precision) AND (longitude >= 180::double precision) AND (longitude <=
>> 190::double precision) AND (latitude >= 0::double precision) AND
>> (latitude <= 10::double precision))
>
> That query is expecting 134 rows, how accurate is that? An explain analyse
> will show you that (and will take 4 minutes in this case).
>
> I think you'll need lots of memory if you want to keep those indexes in
> memory. If I do a rough byte-count of the datatypes involved times the
> number of records, I think just the index data (w/o metadata) takes up
> 40B * 140M rows = 5.6GB. Scanning an index on disk is likely to be slow,
> which may well be one of the reasons PG decides on a sequential scan.
>
> What is the distribution of the data you're querying for? If it's all
> over the place then the planner made a pretty good guess that a
> sequential scan is most optimal; random disk I/O isn't very optimal,
> sequential I/O is usually much faster.
>
> Before solving that you'll want to figure out whether this is a typical
> query, or whether distributing the data differently will hurt other
> queries. If it's a typical query, then clustering your data on an index
> and/or partitioning will help. If it's not, it will still help this type
> of query but it may hurt other types of queries. You'll want to optimise
> in the right direction.
>
> As other people said already, partitioning is likely to help here. Your
> data seems very suitable for partitioning too. I think I'd divide the
> coordinates into a number of latitude/longitude slices (a grid seems
> suitable to me, but I don't deal with atmospheric data). With 32 slices
> you'll have <500k records per slice on average; whether that's
> sufficiently small you'll have to test.
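[Replying inline for the archives: if I follow the partitioning suggestion correctly, on 8.x this would be done with inheritance plus CHECK constraints, something like the sketch below? The child-table names and the 10-degree slices are made up for illustration.]

    -- one child table per longitude slice; the CHECK constraint is what
    -- lets the planner skip partitions that cannot match a query
    CREATE TABLE atmospherepoint_lon180 (
        CHECK (longitude >= 180 AND longitude < 190)
    ) INHERITS (atmospherepoint);

    CREATE TABLE atmospherepoint_lon190 (
        CHECK (longitude >= 190 AND longitude < 200)
    ) INHERITS (atmospherepoint);

    -- each child needs its own indexes
    CREATE INDEX atm_lon180_moment_idx ON atmospherepoint_lon180 (moment);
    CREATE INDEX atm_lon190_moment_idx ON atmospherepoint_lon190 (moment);

    -- required so the planner actually uses the CHECK constraints to prune
    SET constraint_exclusion = on;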
> Alternatively, if your data is clustered (on disk) by say
> longitude/latitude, it may already help to define partial indexes on
> longitude/latitude ranges, for example:
>
> CREATE INDEX atm_section1_idx ON atmospherepoint (longitude, latitude, height)
> WHERE longitude BETWEEN 180 AND 190 AND latitude BETWEEN 0 AND 10;
>
> This is very similar to partitioning your data, but it doesn't require
> splitting the data up across multiple partitions (tables). This could be
> faster if you have queries spanning multiple table-partitions, but if
> your data isn't clustered appropriately, finding it on disk will require
> more random I/O than a partitioned layout.
>
> Choices, choices... The only way to find out what works best is to test,
> I'm afraid.
>
>> but I have created indices for every column in the table and I have
>> also run ANALYZE and VACUUM:
>>
>>              Table "public.atmospherepoint2"
>>    Column   |            Type             | Modifiers
>>  -----------+-----------------------------+-----------
>>  id         | integer                     | not null
>>  trajectory | integer                     | not null
>>  moment     | timestamp without time zone | not null
>>  longitude  | real                        | not null
>>  latitude   | real                        | not null
>>  height     | real                        | not null
>> Indexes:
>>     "atmospherepoint2_pkey" PRIMARY KEY, btree (id)
>>     "atm_height_idx" btree (height)
>>     "atm_latitude_idx" btree (latitude)
>>     "atm_longitude_idx" btree (longitude)
>>     "atm_moment_idx" btree (moment)
>>     "atm_trajectory_idx" btree (trajectory)
>> Foreign-key constraints:
>>     "atmospherepoint2_trajectory_fkey" FOREIGN KEY (trajectory)
>> REFERENCES trajectory2(id)
>
> You do realise that floating point arithmetic isn't exact, don't you? If
> that matters to you, you're probably better off using the numeric type,
> although apparently FP arithmetic is faster.
>
> That aside, you're probably better off with a postgis type for the
> co-ordinates, as that's designed for this kind of data.
>
>> I will very much welcome any advice or pointers on how to tune these
>> tables to work faster.
>> The tables don't change at all once the data has been loaded; they are
>> only used for queries.
>> Thank you very much.
>
> Alban Hertroys
>
> --
> If you can't see the forest for the trees,
> cut the trees and you'll see there is no forest.
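PS: on the PostGIS suggestion, in case it helps other readers of this thread, my understanding is that the setup would look roughly like the sketch below. This is untested on my side; the SRID of -1 (unknown) and the column and index names are just illustrative, and I am not sure yet how our 0-360 longitudes interact with a proper geographic SRID.

    -- add a 2D point column and fill it from the existing coordinates
    SELECT AddGeometryColumn('public', 'atmospherepoint', 'geom', -1, 'POINT', 2);
    UPDATE atmospherepoint
       SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), -1);

    -- a GiST index makes bounding-box searches index-backed
    CREATE INDEX atm_geom_idx ON atmospherepoint USING GIST (geom);

    -- && is the bounding-box overlap operator, which can use the GiST index
    SELECT trajectory
      FROM atmospherepoint
     WHERE moment = '1979-05-02 11:45:00'
       AND height BETWEEN 12000 AND 14000
       AND geom && ST_SetSRID(ST_MakeBox2D(ST_Point(180, 0), ST_Point(190, 10)), -1);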