On Mon, Jun 1, 2009 at 1:25 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andrew Smith <laconi...@gmail.com> writes:
> > I'm a beginner when it comes to Postgresql, and have a table design
> > question about a project I'm currently working on. I have 1500 data
> > items that need to be copied every minute from an external system into
> > my database. The items have a timestamp, an identifier and a value.
> > For example:
>
> > 12/10/2008 05:00  ID_ABC  14
> > 12/10/2008 05:01  ID_ABC  17
> > 12/10/2008 05:02  ID_ABC  13
>
> > Pretty simple stuff. The current solution (implemented using SQL Server
> > a few years ago) looks like this (an approximation using Postgresql
> > syntax):
>
> > CREATE TABLE "DataImport"
> > (
> >   "DataImportID" serial NOT NULL PRIMARY KEY,
> >   "Time" timestamp without time zone NOT NULL,
> >   "ID_ABC" integer NOT NULL,
> >   "ID_DEF" integer NOT NULL,
> >   "ID_HIJ" integer NOT NULL,
> >   etc
> > );
>
> So the table would have ~1500 columns? You definitely don't want to do
> it that way in Postgres --- you'd be way too close to the maximum column
> count limitation.

After my original post, I found out that the current solution in SQL Server actually had to be split into two different tables due to a similar limitation.

> > My initial thought for the design of the new solution looks like this:
>
> > CREATE TABLE "DataImport"
> > (
> >   "DataImportID" serial NOT NULL PRIMARY KEY,
> >   "Time" timestamp without time zone NOT NULL,
> >   "Identifier" text NOT NULL,
> >   "Value" integer NOT NULL
> > );
>
> DataImportID couldn't be a primary key here, could it? Or is it just
> meant as an artificial primary key? If so, consider not bothering with
> it at all --- (Time, Identifier) seems like a perfectly good natural
> key, and with such short records the extra space for a serial column
> is not negligible.

You're right, I guess there doesn't seem to be much point having that surrogate key in place...and it does take up space. I did a quick test and got the following figures:

1 million records with DataImportID = ~80mb
1 million records without DataImportID = ~50mb

That'll certainly add up over time.

> Anyway, my answer would definitely lean towards using this normalized
> representation, if all the data values are integers. (If they're not,
> it gets messier...)
>
> > Users will then be doing regular queries on this data (say, a few
> > hundred times per day), such as:
>
> > SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC'
> > AND "Time" between '2008-11-07' and '2008-11-11';
>
> An index on (Identifier, Time) (not the other way around) should work
> well for that type of query.

I'll give it a try.
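Putting those two suggestions together, I'm guessing the table and index would end up looking roughly like this (just a sketch on my part, not tested yet, and the index name is only a placeholder):

CREATE TABLE "DataImport"
(
  "Time" timestamp without time zone NOT NULL,
  "Identifier" text NOT NULL,
  "Value" integer NOT NULL,
  PRIMARY KEY ("Time", "Identifier")
);

-- separate index ordered (Identifier, Time) for the per-identifier range queries
CREATE INDEX "DataImport_Identifier_Time_idx"
  ON "DataImport" ("Identifier", "Time");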
> > My concern is that 1500 values * 14400 minutes per day = 21,600,000
> > records. Add this up over the course of a month (the length of time I
> > need to keep the data in this table) and I'll have more than half a
> > billion records being stored in there.
>
> That's not an especially big table. However it might be worth your
> trouble to use partitioning. Not to speed searches, particularly, but
> to make it easier to drop 1/30th of the data each day.

Re-reading this I noticed that I had an extra 0 in one of my figures - there's only 1440 minutes in a day, so my table gets down to the much more manageable 2.16 million records per day instead of 21.6 million. I'll have a look into partitioning too, thanks for your help!

Cheers,

Andrew
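P.S. For the partitioning, my (untested) understanding is that it would be inheritance-based: one child table per day hanging off an empty parent, each child with its own CHECK constraint and index, plus a trigger or rule on the parent to route inserts and constraint_exclusion turned on. Something along these lines, with the table and index names just made up for the example:

-- child table holding one day's worth of rows
CREATE TABLE "DataImport_2008_12_10"
(
  CHECK ("Time" >= '2008-12-10' AND "Time" < '2008-12-11')
) INHERITS ("DataImport");

-- each child needs its own copy of the (Identifier, Time) index
CREATE INDEX "DataImport_2008_12_10_Identifier_Time_idx"
  ON "DataImport_2008_12_10" ("Identifier", "Time");

-- expiring a day that has aged out of the month is then just
DROP TABLE "DataImport_2008_11_10";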