On Mon, Jun 1, 2009 at 1:25 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Andrew Smith <laconi...@gmail.com> writes:
> > I'm a beginner when it comes to Postgresql, and have a table design
> question
> > about a project I'm currently working on.  I have 1500 data items that
> need
> > to be copied every minute from an external system into my database.  The
> > items have a timestamp, an identifier and a value.  For example:
>
> > 12/10/2008 05:00       ID_ABC      14
> > 12/10/2008 05:01       ID_ABC      17
> > 12/10/2008 05:02       ID_ABC      13
>
> > Pretty simple stuff.  The current solution (implemented using SQL Server
> > a few years ago) looks like this (an approximation using Postgresql
> > syntax):
>
> > CREATE TABLE "DataImport"
> > (
> >   "DataImportID" serial NOT NULL PRIMARY KEY,
> >   "Time" timestamp without time zone NOT NULL,
> >   "ID_ABC" integer NOT NULL,
> >   "ID_DEF" integer NOT NULL,
> >   "ID_HIJ" integer NOT NULL,
> >   etc
> > );
>
> So the table would have ~1500 columns?  You definitely don't want to do
> it that way in Postgres --- you'd be way too close to the maximum column
> count limitation.
>

After my original post, I found out that the current solution in SQL Server
actually had to be split into two different tables due to a similar
limitation.


>
>
> > My initial thought for the design of the new solution looks like this:
>
> > CREATE TABLE "DataImport"
> > (
> >   "DataImportID" serial NOT NULL PRIMARY KEY,
> >   "Time" timestamp without time zone NOT NULL,
> >   "Identifier" text NOT NULL,
> >   "Value" integer NOT NULL
> > );
>
> DataImportID couldn't be a primary key here, could it?  Or is it just
> meant as an artificial primary key?  If so, consider not bothering with
> it at all --- (Time, Identifier) seems like a perfectly good natural
> key, and with such short records the extra space for a serial column
> is not negligible.
>

You're right, I guess there doesn't seem to be much point having that
surrogate key in place...and it does take up space.  I did a quick test and
got the following figures:

1 million records with DataImportID = ~80 MB
1 million records without DataImportID = ~50 MB

That'll certainly add up over time.
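For reference, the stripped-down table I used for that quick test looks
roughly like this (just a sketch at this stage - I've made the (Time,
Identifier) pair the primary key as you suggested):

CREATE TABLE "DataImport"
(
  "Time" timestamp without time zone NOT NULL,
  "Identifier" text NOT NULL,
  "Value" integer NOT NULL,
  PRIMARY KEY ("Time", "Identifier")
);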


> Anyway, my answer would definitely lean towards using this normalized
> representation, if all the data values are integers.  (If they're not,
> it gets messier...)
>
> > Users will then be doing regular queries on this data (say, a few hundred
> > times per day), such as:
>
> > SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC'
> > AND "Time" between '2008-11-07' and '2008-11-11';
>
> An index on (Identifier, Time) (not the other way around) should work
> well for that type of query.


I'll give it a try.
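Presumably something along these lines (the index name here is just a
placeholder):

CREATE INDEX "DataImport_Identifier_Time_idx"
  ON "DataImport" ("Identifier", "Time");

so a lookup on a single identifier can then range-scan the "Time" values
for the BETWEEN clause.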


>
>
> > My concern is that 1500 values * 14400 minutes per day = 21,600,000
> > records.  Add this up over the course of a month (the length of time I
> > need to keep the data in this table) and I'll have more than half a
> > billion records being stored in there.
>
> That's not an especially big table.  However it might be worth your
> trouble to use partitioning.  Not to speed searches, particularly, but
> to make it easier to drop 1/30th of the data each day.


Re-reading this, I noticed I had an extra 0 in one of my figures - there
are only 1440 minutes in a day, so the table comes down to a much more
manageable 2.16 million records per day instead of 21.6 million.  I'll have
a look into partitioning too, thanks for your help!
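From a quick read of the docs, my understanding of the inheritance-style
partitioning is one child table per day, something like this (untested
sketch, table names made up):

CREATE TABLE "DataImport_20081210"
(
  CHECK ("Time" >= '2008-12-10' AND "Time" < '2008-12-11')
) INHERITS ("DataImport");

-- ageing out the oldest day then becomes a cheap metadata operation:
DROP TABLE "DataImport_20081110";

with constraint_exclusion turned on so queries that filter on "Time" only
touch the relevant child tables.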

Cheers,

Andrew
