On Thu, Oct 20, 2011 at 8:42 AM, Jeff Adams <jeff.ad...@noaa.gov> wrote:
> Thanks for the reply and assistance. I share your concern that the approach > may be slow. I am not adverse to creating some sort of table to store the > time periods with years for the temporal extent of the data if that would > speed up the process. In that situation there would be multiple records > (one > record for each year) that it would need to check against. I suppose I > could > then create some sort of aggregate to then count the number of records in > which the date falls within the range. Anything over 0 would indicate that > the date does fall within the range? > > From: Adam Cornett [mailto:adam.corn...@gmail.com] > Sent: Wednesday, October 19, 2011 6:51 PM > To: Jeff Adams > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Date Range Using Months and Days Only > > > On Wed, Oct 19, 2011 at 12:40 PM, Jeff Adams <jeff.ad...@noaa.gov> wrote: > Greetings, > > I have to write a query on a fairly large table of data (>100 million rows) > where I need to check to see if a date (epoch) falls between a range of > values. The catch is that the range is defined only by month and day > values. > For example the record containing the epoch value will be linked to a table > that containing columns named start_month, start_day, end_month, end_day > that define the range. With respect to the range, year does not matter, > however, some of the ranges will start in November and end in April of the > next year. Has anyone come across this type of query? I could certainly > write a function or even include criteria in a query that would extract > date > parts of the epoch and then compare against the values in the start_month, > start_day, end_month, end_day (it might get complex with respect to ranges > where the start year and end year are different), but I am worried about > performance. I thought I would seek some input before I floundered through > the many iterations of poor performing alternatives! Any thoughts would be > greatly appreciated. > > Thanks in advance... > Jeff > > > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general > > create table a ( > id_a integer, > epoch integer > ); > > create table b ( > id_b integer, > start_month integer, > start_day integer, > end_month integer, > end_day integer > ); > CREATE OR REPLACE FUNCTION get_timestamps(sm integer, sd integer, em > integer, ed integer, year integer) > RETURNS timestamp without time zone[] AS > $BODY$ > declare > syear integer := year; > eyear integer := year; > tstamps timestamp[]; > begin > if(sm>em) then > -- assume that since the end month is less than the start month is in > the next year > eyear := eyear+1; > end if; > tstamps[0] := (syear|| '-' || sm || '-' || sd)::timestamp; > tstamps[1] := (eyear|| '-' || em || '-' || ed)::timestamp; > return tstamps; > > end > $BODY$ > LANGUAGE plpgsql STABLE; > > create view a_timestamp as > SELECT id_a, TIMESTAMP WITH TIME ZONE 'epoch' + epoch * INTERVAL '1 second' > as tstamp from a; > > with ab as( > select > id_a,id_b,get_timestamps(b.start_month,b.start_day,b.end_month,b.end_day, > extract(year from a.tstamp)::integer) ts_arr,a.tstamp > from a_timestamp as a,b > ) > select * from ab > where ab.tstamp between ts_arr[0] and ts_arr[1] > > This obviously isn't a fast solution to your problem, although converting > the integer epoch to a timestamp in table a would eliminate the view > a_timestamp and you can index the column for some speed up, the real > problem > you're facing is that your ranges don't have years, otherwise you could > store everything as a timestamp and then just join using 'between' and > postgres would just need to do an index scan on each table. > > -Adam > > Here is an updated method, it involves creating a third table to store the timestamp values from table 'b' to allow for better lookups, I've also added a timestamp column to table 'a' (which can be calculated from the epoch using the view in my previous email. create table a ( id_a integer, epoch integer, ts timestamp ); CREATE INDEX ON a (ts); create table b ( id_b integer, start_month integer, start_day integer, end_month integer, end_day integer ); CREATE OR REPLACE FUNCTION get_timestamps(sm integer, sd integer, em integer, ed integer, year integer) RETURNS timestamp without time zone[] AS $BODY$ declare syear integer := year; eyear integer := year; tstamps timestamp[]; begin if(sm>em) then -- assume that since the end month is less than the start month is in the next year eyear := eyear+1; end if; tstamps[0] := (syear|| '-' || sm || '-' || sd)::timestamp; tstamps[1] := (eyear|| '-' || em || '-' || ed)::timestamp; return tstamps; end $BODY$ LANGUAGE plpgsql STABLE; create table b_calc ( id_b integer, year integer, range_start timestamp, range_end timestamp ); CREATE INDEX ON b_calc (range_start, range_end); -- generate the ranges for the last 10 years insert into b_calc with bb as( select id_b,get_timestamps(b.start_month,b.start_day,b.end_month,b.end_day, yr) ts_arr, yr from b, generate_series(1991,2011) as yr ) select id_b,yr,ts_arr[0],ts_arr[1] from bb; -- the actual query; explain shows that it uses the two indexes to match up the ranges select id_a,id_b from a, b_calc b where a.ts between b.range_start and b.range_end