Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Kevin Grittner
Alessandro Gagliardi wrote:
> All of our servers run in UTC specifically to avoid this sort of
> problem. It's kind of annoying actually, because we're a San
> Francisco company and so whenever I have to do daily analytics, I
> have to shift everything to Pacific. But in this case it's handy.

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
Ah, yes, that makes sense. Thank you!

On Thu, Mar 1, 2012 at 11:39 AM, Claudio Freire wrote:
> On Thu, Mar 1, 2012 at 4:35 PM, Alessandro Gagliardi
> wrote:
> > Interesting solution. If I'm not mistaken, this does solve the problem of
> > having two entries for the same user at the exact same ti

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Claudio Freire
On Thu, Mar 1, 2012 at 4:39 PM, Claudio Freire wrote:
>> Interesting solution. If I'm not mistaken, this does solve the problem of
>> having two entries for the same user at the exact same time (which violates
>> my pk constraint) but it does so by leaving both of them out (since there is
>> no au

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Claudio Freire
On Thu, Mar 1, 2012 at 4:35 PM, Alessandro Gagliardi wrote:
> Interesting solution. If I'm not mistaken, this does solve the problem of
> having two entries for the same user at the exact same time (which violates
> my pk constraint) but it does so by leaving both of them out (since there is
> no

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
Interesting solution. If I'm not mistaken, this does solve the problem of having two entries for the same user at the exact same time (which violates my pk constraint) but it does so by leaving both of them out (since there is no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right? On
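(For reference, a sketch of an alternative that sidesteps the tie problem discussed here: PostgreSQL's DISTINCT ON keeps exactly one row per group even when timestamps are equal. This is not the query from the thread; the table and column names are taken from the original post, and the log_id tie-breaker is an assumption.)

    -- Sketch only (not the query under discussion): DISTINCT ON keeps exactly
    -- one row per (user_id, hour), even when hr_timestamp values tie; the
    -- log_id ORDER BY key is an assumed tie-breaker.
    SELECT DISTINCT ON (user_id, date_trunc('hour', hr_timestamp))
           date_trunc('hour', hr_timestamp) AS activity_hour,
           user_id, client_ip, hr_timestamp, locale, log_id, method,
           server_ip, uri, user_agent
    FROM activity_unlogged
    ORDER BY user_id, date_trunc('hour', hr_timestamp),
             hr_timestamp DESC, log_id DESC;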

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
Hah! Yeah, that might work. Except that I suck at grep. :( Perhaps that's a
weakness I should remedy.

On Thu, Mar 1, 2012 at 10:35 AM, Craig James wrote:
> On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
> wrote:
> > Hi folks,
> >
> > I have a system that racks up about 40M log line

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
All of our servers run in UTC specifically to avoid this sort of problem. It's kind of annoying actually, because we're a San Francisco company and so whenever I have to do daily analytics, I have to shift everything to Pacific. But in this case it's handy. Thanks for the keen eye though. On Thu,
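(For reference, one way to do the UTC-to-Pacific shift mentioned here directly in SQL; a sketch only, assuming hr_timestamp stores UTC wall-clock times:)

    -- Sketch: group UTC-stored activity by Pacific calendar day.
    -- The first AT TIME ZONE marks the stored value as UTC, the second
    -- converts it to local Pacific time (DST-aware).
    SELECT date_trunc('day', hr_timestamp AT TIME ZONE 'UTC'
                              AT TIME ZONE 'America/Los_Angeles') AS pacific_day,
           count(*) AS events
    FROM activity_unlogged
    GROUP BY 1
    ORDER BY 1;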

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
I was thinking of adding an index, but thought it would be pointless since I would only be using the index once before dropping the table (after it's loaded into hourly_activity). I assumed it would take longer to create the index and then use it than to just seq scan once or twice. Am I wrong in th
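(For reference, a quick way to answer this question empirically is to time both plans; the query and index below are illustrative assumptions, not taken from the thread:)

    -- Illustrative only: time the seq-scan aggregation, then build an index
    -- and time the same query again; add the index build time to the second
    -- figure before comparing.
    EXPLAIN ANALYZE
    SELECT user_id, date_trunc('hour', hr_timestamp) AS activity_hour,
           max(hr_timestamp) AS last_seen
    FROM activity_unlogged
    GROUP BY 1, 2;

    CREATE INDEX activity_unlogged_user_hour_idx
        ON activity_unlogged (user_id, date_trunc('hour', hr_timestamp), hr_timestamp);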

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Kevin Grittner
Alessandro Gagliardi wrote:
> hr_timestamp timestamp without time zone,

In addition to the responses which more directly answer your question, I feel I should point out that this will not represent a single moment in time. At the end of Daylight Saving Time, the value will jump backward and
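(For reference, a small illustration of the fall-back ambiguity Kevin is describing; the example date is the 2012 US DST change, chosen for illustration and not taken from the thread:)

    -- Illustration: 01:30 occurs twice on 2012-11-04 in America/Los_Angeles.
    -- A timestamptz can tell the two apart; a plain timestamp stores only
    -- "01:30" and cannot.
    SET timezone = 'America/Los_Angeles';
    SELECT timestamptz '2012-11-04 01:30:00-07' AS first_0130,    -- still PDT
           timestamptz '2012-11-04 01:30:00-08' AS second_0130;   -- after fall-back, PST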

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Peter van Hardenberg
On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi wrote:
> Now, I want to reduce that data to get the last activity that was performed
> by each user in any given hour. It should fit into a table like this:
>

How about:
1) Create an expression based index on date_trunc('hour', hr_timestamp)
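(For reference, a sketch of what step 1 might look like, assuming the table and column names from the original post; date_trunc on a plain timestamp is immutable, so it is allowed in an index expression:)

    -- Sketch of step 1: an expression index on the truncated hour.
    CREATE INDEX activity_unlogged_hour_idx
        ON activity_unlogged (date_trunc('hour', hr_timestamp));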

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Craig James
On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi wrote:
> Hi folks,
>
> I have a system that racks up about 40M log lines per day. I'm able to COPY
> the log files into a PostgreSQL table that looks like this:

Since you're using a COPY command and the table has a simple column with exactly t
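(For reference, a minimal sketch of the kind of COPY being described; the file path, CSV format, and column list are placeholders and assumptions, not details from the thread:)

    -- Placeholder sketch: bulk-load one day's log file into the staging table.
    COPY activity_unlogged (user_id, client_ip, hr_timestamp, locale, log_id,
                            method, server_ip, uri, user_agent)
    FROM '/path/to/activity.log'
    WITH (FORMAT csv);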

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Claudio Freire
On Thu, Mar 1, 2012 at 3:27 PM, Alessandro Gagliardi wrote:
> INSERT INTO hourly_activity
>     SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,
>                     activity_unlogged.user_id,
>                     client_ip, hr_timestamp, locale, log_id, method,
>                     server_ip, uri, user_agent
>
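(Claudio's reply itself is cut off above. For reference, since the quoted SELECT DISTINCT cannot collapse rows to one per user per hour while hr_timestamp and log_id still differ between rows, a window-function rewrite is one standard alternative shape; this is a sketch, not a query proposed in the thread, and the target column order is assumed to match the quoted INSERT.)

    -- Sketch (not from the thread): rank rows within each (user, hour) group
    -- and keep only the latest one per group.
    INSERT INTO hourly_activity
    SELECT activity_hour, user_id, client_ip, hr_timestamp, locale, log_id,
           method, server_ip, uri, user_agent
    FROM (SELECT date_trunc('hour', hr_timestamp) AS activity_hour,
                 user_id, client_ip, hr_timestamp, locale, log_id, method,
                 server_ip, uri, user_agent,
                 row_number() OVER (
                     PARTITION BY user_id, date_trunc('hour', hr_timestamp)
                     ORDER BY hr_timestamp DESC) AS rn
          FROM activity_unlogged) t
    WHERE rn = 1;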

[PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
Hi folks,

I have a system that racks up about 40M log lines per day. I'm able to COPY the log files into a PostgreSQL table that looks like this:

CREATE TABLE activity_unlogged
(
  user_id character(24) NOT NULL,
  client_ip inet,
  hr_timestamp timestamp without time zone,
  locale character var
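(The CREATE TABLE statement is cut off above. For reference, a hypothetical shape for the target table, reconstructed only from column names mentioned elsewhere in this thread; every type here, and the composite primary key implied by the "pk constraint" discussed in the replies, are assumptions.)

    -- Hypothetical target table, reconstructed from column names mentioned in
    -- the thread; the types and the primary key are assumptions.
    CREATE TABLE hourly_activity
    (
      activity_hour timestamp without time zone NOT NULL,
      user_id       character(24)               NOT NULL,
      client_ip     inet,
      hr_timestamp  timestamp without time zone,
      locale        character varying,
      log_id        character varying,
      method        character varying,
      server_ip     inet,
      uri           character varying,
      user_agent    character varying,
      PRIMARY KEY (user_id, activity_hour)
    );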