Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Jochen Erwied
Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 5 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100 
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..5)
{
$i += rand(2000);
print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, 
rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id 
integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
   set id=test.id
   from test
   where 
(t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
   set count=t_imp.count
   from t_imp
   where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
   select t_value,t_record,output_id,count
  from t_imp
  where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the 
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 
GHz, table and indices stored on a SSD)

Table statistics:

relid | 14332525
schemaname| public
relname   | test
seq_scan  | 8
seq_tup_read  | 111541821
idx_scan  | 149240169
idx_tup_fetch | 117901695
n_tup_ins | 30280175
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup| 30264431
n_dead_tup| 0
last_vacuum   |
last_autovacuum   |
last_analyze  |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count  | 0
autovacuum_count  | 0
analyze_count | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Pierre C



It's a fairly tricky problem. I have a number of sensors producing
energy data about every 5 minutes, but at random times between 1 and
15 minutes. I can't change that as that's the way the hardware of the
sensors works. These feed into another unit, which accumulates them
and forwards them in batches over the Internet to my PostgreSQL
database server every few minutes (again at random times outside my
control and with random batch sizes). To make things worse, if the
Internet connection between the unit and the database server fails, it
will send the latest data first to provide a quick update to the
current values and then send the backlog of stored values. Thus, data
do not always arrive in correct time order.


I'm stuck home with flu, so I'm happy to help ;)

I'll build an example setup to make it clearer...

-- A list of all sensors
create table sensors( sensor_id integer primary key );
insert into sensors select generate_series(1,100);

-- A table to contain raw sensor data
create table log(
  sensor_id integer not null references sensors(sensor_id),
  time integer not null,
  value float not null
);

-- Fill it up with test data
insert into log
select sensor_id, time, time from (
  select distinct sensor_id,
(n+random()*10)::INTEGER as time
  from generate_series(0,5,5) n
   cross join sensors
) d;

-- index it
alter table log add primary key( time, sensor_id );
create index log_sensor_time on log( sensor_id, time );

select * from log where sensor_id=1 order by time;
 sensor_id | time  | value
---+---+---
 1 |12 |12
 1 |14 |14
 1 |21 |21
 1 |29 |29
 1 |30 |30
()
 1 | 49996 | 49996
 1 | 50001 | 50001

-- create a table which will contain the time ticks
-- which will be used as x-axis for interpolation
-- (in this example, one tick every 10 time units)

create table ticks( time integer primary key,
   check( time%10 = 0 ) );
insert into ticks select
  generate_series( 0, (select max(time) from log), 10 );

-- create interpolated values table
create table interp(
  sensor_id integer not null references sensors( sensor_id ),
  time  integer not null references ticks( time ),
  value float,
  distance  integer not null
);

-- fill interpolated values table
-- (pretty slow)

insert into interp
select
sensor_id,
t.time,
start_value +  
(end_value-start_value)*(t.time-start_time)/(end_time-start_time),

greatest( t.time - start_time, end_time-t.time )
  from
(select
  sensor_id,
  lag(time) over (partition by sensor_id order by time) as start_time,
  time as end_time,
  lag(value) over (partition by sensor_id order by time) as  
start_value,

  value as end_value
from log
) as l
  join ticks t on (t.time >= start_time and t.time < end_time);

-- alternate query if you don't like the ticks table (same sesult) :
insert into interp
select
sensor_id,
time,
start_value +  
(end_value-start_value)*(time-start_time)/(end_time-start_time),

greatest( time - start_time, end_time-time )
  from
  (select
*,
generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 ) AS  
time

from
  (select
sensor_id,
lag(time) over (partition by sensor_id order by time) as  
start_time,

time as end_time,
lag(value) over (partition by sensor_id order by time) as  
start_value,

value as end_value
  from log
  ) as l
) l;

alter table interp add primary key( time,sensor_id );
create index interp_sensor_time on interp( sensor_id, time );

For each interval in the log table that contains a time tick, this query  
generates the interpolated data at that tick.


Note that the "distance" field represents the distance (in time) between  
the interpolated value and the farthest real data point that was used to  
calculate it. Therefore, it can be used as a measure of the quality of the  
interpolated point ; if the distance is greater than some threshold, the  
value might not be that precise.


Now, suppose we receive a bunch of data. The data isn't ordered according  
to time.

There are two possibilities :

- the new data starts right where we left off (ie, just after the last  
time for each sensor in table log)
- the new data starts later in time, and we want to process the results  
right away, expecting to receive, at some later point, older data to fill  
the holes


The second one is hairier, lets' do that.

Anyway, let's create a packet :

-- A table to contain raw sensor data
create temporary table packet(
  sensor_id integer not null,
  time integer not null,
  value float not null
);

-- Fill it up with test data
insert into packet
select sensor_id, time, time from (
  select distinct sensor_id,
(n+random()*10)::INTEGER as time
  from generate_series(50200,50400) n
   cross join sensors
) d;

Note that I deliberately inserted a hole : the log table con

Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Marc Mamin
Yes, but it should become a bit slower if you fix your code :-)

  where t_imp.id is null and test.id=t_imp.id;
  =>
  where t_imp.id is not null and test.id=t_imp.id;

and a partial index on matching rows might help (should be tested):

 (after the first updat)
 create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is 
not null.

regards,
Marc Mamin

-Ursprüngliche Nachricht-
Von: pgsql-performance-ow...@postgresql.org im Auftrag von Jochen Erwied
Gesendet: Sa 1/7/2012 12:57
An: anto...@inaps.org
Cc: pgsql-performance@postgresql.org
Betreff: Re: [PERFORM] Duplicate deletion optimizations
 
Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 5 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100 
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..5)
{
$i += rand(2000);
print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, 
rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id 
integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
   set id=test.id
   from test
   where 
(t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
   set count=t_imp.count
   from t_imp
   where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
   select t_value,t_record,output_id,count
  from t_imp
  where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the 
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 
GHz, table and indices stored on a SSD)

Table statistics:

relid | 14332525
schemaname| public
relname   | test
seq_scan  | 8
seq_tup_read  | 111541821
idx_scan  | 149240169
idx_tup_fetch | 117901695
n_tup_ins | 30280175
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup| 30264431
n_dead_tup| 0
last_vacuum   |
last_autovacuum   |
last_analyze  |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count  | 0
autovacuum_count  | 0
analyze_count | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Misa Simic
If solution with temp table is acceptable - i think steps could be
reduced...

• copy to temp_imp ( temp table does not have id column)

• update live set count = temp_imp.count from temp_imp using (
col1,col2,col3)

• insert into live from temp where col1, col2 and col3 not exists in
live

Kind Regards,

Misa

Sent from my Windows Phone
From: Jochen Erwied
Sent: 07/01/2012 12:58
To: anto...@inaps.org
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations
Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 5 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..5)
{
$i += rand(2000);
print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255,
$i%255, rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record
integer,output_id integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
   set id=test.id
   from test
   where 
(t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
   set count=t_imp.count
   from t_imp
   where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
   select t_value,t_record,output_id,count
  from t_imp
  where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6
GHz, table and indices stored on a SSD)

Table statistics:

relid | 14332525
schemaname| public
relname   | test
seq_scan  | 8
seq_tup_read  | 111541821
idx_scan  | 149240169
idx_tup_fetch | 117901695
n_tup_ins | 30280175
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup| 30264431
n_dead_tup| 0
last_vacuum   |
last_autovacuum   |
last_analyze  |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count  | 0
autovacuum_count  | 0
analyze_count | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Jochen Erwied
Saturday, January 7, 2012, 3:02:10 PM you wrote:

> • insert into live from temp where col1, col2 and col3 not exists in
> live

'not exists' is something I'm trying to avoid, even if the optimizer is 
able to handle it. 

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Pierre C



It's a fairly tricky problem. I have a number of sensors producing
energy data about every 5 minutes, but at random times between 1 and
15 minutes. I can't change that as that's the way the hardware of the
sensors works. These feed into another unit, which accumulates them
and forwards them in batches over the Internet to my PostgreSQL
database server every few minutes (again at random times outside my
control and with random batch sizes). To make things worse, if the
Internet connection between the unit and the database server fails, it
will send the latest data first to provide a quick update to the
current values and then send the backlog of stored values. Thus, data
do not always arrive in correct time order.


I'm stuck home with flu, so I'm happy to help ;)

I'll build an example setup to make it clearer...

-- A list of all sensors
create table sensors( sensor_id integer primary key );
insert into sensors select generate_series(1,100);

-- A table to contain raw sensor data
create table log(
sensor_id integer not null references sensors(sensor_id),
time integer not null,
value float not null
);

-- Fill it up with test data
insert into log
select sensor_id, time, time from (
select distinct sensor_id,
  (n+random()*10)::INTEGER as time
from generate_series(0,5,5) n
 cross join sensors
) d;

-- index it
alter table log add primary key( time, sensor_id );
create index log_sensor_time on log( sensor_id, time );

select * from log where sensor_id=1 order by time;
   sensor_id | time  | value
---+---+---
   1 |12 |12
   1 |14 |14
   1 |21 |21
   1 |29 |29
   1 |30 |30
()
   1 | 49996 | 49996
   1 | 50001 | 50001

-- create a table which will contain the time ticks
-- which will be used as x-axis for interpolation
-- (in this example, one tick every 10 time units)

create table ticks( time integer primary key,
 check( time%10 = 0 ) );
insert into ticks select
generate_series( 0, (select max(time) from log), 10 );

-- create interpolated values table
create table interp(
sensor_id integer not null references sensors( sensor_id ),
time  integer not null references ticks( time ),
value float,
distance  integer not null
);

-- fill interpolated values table
-- (pretty slow)

insert into interp
select
  sensor_id,
  t.time,
  start_value +
(end_value-start_value)*(t.time-start_time)/(end_time-start_time),
  greatest( t.time - start_time, end_time-t.time )
from
  (select
sensor_id,
lag(time) over (partition by sensor_id order by time) as  
start_time,

time as end_time,
lag(value) over (partition by sensor_id order by time) as
start_value,
value as end_value
  from log
  ) as l
join ticks t on (t.time >= start_time and t.time < end_time);

-- alternate query if you don't like the ticks table (same sesult) :
insert into interp
select
  sensor_id,
  time,
  start_value +
(end_value-start_value)*(time-start_time)/(end_time-start_time),
  greatest( time - start_time, end_time-time )
from
(select
  *,
  generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 )  
AS

time
  from
(select
  sensor_id,
  lag(time) over (partition by sensor_id order by time) as
start_time,
  time as end_time,
  lag(value) over (partition by sensor_id order by time) as
start_value,
  value as end_value
from log
) as l
  ) l;

alter table interp add primary key( time,sensor_id );
create index interp_sensor_time on interp( sensor_id, time );

For each interval in the log table that contains a time tick, this query
generates the interpolated data at that tick.

Note that the "distance" field represents the distance (in time) between
the interpolated value and the farthest real data point that was used to
calculate it. Therefore, it can be used as a measure of the quality of the
interpolated point ; if the distance is greater than some threshold, the
value might not be that precise.

Now, suppose we receive a bunch of data. The data isn't ordered according
to time.
There are two possibilities :

- the new data starts right where we left off (ie, just after the last
time for each sensor in table log)
- the new data starts later in time, and we want to process the results
right away, expecting to receive, at some later point, older data to fill
the holes

The second one is hairier, lets' do that.

Anyway, let's create a packet :

-- A table to contain raw sensor data
create temporary table packet(
sensor_id integer not null,
time integer not null,
value float not null
);

-- Fill it up with test data
insert into packet
select sensor_id, time, time from (
select distinct sensor_id,
  (n+random()*10)::INTEGER as time
from generate_series(50200,50400) n

Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Jeff Janes
On Fri, Jan 6, 2012 at 6:35 AM,   wrote:
> Hello,
>
> I've a table with approximately 50 million rows with a schema like this:
>
>    id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
>    t_value integer NOT NULL DEFAULT 0,
>    t_record integer NOT NULL DEFAULT 0,
>    output_id integer NOT NULL DEFAULT 0,
>    count bigint NOT NULL DEFAULT 0,
>    CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
>
> Every 5 minutes, a process have to insert a few thousand of rows in this
> table,
> but sometime, the process have to insert an already existing row (based on
> values in the triplet (t_value, t_record, output_id). In this case, the row
> must be updated with the new count value. I've tried some solution given on
> this
> stackoverflow question [1] but the insertion rate is always too low for my
> needs.

What are your needs?  It should take no special hardware or coding to
be able to manage a few thousand rows over 5 minutes.


> So, I've decided to do it in two times:
>
>  - I insert all my new data with a COPY command
>  - When it's done, I run a delete query to remove oldest duplicates
>
> Right now, my delete query look like this:
>
>    SELECT min(id) FROM stats_5mn
>    GROUP BY t_value, t_record, output_id
>    HAVING count(*) > 1;
>
> The duration of the query on my test machine with approx. 16 million rows is
> ~18s.
>
> To reduce this duration, I've tried to add an index on my triplet:
>
>    CREATE INDEX test
>      ON stats_5mn
>      USING btree
>      (t_value , t_record , output_id );
>
> By default, the PostgreSQL planner doesn't want to use my index and do a
> sequential
> scan [2], but if I force it with "SET enable_seqscan = off", the index is
> used [3]
> and query duration is lowered to ~5s.
>
>
> My questions:
>
>  - Why the planner refuse to use my index?

It thinks that using the index will be about 9 times more expensive
than the full scan.  Probably your settings for seq_page_cost and
random_page_cost are such that the planner thinks that nearly every
buffer read is going to be from disk.  But in reality (in this case)
your data is all in memory.  So the planner is mis-estimating.  (It
would help verify this if you did your EXPLAIN ANALYZE with BUFFERS as
well).  But before trying to fix this by tweaking settings, will the
real case always be like your test case?  If the data stops being all
in memory, either because the problem size increases or because you
have to compete for buffer space with other things going on, then
using the index scan could be catastrophic.

Cheers,

Jeff

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Misa Simic
It was not query... Just sentence where some index values in one table
not exist in another...

So query could be with:
• WHERE (col1,col2,col2) NOT IN
• WHERE NOT EXISTS
• LEFT JOIN live USING (col1,col2,col2) WHERE live.id IS NULL

what ever whoever prefer more or what gives better results... But I
think it is more personal feelings which is better then real...

Sent from my Windows Phone
From: Jochen Erwied
Sent: 07/01/2012 15:18
To: Misa Simic
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations
Saturday, January 7, 2012, 3:02:10 PM you wrote:

> • insert into live from temp where col1, col2 and col3 not exists in
> live

'not exists' is something I'm trying to avoid, even if the optimizer is
able to handle it.

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Marc Eberhard
Hi Pierre!

On 7 January 2012 12:20, Pierre C  wrote:
> I'm stuck home with flu, so I'm happy to help ;)
[...]
> I'll build an example setup to make it clearer...
[...]

That's almost identical to my tables. :-)

> Note that the "distance" field represents the distance (in time) between the
> interpolated value and the farthest real data point that was used to
> calculate it. Therefore, it can be used as a measure of the quality of the
> interpolated point ; if the distance is greater than some threshold, the
> value might not be that precise.

Nice idea!

> Although this query is huge, it's very fast, since it doesn't hit the big
> tables with any seq scans (hence the max() and min() tricks to use the
> indexes instead).

And it can easily be tamed by putting parts of it into stored pgpsql functions.

> I love how postgres can blast that huge pile of SQL in, like, 50 ms...

Yes, indeed. It's incredible fast. Brilliant!

> If there is some overlap between packet data and data already in the log,
> you might get some division by zero errors, in this case you'll need to
> apply a DISTINCT somewhere (or simply replace the UNION ALL with an UNION,
> which might be wiser anyway...)

I do have a unique constraint on the actual table to prevent duplicate
data in case of retransmission after a failed connect. It's easy
enough to delete the rows from packet that already exist in the main
table with a short one line SQL delete statement before the
interpolation and merge.

> Tada.

:-

> Enjoy !

I certainly will. Many thanks for those great lines of SQL!

Hope you recover from your flu quickly!

All the best,
Marc

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance