Re: [PERFORM] Duplicate deletion optimizations
Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process has to insert a few thousand rows into this
>> table, but sometimes the process has to insert an already existing row
>> (based on the values in the triplet (t_value, t_record, output_id)). In
>> this case, the row must be updated with the new count value. I've tried
>> some solutions given in this stackoverflow question [1], but the
>> insertion rate is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 50,000 random unique entries per iteration.

After 15 minutes I've got about 10 million records, each loop takes about
3 seconds. After 30 minutes the table contains approx. 18 million entries,
time per loop only slightly increased. After 90 minutes the database has
about 30 million entries. The speed has dropped to about 15-20 seconds per
loop, but the server is doing lots of other queries in parallel, so with an
unloaded server the updates should still take less than 10 seconds.

The generator runs in Perl and generates records for a maximum of 100
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..50_000)
{
        $i += rand(2000);
        print FD sprintf("%d\t%d\t%d\t%d\n",
                $i/65536, ($i/256)%255, $i%255, rand(1000));
}
close FD;

The SQL script looks like this:

\timing on
begin;
create temp table t_imp(id bigint, t_value integer, t_record integer,
                        output_id integer, count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
-- an index is not really needed, table is in memory anyway
-- create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
   set id=test.id
  from test
 where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);

-- update matching rows using primary key
update test
   set count=t_imp.count
  from t_imp
 where t_imp.id is null and test.id=t_imp.id;

-- insert missing rows
insert into test(t_value,t_record,output_id,count)
select t_value,t_record,output_id,count
  from t_imp
 where id is null;

commit;

Advantages of this solution:

- all updates are done in place, no index modifications (except for the
  inserts, of course)
- the big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110,
1.6 GHz, table and indices stored on an SSD).

Table statistics:

relid             | 14332525
schemaname        | public
relname           | test
seq_scan          | 8
seq_tup_read      | 111541821
idx_scan          | 149240169
idx_tup_fetch     | 117901695
n_tup_ins         | 30280175
n_tup_upd         | 0
n_tup_del         | 0
n_tup_hot_upd     | 0
n_live_tup        | 30264431
n_dead_tup        | 0
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count      | 0
autovacuum_count  | 0
analyze_count     | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' queries in between.

HTH.

-- 
Jochen Erwied     |   home: joc...@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de       +49-173-5404164
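As a side note, the statistics block above matches the columns of
PostgreSQL's pg_stat_user_tables view, so a query along these lines (an
assumption, not shown in the mail; the table name "test" is taken from the
script above) should reproduce it:

-- Sketch only: pg_stat_user_tables is the standard per-table statistics
-- view; the relname filter assumes the big table is called "test".
SELECT *
  FROM pg_stat_user_tables
 WHERE relname = 'test';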
Re: [PERFORM] Duplicate deletion optimizations
> It's a fairly tricky problem. I have a number of sensors producing
> energy data about every 5 minutes, but at random times between 1 and
> 15 minutes. I can't change that as that's the way the hardware of the
> sensors works. These feed into another unit, which accumulates them and
> forwards them in batches over the Internet to my PostgreSQL database
> server every few minutes (again at random times outside my control and
> with random batch sizes). To make things worse, if the Internet
> connection between the unit and the database server fails, it will send
> the latest data first to provide a quick update to the current values
> and then send the backlog of stored values. Thus, data do not always
> arrive in correct time order.

I'm stuck home with flu, so I'm happy to help ;)

I'll build an example setup to make it clearer...

-- A list of all sensors
create table sensors( sensor_id integer primary key );
insert into sensors select generate_series(1,100);

-- A table to contain raw sensor data
create table log(
        sensor_id integer not null references sensors(sensor_id),
        time      integer not null,
        value     float not null
);

-- Fill it up with test data
insert into log
select sensor_id, time, time from (
        select distinct sensor_id,
               (n+random()*10)::INTEGER as time
        from generate_series(0,50000,5) n
        cross join sensors
) d;

-- index it
alter table log add primary key( time, sensor_id );
create index log_sensor_time on log( sensor_id, time );

select * from log where sensor_id=1 order by time;
 sensor_id | time  | value
-----------+-------+-------
         1 |    12 |    12
         1 |    14 |    14
         1 |    21 |    21
         1 |    29 |    29
         1 |    30 |    30
(...)
         1 | 49996 | 49996
         1 | 50001 | 50001

-- create a table which will contain the time ticks
-- which will be used as x-axis for interpolation
-- (in this example, one tick every 10 time units)
create table ticks(
        time integer primary key,
        check( time%10 = 0 )
);
insert into ticks
select generate_series( 0, (select max(time) from log), 10 );

-- create interpolated values table
create table interp(
        sensor_id integer not null references sensors( sensor_id ),
        time      integer not null references ticks( time ),
        value     float,
        distance  integer not null
);

-- fill interpolated values table
-- (pretty slow)
insert into interp
select
        sensor_id,
        t.time,
        start_value + (end_value-start_value)*(t.time-start_time)/(end_time-start_time),
        greatest( t.time - start_time, end_time - t.time )
from
        (select
                sensor_id,
                lag(time)  over (partition by sensor_id order by time) as start_time,
                time                                                   as end_time,
                lag(value) over (partition by sensor_id order by time) as start_value,
                value                                                  as end_value
        from log
        ) as l
join ticks t on (t.time >= start_time and t.time < end_time);

-- alternate query if you don't like the ticks table (same result):
insert into interp
select
        sensor_id,
        time,
        start_value + (end_value-start_value)*(time-start_time)/(end_time-start_time),
        greatest( time - start_time, end_time - time )
from
        (select
                *,
                generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 ) as time
        from
                (select
                        sensor_id,
                        lag(time)  over (partition by sensor_id order by time) as start_time,
                        time                                                   as end_time,
                        lag(value) over (partition by sensor_id order by time) as start_value,
                        value                                                  as end_value
                from log
                ) as l
        ) l;

alter table interp add primary key( time, sensor_id );
create index interp_sensor_time on interp( sensor_id, time );

For each interval in the log table that contains a time tick, this query
generates the interpolated data at that tick.
Note that the "distance" field represents the distance (in time) between
the interpolated value and the farthest real data point that was used to
calculate it. Therefore, it can be used as a measure of the quality of the
interpolated point; if the distance is greater than some threshold, the
value might not be that precise.

Now, suppose we receive a bunch of data. The data isn't ordered according
to time. There are two possibilities:

- the new data starts right where we left off (i.e., just after the last
  time for each sensor in table log)
- the new data starts later in time, and we want to process the results
  right away, expecting to receive, at some later point, older data to
  fill the holes

The second one is hairier, let's do that. Anyway, let's create a packet:

-- A table to contain raw sensor data
create temporary table packet(
        sensor_id integer not null,
        time      integer not null,
        value     float not null
);

-- Fill it up with test data
insert into packet
select sensor_id, time, time from (
        select distinct sensor_id,
               (n+random()*10)::INTEGER as time
        from generate_series(50200,50400) n
        cross join sensors
) d;

Note that I deliberately inserted a hole: the log table contains data up
to around time 50000, while the packet starts at 50200.
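As an aside on the "distance" column described above, a quality filter
over the interpolated data could be as simple as the following sketch (the
threshold of 30 is arbitrary, not something from the original message):

-- Flag interpolated points that sit far from any real measurement.
select sensor_id, time, value, distance
  from interp
 where distance > 30          -- arbitrary threshold, tune to taste
 order by sensor_id, time;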
Re: [PERFORM] Duplicate deletion optimizations
Yes, but it should become a bit slower if you fix your code :-)

  where t_imp.id is null and test.id=t_imp.id;
=>
  where t_imp.id is not null and test.id=t_imp.id;

And a partial index on the matching rows might help (should be tested):

  (after the first update)
  create index t_imp_ix on t_imp(t_value,t_record,output_id)
         where t_imp.id is not null;

regards,

Marc Mamin

-----Original Message-----
From: pgsql-performance-ow...@postgresql.org on behalf of Jochen Erwied
Sent: Sat 1/7/2012 12:57
To: anto...@inaps.org
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations

[...]
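Applied to Jochen's script above, the corrected second step and the
suggested partial index would look roughly like this (a sketch only,
untested, using the same t_imp/test names as in that script):

-- The fix flips the NULL test so the second UPDATE only touches rows
-- that actually matched in the first UPDATE:
update test
   set count = t_imp.count
  from t_imp
 where t_imp.id is not null      -- was: t_imp.id is null
   and test.id = t_imp.id;

-- Marc's suggested partial index (untested), created after the first
-- UPDATE has filled in t_imp.id for the matching rows:
create index t_imp_ix on t_imp(t_value, t_record, output_id)
 where id is not null;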
Re: [PERFORM] Duplicate deletion optimizations
If a solution with a temp table is acceptable, I think the steps could be
reduced to:

• copy to temp_imp (the temp table does not have an id column)
• update live set count = temp_imp.count from temp_imp using (col1,col2,col3)
• insert into live from temp where col1, col2 and col3 not exists in live

Kind Regards,

Misa

Sent from my Windows Phone

From: Jochen Erwied
Sent: 07/01/2012 12:58
To: anto...@inaps.org
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations

[...]
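Spelled out in SQL, the three steps described above might look something
like this (a sketch only; "live", "temp_imp", col1..col3 and count are
placeholder names, and 'data.in' stands in for whatever file holds the
batch):

-- step 1: copy the batch into a temp table without an id column
create temp table temp_imp(col1 integer, col2 integer, col3 integer,
                           count bigint);
\copy temp_imp (col1, col2, col3, count) from 'data.in'

-- step 2: update rows whose key triplet already exists in the live table
update live
   set count = temp_imp.count
  from temp_imp
 where (live.col1, live.col2, live.col3)
     = (temp_imp.col1, temp_imp.col2, temp_imp.col3);

-- step 3: insert rows whose key triplet is not in the live table yet
insert into live (col1, col2, col3, count)
select col1, col2, col3, count
  from temp_imp t
 where not exists (
        select 1
          from live l
         where (l.col1, l.col2, l.col3) = (t.col1, t.col2, t.col3));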
Re: [PERFORM] Duplicate deletion optimizations
Saturday, January 7, 2012, 3:02:10 PM you wrote:

> • insert into live from temp where col1, col2 and col3 not exists in
>   live

'not exists' is something I'm trying to avoid, even if the optimizer is
able to handle it.

-- 
Jochen Erwied
Re: [PERFORM] Duplicate deletion optimizations
On Fri, Jan 6, 2012 at 6:35 AM, <anto...@inaps.org> wrote:
> Hello,
>
> I've a table with approximately 50 million rows with a schema like this:
>
>     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
>     t_value integer NOT NULL DEFAULT 0,
>     t_record integer NOT NULL DEFAULT 0,
>     output_id integer NOT NULL DEFAULT 0,
>     count bigint NOT NULL DEFAULT 0,
>     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
>
> Every 5 minutes, a process has to insert a few thousand rows into this
> table, but sometimes the process has to insert an already existing row
> (based on the values in the triplet (t_value, t_record, output_id)). In
> this case, the row must be updated with the new count value. I've tried
> some solutions given in this stackoverflow question [1], but the
> insertion rate is always too low for my needs.

What are your needs?  It should take no special hardware or coding to be
able to manage a few thousand rows over 5 minutes.

> So, I've decided to do it in two steps:
>
> - I insert all my new data with a COPY command
> - When it's done, I run a delete query to remove the oldest duplicates
>
> Right now, my delete query looks like this:
>
>     SELECT min(id) FROM stats_5mn
>     GROUP BY t_value, t_record, output_id
>     HAVING count(*) > 1;
>
> The duration of the query on my test machine with approx. 16 million
> rows is ~18s.
>
> To reduce this duration, I've tried to add an index on my triplet:
>
>     CREATE INDEX test
>     ON stats_5mn
>     USING btree
>     (t_value, t_record, output_id);
>
> By default, the PostgreSQL planner doesn't want to use my index and does
> a sequential scan [2], but if I force it with "SET enable_seqscan = off",
> the index is used [3] and the query duration is lowered to ~5s.
>
> My questions:
>
> - Why does the planner refuse to use my index?

It thinks that using the index will be about 9 times more expensive than
the full scan.  Probably your settings for seq_page_cost and
random_page_cost are such that the planner thinks that nearly every buffer
read is going to be from disk.  But in reality (in this case) your data is
all in memory.  So the planner is mis-estimating.  (It would help verify
this if you did your EXPLAIN ANALYZE with BUFFERS as well.)

But before trying to fix this by tweaking settings, will the real case
always be like your test case?  If the data stops being all in memory,
either because the problem size increases or because you have to compete
for buffer space with other things going on, then using the index scan
could be catastrophic.

Cheers,

Jeff
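A sketch of the checks Jeff mentions, reusing the query from the quoted
mail; the values shown for the settings are the stock defaults and the SET
at the end is only an illustration, not a recommendation:

-- 1. re-run the plan with buffer statistics to see how much actually
--    comes from shared buffers vs. disk
EXPLAIN (ANALYZE, BUFFERS)
SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;

-- 2. inspect the cost settings the planner is working with
SHOW seq_page_cost;       -- default 1.0
SHOW random_page_cost;    -- default 4.0

-- 3. if (and only if) the data really is always cached, a lower value
--    can be tried for the current session before changing it globally
SET random_page_cost = 2.0;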
Re: [PERFORM] Duplicate deletion optimizations
It was not a query, just a sentence saying that some index values in one
table do not exist in another... So the query could be written with:

• WHERE (col1,col2,col3) NOT IN
• WHERE NOT EXISTS
• LEFT JOIN live USING (col1,col2,col3) WHERE live.id IS NULL

whatever anyone prefers, or whatever gives better results... But I think
which one is better is more a matter of personal feeling than of real
difference...

Sent from my Windows Phone

From: Jochen Erwied
Sent: 07/01/2012 15:18
To: Misa Simic
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations

[...]
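For reference, the three formulations spelled out against the same
placeholder names as in the earlier sketch (live, temp_imp, col1..col3);
which one the planner handles best is exactly the kind of thing EXPLAIN
ANALYZE would have to settle:

-- NOT IN (fine here because the key columns are NOT NULL)
insert into live (col1, col2, col3, count)
select col1, col2, col3, count
  from temp_imp
 where (col1, col2, col3) not in (select col1, col2, col3 from live);

-- NOT EXISTS
insert into live (col1, col2, col3, count)
select col1, col2, col3, count
  from temp_imp t
 where not exists (
        select 1
          from live l
         where (l.col1, l.col2, l.col3) = (t.col1, t.col2, t.col3));

-- anti-join: LEFT JOIN ... WHERE live.id IS NULL
insert into live (col1, col2, col3, count)
select t.col1, t.col2, t.col3, t.count
  from temp_imp t
  left join live l using (col1, col2, col3)
 where l.id is null;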
Re: [PERFORM] Duplicate deletion optimizations
Hi Pierre!

On 7 January 2012 12:20, Pierre C wrote:
> I'm stuck home with flu, so I'm happy to help ;)
[...]
> I'll build an example setup to make it clearer...
[...]

That's almost identical to my tables. :-)

> Note that the "distance" field represents the distance (in time) between
> the interpolated value and the farthest real data point that was used to
> calculate it. Therefore, it can be used as a measure of the quality of
> the interpolated point ; if the distance is greater than some threshold,
> the value might not be that precise.

Nice idea!

> Although this query is huge, it's very fast, since it doesn't hit the big
> tables with any seq scans (hence the max() and min() tricks to use the
> indexes instead).

And it can easily be tamed by putting parts of it into stored PL/pgSQL
functions.

> I love how postgres can blast that huge pile of SQL in, like, 50 ms...

Yes, indeed. It's incredibly fast. Brilliant!

> If there is some overlap between packet data and data already in the log,
> you might get some division by zero errors, in this case you'll need to
> apply a DISTINCT somewhere (or simply replace the UNION ALL with an
> UNION, which might be wiser anyway...)

I do have a unique constraint on the actual table to prevent duplicate
data in case of retransmission after a failed connect. It's easy enough to
delete the rows from packet that already exist in the main table with a
short one-line SQL delete statement before the interpolation and merge.

> Tada.

:-)

> Enjoy !

I certainly will. Many thanks for those great lines of SQL!

Hope you recover from your flu quickly!

All the best,
Marc
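Such a cleanup could look roughly like this (a sketch, not Marc's actual
statement), using the packet and log tables from Pierre's example setup:

-- Drop incoming rows that were already stored, e.g. after a
-- retransmission, before running the interpolation and merge.
delete from packet p
 using log l
 where l.sensor_id = p.sensor_id
   and l.time = p.time;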