I would also try raising the statistics targets on the service_id and follower_id columns, and tuning the distinct-value estimates the optimizer collects for them.
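For example, something along these lines (a sketch only; the target of 1000 and the n_distinct override are illustrative guesses that would need validating against your own data):

  -- raise the per-column statistics targets so ANALYZE samples more rows
  ALTER TABLE persona_followers ALTER COLUMN service_id  SET STATISTICS 1000;
  ALTER TABLE persona_followers ALTER COLUMN follower_id SET STATISTICS 1000;

  -- optionally override the distinct-value estimate directly
  -- (a negative value means "fraction of the row count")
  ALTER TABLE persona_followers ALTER COLUMN follower_id SET (n_distinct = -0.5);

  -- recollect statistics under the new settings
  ANALYZE persona_followers;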
Cheers,
Amador A.

On Wed, Apr 24, 2013 at 9:04 AM, François Beausoleil <franc...@teksol.info> wrote:
> Hi!
>
> On 2013-04-17, at 14:15, Jeff Janes wrote:
>
>> On Wed, Apr 17, 2013 at 4:26 AM, François Beausoleil <franc...@teksol.info> wrote:
>>>
>>> Insert on public.persona_followers (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
>>>   Buffers: shared hit=33135295 read=4776921
>>>   ->  Subquery Scan on t1 (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)
>>
>> It looks like 12% of the time is being spent figuring out what rows to
>> insert, and 88% actually doing the insertions.
>>
>> So I think that index maintenance is killing you. You could try adding
>> a sort to your select so that rows are inserted in index order, or
>> inserting in batches in which the batches are partitioned by service_id
>> (which is almost the same thing as sorting, since service_id is the
>> lead column).
>
> To close out the thread, the final results are in
> http://explain.depesz.com/s/xOe :
>
> Insert on public.persona_followers (cost=149905.33..149906.58 rows=100 width=24) (actual time=19.837..19.837 rows=0 loops=1)
>   Buffers: shared hit=206, local hit=1 read=105
>   ->  Sort (cost=149905.33..149905.58 rows=100 width=24) (actual time=19.534..19.536 rows=6 loops=1)
>         Output: persona_followers_import.service_id, persona_followers_import.follower_id, (min(persona_followers_import.valid_at))
>         Sort Key: persona_followers_import.follower_id
>         Sort Method: quicksort  Memory: 25kB
>         Buffers: shared hit=176, local hit=1 read=105
>         ->  HashAggregate (cost=149901.01..149902.01 rows=100 width=24) (actual time=19.514..19.526 rows=6 loops=1)
>               Output: persona_followers_import.service_id, persona_followers_import.follower_id, min(persona_followers_import.valid_at)
>               Buffers: shared hit=176, local hit=1 read=105
>               ->  Bitmap Heap Scan on pg_temp_35.persona_followers_import (cost=93051.86..149734.25 rows=22234 width=24) (actual time=14.350..19.505 rows=6 loops=1)
>                     Output: persona_followers_import.service_id, persona_followers_import.follower_id, persona_followers_import.valid_at
>                     Recheck Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
>                     Filter: (NOT (hashed SubPlan 1))
>                     Buffers: shared hit=176, local hit=1 read=105
>                     ->  Bitmap Index Scan on persona_followers_import_service_id (cost=0.00..1134.32 rows=44469 width=0) (actual time=1.752..1.752 rows=10000 loops=1)
>                           Index Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
>                           Buffers: local hit=1 read=40
>                     SubPlan 1
>                       ->  Bitmap Heap Scan on public.persona_followers (cost=661.54..91851.35 rows=24252 width=8) (actual time=2.309..6.400 rows=14317 loops=1)
>                             Output: public.persona_followers.follower_id
>                             Recheck Cond: (public.persona_followers.service_id = 362513855)
>                             Buffers: shared hit=176
>                             ->  Bitmap Index Scan on persona_followers_pkey (cost=0.00..655.48 rows=24252 width=0) (actual time=2.284..2.284 rows=14317 loops=1)
>                                   Index Cond: (public.persona_followers.service_id = 362513855)
>                                   Buffers: shared hit=88
> Total runtime: 19.917 ms
>
> Runtime is under 20 milliseconds per imported service_id. I have a few
> thousand such items per day, and that's fine.
> The final script looks like this:
>
> CREATE TEMPORARY TABLE persona_followers_import(
>     service_id bigint
>   , follower_id bigint
>   , valid_at timestamptz
> );
>
> COPY persona_followers_import FROM stdin;
> ...
> \.
>
> CREATE INDEX index_persona_followers_import_on_service_id
>   ON persona_followers_import(service_id, follower_id);
>
> service_ids := SELECT DISTINCT service_id FROM persona_followers_import;
> for each service_id in service_ids:
>   EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS )
>   INSERT INTO persona_followers(service_id, follower_id, valid_at)
>   SELECT service_id, follower_id, MIN(valid_at)
>   FROM persona_followers_import
>   WHERE follower_id IS NOT NULL
>     AND follower_id NOT IN (SELECT follower_id FROM persona_followers
>                             WHERE service_id = :service_id)
>     AND service_id = :service_id
>   GROUP BY service_id, follower_id
>   ORDER BY follower_id;
>
> This seems to give me the best possible throughput. I was able to import
> days of data in an hour, compared to hours of work for one day of data.
>
> Thanks for all suggestions, and PostgreSQL rocks!
> François Beausoleil
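For anyone wanting to run the pseudocode loop above entirely server-side, a minimal PL/pgSQL sketch follows. The DO-block form and the variable name sid are assumptions on my part, not the script actually used in the thread; the original ran the loop client-side so that each INSERT could be wrapped in EXPLAIN:

  DO $$
  DECLARE
    sid bigint;  -- loop variable, named to avoid clashing with the service_id columns
  BEGIN
    -- one batch per service_id, mirroring the per-service loop above
    FOR sid IN SELECT DISTINCT service_id FROM persona_followers_import LOOP
      INSERT INTO persona_followers(service_id, follower_id, valid_at)
      SELECT service_id, follower_id, MIN(valid_at)
        FROM persona_followers_import
       WHERE follower_id IS NOT NULL
         AND follower_id NOT IN (SELECT follower_id
                                   FROM persona_followers
                                  WHERE service_id = sid)
         AND service_id = sid
       GROUP BY service_id, follower_id
       ORDER BY follower_id;  -- insert in index order within the batch
    END LOOP;
  END;
  $$;

Keeping the ORDER BY follower_id inside each service_id batch preserves the index-order insertions Jeff recommended.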