I would also try raising the statistics targets on the service_id and follower_id columns, and tuning the distinct-value estimates the optimizer collects for them.
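For example, something along these lines (a sketch only; the target of 1000 and the n_distinct override are illustrative guesses that would need validating against your own data):

  -- raise the per-column statistics targets so ANALYZE samples more rows
  ALTER TABLE persona_followers ALTER COLUMN service_id  SET STATISTICS 1000;
  ALTER TABLE persona_followers ALTER COLUMN follower_id SET STATISTICS 1000;

  -- optionally override the distinct-value estimate directly
  -- (a negative value means "fraction of the row count")
  ALTER TABLE persona_followers ALTER COLUMN follower_id SET (n_distinct = -0.5);

  -- recollect statistics under the new settings
  ANALYZE persona_followers;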
Cheers,
Amador A.

On Wed, Apr 24, 2013 at 9:04 AM, François Beausoleil <franc...@teksol.info> wrote:
> Hi!
>
> On 2013-04-17, at 14:15, Jeff Janes wrote:
>
>> On Wed, Apr 17, 2013 at 4:26 AM, François Beausoleil <franc...@teksol.info> wrote:
>>>
>>> Insert on public.persona_followers (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
>>>   Buffers: shared hit=33135295 read=4776921
>>>   ->  Subquery Scan on t1 (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)
>>
>> It looks like 12% of the time is being spent figuring out what rows to
>> insert, and 88% actually doing the insertions.
>>
>> So I think that index maintenance is killing you. You could try adding
>> a sort to your select so that rows are inserted in index order, or
>> inserting in batches in which the batches are partitioned by service_id
>> (which is almost the same thing as sorting, since service_id is the
>> lead column).
>
> To close out the thread, the final results are in
> http://explain.depesz.com/s/xOe :
>
> Insert on public.persona_followers (cost=149905.33..149906.58 rows=100 width=24) (actual time=19.837..19.837 rows=0 loops=1)
>   Buffers: shared hit=206, local hit=1 read=105
>   ->  Sort (cost=149905.33..149905.58 rows=100 width=24) (actual time=19.534..19.536 rows=6 loops=1)
>         Output: persona_followers_import.service_id, persona_followers_import.follower_id, (min(persona_followers_import.valid_at))
>         Sort Key: persona_followers_import.follower_id
>         Sort Method: quicksort  Memory: 25kB
>         Buffers: shared hit=176, local hit=1 read=105
>         ->  HashAggregate (cost=149901.01..149902.01 rows=100 width=24) (actual time=19.514..19.526 rows=6 loops=1)
>               Output: persona_followers_import.service_id, persona_followers_import.follower_id, min(persona_followers_import.valid_at)
>               Buffers: shared hit=176, local hit=1 read=105
>               ->  Bitmap Heap Scan on pg_temp_35.persona_followers_import (cost=93051.86..149734.25 rows=22234 width=24) (actual time=14.350..19.505 rows=6 loops=1)
>                     Output: persona_followers_import.service_id, persona_followers_import.follower_id, persona_followers_import.valid_at
>                     Recheck Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
>                     Filter: (NOT (hashed SubPlan 1))
>                     Buffers: shared hit=176, local hit=1 read=105
>                     ->  Bitmap Index Scan on persona_followers_import_service_id (cost=0.00..1134.32 rows=44469 width=0) (actual time=1.752..1.752 rows=10000 loops=1)
>                           Index Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
>                           Buffers: local hit=1 read=40
>                     SubPlan 1
>                       ->  Bitmap Heap Scan on public.persona_followers (cost=661.54..91851.35 rows=24252 width=8) (actual time=2.309..6.400 rows=14317 loops=1)
>                             Output: public.persona_followers.follower_id
>                             Recheck Cond: (public.persona_followers.service_id = 362513855)
>                             Buffers: shared hit=176
>                             ->  Bitmap Index Scan on persona_followers_pkey (cost=0.00..655.48 rows=24252 width=0) (actual time=2.284..2.284 rows=14317 loops=1)
>                                   Index Cond: (public.persona_followers.service_id = 362513855)
>                                   Buffers: shared hit=88
> Total runtime: 19.917 ms
>
> Runtime is under 20 milliseconds per imported service_id. I have a few
> thousand such items per day, and that's fine.
> The final script looks like this:
>
> CREATE TEMPORARY TABLE persona_followers_import(
>     service_id bigint
>   , follower_id bigint
>   , valid_at timestamptz
> );
>
> COPY persona_followers_import FROM stdin;
> ...
> \.
>
> CREATE INDEX index_persona_followers_import_on_service_id
>   ON persona_followers_import(service_id, follower_id);
>
> service_ids := SELECT DISTINCT service_id FROM persona_followers_import;
> for each service_id in service_ids:
>   EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS )
>   INSERT INTO persona_followers(service_id, follower_id, valid_at)
>   SELECT service_id, follower_id, MIN(valid_at)
>   FROM persona_followers_import
>   WHERE follower_id IS NOT NULL
>     AND follower_id NOT IN (SELECT follower_id FROM persona_followers
>                             WHERE service_id = :service_id)
>     AND service_id = :service_id
>   GROUP BY service_id, follower_id
>   ORDER BY follower_id;
>
> This seems to give me the best possible throughput. I was able to import
> days of data in an hour, compared to hours of work for one day of data.
>
> Thanks for all suggestions, and PostgreSQL rocks!
> François Beausoleil
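For anyone wanting to run the pseudocode loop above entirely server-side, a minimal PL/pgSQL sketch follows. The DO-block form and the variable name sid are assumptions on my part, not the script actually used in the thread; the original ran the loop client-side so that each INSERT could be wrapped in EXPLAIN:

  DO $$
  DECLARE
    sid bigint;  -- loop variable, named to avoid clashing with the service_id columns
  BEGIN
    -- one batch per service_id, mirroring the per-service loop above
    FOR sid IN SELECT DISTINCT service_id FROM persona_followers_import LOOP
      INSERT INTO persona_followers(service_id, follower_id, valid_at)
      SELECT service_id, follower_id, MIN(valid_at)
        FROM persona_followers_import
       WHERE follower_id IS NOT NULL
         AND follower_id NOT IN (SELECT follower_id
                                   FROM persona_followers
                                  WHERE service_id = sid)
         AND service_id = sid
       GROUP BY service_id, follower_id
       ORDER BY follower_id;  -- insert in index order within the batch
    END LOOP;
  END;
  $$;

Keeping the ORDER BY follower_id inside each service_id batch preserves the index-order insertions Jeff recommended.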