Re: Bulk DML performance

Renan Alves Fonseca Mon, 17 Mar 2025 10:46:57 -0700

Hi,
Here are some observations.

Em seg., 17 de mar. de 2025 às 09:19, <bill.po...@ymail.com> escreveu:


> > PostgreSQL has a lot of overhead per row.
>
> Okay, thanks. I'm not actually too worried about this since in my
> scenario, each row is about 1.5 kB, so the % overhead is negligible.
>
> > It is probably not the lookup, but the *modification* of the index that
> is slow.
>
> Yes that makes sense for the original 3 million inserts, but when I
> perform the update of the 3 million rows, the index doesn't change - they
> are all HOT updates.
>

Using "perf" I can see that the overhead is indeed due to index lookup when
we do HOT updates.


> > Then the best you can do is to use COPY rather than INSERT. It will
> perform better (but [not] vastly better).
>
> I need to perform a merge (INSERT ... ON CONFLICT ... DO UPDATE) on the
> data, so sadly I cannot use COPY.
>
> I have discovered that for some reason, performing the original insert
> without the ON CONFLICT statement is twice as fast as performing the
> original insert with an ON CONFLICT ... DO UPDATE clause, completing in 4
> seconds instead of 8. That seems strange to me because I wouldn't have
> thought it would be doing any additional work since a unique constraint is
> on the primary key, so each inserted value would need to be checked in
> either case, and there is no extra work to be done in either case.
>
>
In the INSERT case, we do not check the unique constraint for each row. We
run into an error when inserting a duplicate, aborting the operation.


> INSERT INTO test (id, text1)
> SELECT generate_series, 'x'
> FROM generate_series(1, 3000000)
>
> It remains 4 seconds even when adding a clause to not insert duplicates.
>
> INSERT INTO test (id, text1)
> SELECT generate_series, 'x'
> FROM generate_series(1, 3000000)
> WHERE NOT EXISTS (
>   SELECT 1
>   FROM test4
>   WHERE id = generate_series
> )
>
>
In this case, we are not checking duplicates inside the input dataset. If
you can guarantee, at the application level, that there are no duplicates,
this seems a good speedup. Perhaps the MERGE clause...


> Furthermore, I have found that performing an UPDATE rather than an INSERT
> ... ON CONFLICT ... DO UPDATE is twice as slow, completing in 16 seconds
> instead of 14 seconds.
>
> UPDATE test
> SET text1 = 'x'
> FROM generate_series(1, 3000000)
> WHERE test4.id = generate_series
>
> This also now means that updating 3 million rows takes 4x longer than
> inserting those rows. Do we expect updates to be 4x slower than inserts?
>
>
It is not the update that is slower. It is the attached where clause that
makes it slower. Try:
UPDATE test SET text1='x';

In my tests, the update of non-indexed columns is slightly faster than an
insert.

Regards,
Renan Fonseca

Re: Bulk DML performance

Reply via email to