First of all, thank you for the replies.
I am using a base installation of postgres 10.10, with no modifications to
any of the system defaults.
I am trying to speedup a join between two tables: the title table and the
cast_info table.
The title table is a table containing information about different movies.
it contains 4626969 records.
the table also has a foreign key index on the cast_info table, enabling the
planner to use a hash-join.
The cast_info table is a table containing the information of which actor
was casted in which movie and contains 62039343 records.
The database also contains a materialized view ci_t_15, defined as:
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year < 2015
I am comparing two queries, q1 and q2 respectively.
Query q1 is the original query and q2 is an attempt to reduce the cost of
execution via leveraging the materialized view ci_t_15.
Query q1 is defined as:
select * from cast_info join title on cast_info.movie_id = title.id
Query q2 is defined as
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year >= 2015
UNION ALL
select * from ci_t_15
Both queries are executed on a Dell xps laptop with an I7-8750H processor
and 16 (2*8) gb ram on an SSD running on ubuntu 18.04.2 LTS.
Running explain analyze on both queries I get the following execution plans.
q1:
"Hash Join (cost=199773.80..2561662.10 rows=62155656 width=103) (actual
time=855.063..25786.264 rows=62039343 loops=1)"
" Hash Cond: (cast_info.ci_movie_id = title.t_id)"
" -> Seq Scan on cast_info (cost=0.00..1056445.56 rows=62155656
width=42) (actual time=0.027..3837.722 rows=62039343 loops=1)"
" -> Hash (cost=92232.69..92232.69 rows=4626969 width=61) (actual
time=854.548..854.548 rows=4626969 loops=1)"
"Buckets: 65536 Batches: 128 Memory Usage: 3431kB"
"-> Seq Scan on title (cost=0.00..92232.69 rows=4626969 width=61)
(actual time=0.005..327.588 rows=4626969 loops=1)"
"Planning time: 5.097 ms"
"Execution time: 27236.088 ms"
q2:
"Append (cost=123209.65..3713445.65 rows=61473488 width=105) (actual
time=442.207..29713.621 rows=60918189 loops=1)"
" -> Gather (cost=123209.65..2412792.77 rows=10639784 width=103) (actual
time=442.206..14634.427 rows=10046633 loops=1)"
"Workers Planned: 2"
"Workers Launched: 2"
"-> Hash Join (cost=122209.65..1347814.37 rows=4433243 width=103)
(actual time=471.969..12527.840 rows=3348878 loops=3)"
" Hash Cond: (cast_info.ci_movie_id = title.t_id)"
" -> Parallel Seq Scan on cast_info (cost=0.00..693870.90
rows=25898190 width=42) (actual time=0.006..7302.679 rows=20679781 loops=3)"
" -> Hash (cost=103800.11..103800.11 rows=792043 width=61)
(actual time=471.351..471.351 rows=775098 loops=3)"
"Buckets: 65536 Batches: 32 Memory Usage: 2515kB"
"-> Seq Scan on title (cost=0.00..103800.11
rows=792043 width=61) (actual time=0.009..376.127 rows=775098 loops=3)"
" Filter: (t_production_year >= 2015)"
" Rows Removed by Filter: 3851871"
" -> Seq Scan on ci_t_15 (cost=0.00..1194255.04 rows=50833704 width=105)
(actual time=1.143..11967.391 rows=50871556 loops=1)"
"Planning time: 0.268 ms"
"Execution time: 31379.854 ms"
Due to using the materialized view I can reduce the amount of records going
into the hash join, lowering the time from 25786.264 msec to 12527.840 msec.
However, this is where my question comes in, this reduction is completely
negated by the cost of appending both results in the UNION ALL command.
I was wondering if this is normal behaviour.
In my mind, I wouldn't expect appending 2 resultsets to have such a
relative huge cost associated with it.
This is also why I asked what exactly a UNION ALL does to achieve its
functionality, to perhaps gain some insight in its cost.
With kind regards,
Mark
On Thu, 15 Aug 2019 at 21:22, Ibrar Ahmed wrote:
>
>
> On Fri, Aug 16, 2019 at 12:16 AM <066ce...@free.fr> wrote:
>
>> Generally speaking, when executing UNION ; a DISTINCT is run afterward on
>> the resultset.
>>
>> So, if you're sure that each part of UNION cannot return a line returned
>> by another one, you may use UNION ALL, you'll cut the cost of the final
>> implicit DISTINCT.
>>
>>
>> - Mail original -
>> De: "Mark Pasterkamp"
>> À: pgsql-hackers@lists.postgresql.org
>> Envoyé: Jeudi 15 Août 2019 20:37:06
>> Objet: UNION ALL
>>
>>
>> Dear all,
>>
>>
>> I was wondering if someo