Slow join

2018-06-25 Thread Roman Kushnir
Hello,

The following basic inner join is taking too much time for me. (I’m using
count(videos.id) instead of count(*) because my actual query looks different,
but I simplified it here to its essence.)
I’ve tried following random people’s suggestions and adjusting
random_page_cost (decreasing it from 4 to 1.1), without a stable improvement.
Any hints on what is wrong here? Thank you.

The query

SELECT COUNT(videos.id) FROM videos JOIN accounts ON accounts.channel_id = 
videos.channel_id;

The accounts table has 744 rows, the videos table has 2.2M rows, and the join
produces 135k rows.

Running on Amazon RDS, with default 10.1 parameters

 version
----------------------------------------------------------------------------------------------------------
 PostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit

Execution plan: https://explain.depesz.com/s/gf7
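
A plan like this can be captured with EXPLAIN (ANALYZE, BUFFERS), which
reports the actual row counts, timings, and buffer usage seen in the link
above:

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(videos.id) FROM videos
JOIN accounts ON accounts.channel_id = videos.channel_id;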


Structure and statistics of the tables involved

=> \d videos
                                           Table "public.videos"
         Column         |            Type             | Collation | Nullable |              Default
------------------------+-----------------------------+-----------+----------+------------------------------------
 id                     | bigint                      |           | not null | nextval('videos_id_seq'::regclass)
 vendor_id              | character varying           |           | not null |
 channel_id             | bigint                      |           |          |
 published_at           | timestamp without time zone |           |          |
 title                  | text                        |           |          |
 description            | text                        |           |          |
 thumbnails             | jsonb                       |           |          |
 tags                   | character varying[]         |           |          |
 category_id            | character varying           |           |          |
 default_language       | character varying           |           |          |
 default_audio_language | character varying           |           |          |
 duration               | integer                     |           |          |
 stereoscopic           | boolean                     |           |          |
 hd                     | boolean                     |           |          |
 captioned              | boolean                     |           |          |
 licensed               | boolean                     |           |          |
 projection             | character varying           |           |          |
 privacy_status         | character varying           |           |          |
 license                | character varying           |           |          |
 embeddable             | boolean                     |           |          |
 terminated_at          | timestamp without time zone |           |          |
 created_at             | timestamp without time zone |           | not null |
 updated_at             | timestamp without time zone |           | not null |
 featured_game_id       | bigint                      |           |          |
Indexes:
    "videos_pkey" PRIMARY KEY, btree (id)
    "index_videos_on_vendor_id" UNIQUE, btree (vendor_id)
    "index_videos_on_channel_id" btree (channel_id)
    "index_videos_on_featured_game_id" btree (featured_game_id)
Foreign-key constraints:
    "fk_rails_257f68ae55" FOREIGN KEY (channel_id) REFERENCES channels(id)
    "fk_rails_ce1b3e10b0" FOREIGN KEY (featured_game_id) REFERENCES games(id)
Referenced by:
    TABLE "video_fetch_statuses" CONSTRAINT "fk_rails_3bfdf013b8" FOREIGN KEY (video_id) REFERENCES videos(id)
    TABLE "video_daily_facts" CONSTRAINT "fk_rails_dc0eca9ebb" FOREIGN KEY (video_id) REFERENCES videos(id)


=> SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts,
          relhassubclass, reloptions, pg_table_size(oid)
   FROM pg_class WHERE relname='videos';

 relname | relpages |  reltuples  | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size
---------+----------+-------------+---------------+---------+----------+----------------+------------+---------------
 videos  |   471495 | 2.25694e+06 |        471389 | r       |       24 | f              |            |    4447764480


=> SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename,
          attname, n_distinct, array_length(most_common_vals,1) n_mcv,
          array_length(histogram_bounds,1) n_hist
   FROM pg_stats WHERE attname='channel_id' AND tablename='videos' ORDER BY 1 DESC;

 frac_mcv | tablename |  attname   | n_distinct | n_mcv | n_hist
----------+-----------+------------+------------+-------+--------
   0.1704 | videos    | channel_id |

Re: Slow join

2018-06-25 Thread Roman Kushnir
Hi Justin,

Thank you for your comments.

As you mentioned the size of shared buffers, my first thought was to just 
switch to a larger machine as this one only has 2 gigs of RAM. But then it 
occurred to me that the whole videos table is getting loaded into memory while 
only 2 small columns are actually used! So I created a covering index on videos 
(channel_id, id) and the query now completes in 190ms!
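
For reference, a minimal sketch of that index (the name is illustrative, as
the original message doesn't give one):

CREATE INDEX index_videos_on_channel_id_and_id ON videos (channel_id, id);

With both referenced columns in the index, the count can typically be
answered by an index-only scan instead of a full scan of the 4.4 GB heap.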

Thanks, you helped me a lot.


> On Jun 25, 2018, at 6:45 PM, Justin Pryzby  wrote:
> 
> Hi,
> 
> Thanks for providing all this info :)
> 
> On Mon, Jun 25, 2018 at 05:55:49PM +0200, Roman Kushnir wrote:
>> Hello,
>> 
>> The following basic inner join is taking too much time for me. (I’m using
>> count(videos.id) instead of count(*) because my actual query looks
>> different, but I simplified it here to its essence.)
>> I’ve tried following random people’s suggestions and adjusting
>> random_page_cost (decreasing it from 4 to 1.1), without a stable
>> improvement. Any hints on what is wrong here? Thank you.
> 
>> Running on Amazon RDS, with default 10.1 parameters
> 
> All default ?
> https://wiki.postgresql.org/wiki/Server_Configuration
> 
> It looks like nearly the entire time is spent reading this table:
> 
>   Parallel Seq Scan on videos ... (ACTUAL TIME=0.687..55,555.774...)
>   Buffers: shared hit=7138 read=464357
> 
> Perhaps shared_buffers should be at least several times larger, and perhaps up
> to 4gb to keep the entire table in RAM.  You could maybe also benefit from
> better device readahead (blockdev --setra or lvchange -r or
> /sys/block/sd?/queue/read_ahead_kb)
> 
> Also, it looks like there's a row count misestimate, which probably doesn't
> matter for the query you sent, but maybe affects your larger query:
>   Hash Join (... ROWS=365,328 ... ) (... ROWS=45,307 ... )
> 
> If that matters, maybe it'd help to increase statistics on channel_id.
> Actually, I see both tables have FK into channels.id:
> 
>>"fk_rails_11d6d9bea2" FOREIGN KEY (channel_id) REFERENCES channels(id)
>>"fk_rails_257f68ae55" FOREIGN KEY (channel_id) REFERENCES channels(id)
> 
> I don't see the definition of "channels" (and it looks like the query I put
> on the wiki doesn't show null_frac), but I think that postgres since 9.6
> should be able to infer good join statistics from the existence of the FKs.
> Maybe that only works if you actually JOIN to the channels table (?).  But
> if anything that's only a secondary problem, if at all.
> 
> Justin
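
A rough sketch of acting on the two suggestions above (the statistics target
of 1000 is illustrative; on RDS, server settings such as shared_buffers are
changed through a DB parameter group rather than postgresql.conf):

-- Check how much of the 4.4 GB table fits in cache; the plan read
-- 464357 blocks (~3.5 GB at 8 kB each) from outside shared buffers.
SHOW shared_buffers;

-- Increase the statistics sample for channel_id to improve the join estimate.
ALTER TABLE videos ALTER COLUMN channel_id SET STATISTICS 1000;
ANALYZE videos;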



Re: Slow join

2018-06-27 Thread Roman Kushnir
Hi Laurenz,

You’re right about the table being bloated; the videos.description column is
large. I thought about moving it to a separate table, but having an index on
only the columns used in the query seems to have compensated for that already.
Thank you.
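
A quick sanity check along those lines (a sketch; pg_column_size reports the
stored, possibly compressed, size of each value):

SELECT pg_size_pretty(pg_table_size('videos')) AS table_size,
       pg_size_pretty(sum(pg_column_size(description))) AS description_size
FROM videos;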

> On Jun 27, 2018, at 10:19 AM, Laurenz Albe  wrote:
> 
> Roman Kushnir wrote:
>> The following basic inner join is taking too much time for me. (I’m using
>> count(videos.id) instead of count(*) because my actual query looks
>> different, but I simplified it here to its essence.)
>> I’ve tried following random people’s suggestions and adjusting
>> random_page_cost (decreasing it from 4 to 1.1), without a stable
>> improvement. Any hints on what is wrong here? Thank you.
>> 
>> The query
>> 
>> SELECT COUNT(videos.id) FROM videos JOIN accounts ON accounts.channel_id = 
>> videos.channel_id;
>> 
>> The accounts table has 744 rows, the videos table has 2.2M rows, and the
>> join produces 135k rows.
>> 
>> Running on Amazon RDS, with default 10.1 parameters
>> 
>> version
>> -
>> PostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 
>> (Red Hat 4.8.2-16), 64-bit
>> 
>> Execution plan https://explain.depesz.com/s/gf7
> 
> Your time is spent here:
> 
>> ->  Parallel Seq Scan on videos  (cost=0.00..480898.90 rows=940390 width=16)
>>     (actual time=0.687..55555.774 rows=764042 loops=3)
>>   Buffers: shared hit=7138 read=464357
> 
> 55 seconds to scan 3.5 GB is not so bad.
> 
> What I wonder is how it is that you have only about five rows per table block.
> Could it be that the table is very bloated?
> 
> If you can, you could "VACUUM (FULL) videos" and see if that makes a 
> difference.
> If you can bring the table size down, it will speed up query performance.
> 
> Yours,
> Laurenz Albe
> -- 
> Cybertec | https://www.cybertec-postgresql.com