Re: Slow query and wrong row estimates for CTE

Dane Foster Wed, 17 Feb 2021 12:33:18 -0800

On Wed, Feb 17, 2021 at 1:37 PM Yoan SULTAN <yey...@gmail.com> wrote:


> *Hi all, *
>
> *This is my first post on this mailing list, I really enjoy it.*
> *I wanted to add some details and answers to this disccusion.*
>
I'm happy you've decided to join the conversation and about the fact that
you've opened up an entirely new avenue for me to investigate and learn
from. I feel like I'm about to level up my SQL-fu! 😊


>  17 févr. 2021 à 17:52, Dane Foster <studdu...@gmail.com> a écrit :
>
>>
>> A small update (see below/inline).
>>
>>
>> On Tue, Feb 16, 2021 at 2:11 PM Dane Foster <studdu...@gmail.com> wrote:
>>
>>> Short conclusion:
>>> Switching from CTEs to temporary tables and analyzing reduced the
>>> runtime from 15 minutes to about 1.5 minutes.
>>>
>>> *The attempt_scores table is pretty big and it is called 3 times; try to
> rewrite the query in order to reduce call to this table. for example :*
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *EXPLAIN (ANALYZE, BUFFERS)WITH reports AS (  SELECT student_id,
> assignment_id, max(score_value) FILTER (WHERE score_name =
> 'student_performance_index'),
> max(attempt_report_id) maxid,
> max(score_value) spi  FROM attempt_scores  GROUP BY student_id,
> assignment_id  HAVING max(score_value) > 0 AND max(score_value) FILTER
> (WHERE score_name = 'student_performance_index') = max(score_value))SELECT
> avg(spi) spi,  avg(CASE score_name WHEN 'digital_clinical_experience' THEN
> score_value END) dce,  avg(CASE score_name WHEN
> 'tier1_subjective_data_collection' THEN score_value END) sdcFROM
> attempt_scores JOIN reports ON
> reports.maxid=attempt_scores.attempt_report_id;*
>
Given: HAVING max(score_value) > 0 AND max(score_value) FILTER (WHERE
score_name = 'student_performance_index') = max(score_value)

Why: max(score_value) FILTER (WHERE score_name =
'student_performance_index') but no FILTER clause on:
max(attempt_report_id)?

Some context for my question. I'm new to aggregate expressions therefore I
don't have a strong mental model for what's happening. So let me tell you
what I *think* is happening and you can correct me.

The new HAVING clause that you've added ensures that for each
student/assignment pair/group that we are selecting the max spi value
(i.e., score_name = 'student_performance_index'). Therefore, isn't the
FILTER clause in the SELECT section redundant? And if it's *not* redundant
then why isn't it necessary for: max(attempt_report_id)?


*Also, I would continue to increase work_mem to 200MB until the external
> merge is not required.*
> *SET WORK_MEM='200MB'; -- to change only at session level*
>
>>
>>> Longer conclusion:
>>>
>>> @Justin Pryzby <pry...@telsasoft.com>
>>>
>>>    - I experimented w/ materializing the CTEs and it helped at the
>>>    margins but did not significantly contribute to a reduction in runtime.
>>>    - No clustering was required because once I switched to temporary
>>>    tables the new plan no longer used the for_upsert index.
>>>
>>> @Michael Lewis <mle...@entrata.com>
>>>
>>>    - Increasing work_mem to 100MB (up from 4MB) helped at the margins
>>>    (i.e., some 100's of millisecond improvement) but did not represent a
>>>    significant reduction in the runtime.
>>>    - It wasn't obvious to me which window function would be appropriate
>>>    for the problem I was trying to solve therefore I didn't experiment w/ 
>>> that
>>>    approach.
>>>
>>> I want to update/correct this statement:
>>
>>>
>>>    - The selectivity of score_name='student_performance_index' was not
>>>    enough for the planner to choose an index over doing a FTS.
>>>
>>> I added a partial index (WHERE
>> score_name='student_performance_index'::citext) and that had a *dramatic*
>> impact. That part of the query went from ~12 seconds to ~1 second.
>>
>
> *Another way to generate perf. gains on this query,  CREATE HASH INDEX ON
> attempt_scores(score_name); --since score_name doesn't seem to have a big
> cardinality*
>
> Finally, thank you both for helping me bring this poor performing query to
>>> heel. Your insights were helpful and greatly appreciated.
>>>
>>> Sincerely,
>>>
>>> Dane
>>>
>>>
>>> On Tue, Feb 16, 2021 at 10:25 AM Dane Foster <studdu...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> On Tue, Feb 16, 2021 at 10:13 AM Michael Lewis <mle...@entrata.com>
>>>> wrote:
>>>>
>>>>>                                                    Sort Method:
>>>>>> external merge  Disk: 30760kB
>>>>>>                                                    Worker 0:  Sort
>>>>>> Method: external merge  Disk: 30760kB
>>>>>>                                                    Worker 1:  Sort
>>>>>> Method: external merge  Disk: 30760kB
>>>>>>
>>>>>
>>>>> If you can increase work_mem, even setting it temporarily higher for
>>>>> the session or transaction, that may dramatically change the plan.
>>>>>
>>>> I will try increasing work_mem for the session later today.
>>>>
>>>>> The advice given by Justin particularly about row estimates would be
>>>>> wise to pursue.
>>>>>
>>>>
>>>>
>>>>> I'd wonder how selective that condition of score_name =
>>>>> 'student_performance_index' is in filtering out many of the 9.3
>>>>> million tuples in that table and if an index with that as the leading
>>>>> column, or just an index on that column would be helpful.
>>>>>
>>>> There are 1,206,355 rows where score_name='student_performance_idex'.
>>>>
>>>>> You'd need to look at pg_stats for the table and see how many distinct
>>>>> values, and if student_performance_index is relatively high or low (or not
>>>>> present) in the MCVs list.
>>>>>
>>>> I will look into that.
>>>>
>>>>
>>>>> I am not sure if your query does what you want it to do as I admit I
>>>>> didn't follow your explanation of the desired behavior. My hunch is that
>>>>> you want to make use of a window function and get rid of one of the CTEs.
>>>>>
>>>> If you could tell me what part(s) are unclear I would appreciate it so
>>>> that I can write a better comment.
>>>>
>>>> Thank you sooo much for all the feedback. It is *greatly* appreciated!
>>>> Sincerely,
>>>>
>>>> Dane
>>>>
>>>>
>
> --
> Regards,
> Yo.
>
Again, thanks for joining the conversation. I look forward to hearing from
you.

Sincerely,

Dane

Re: Slow query and wrong row estimates for CTE

Reply via email to