You are totally right, the max(score_value) FILTER (WHERE score_name =
'student_performance_index') in the SELECT clause is redundant.
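
To illustrate, here is a sketch of the reports CTE with that redundant
aggregate removed (same table and column names as in the query quoted below;
untested):

  WITH reports AS (
    SELECT student_id,
           assignment_id,
           max(attempt_report_id) AS maxid,
           max(score_value)       AS spi
    FROM attempt_scores
    GROUP BY student_id, assignment_id
    HAVING max(score_value) > 0
       AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value)
  )
  SELECT ...  -- outer SELECT unchanged from the query quoted below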

On Wed, Feb 17, 2021 at 9:33 PM Dane Foster <studdu...@gmail.com> wrote:

> On Wed, Feb 17, 2021 at 1:37 PM Yoan SULTAN <yey...@gmail.com> wrote:
>
>> Hi all,
>>
>> This is my first post on this mailing list; I really enjoy it.
>> I wanted to add some details and answers to this discussion.
>>
> I'm happy you've decided to join the conversation and that you've opened up
> an entirely new avenue for me to investigate and learn from. I feel like I'm
> about to level up my SQL-fu! 😊
>
>
>> On Feb 17, 2021 at 5:52 PM, Dane Foster <studdu...@gmail.com> wrote:
>>
>>>
>>> A small update (see below/inline).
>>>
>>>
>>> On Tue, Feb 16, 2021 at 2:11 PM Dane Foster <studdu...@gmail.com> wrote:
>>>
>>>> Short conclusion:
>>>> Switching from CTEs to temporary tables and analyzing reduced the
>>>> runtime from 15 minutes to about 1.5 minutes.
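>>>>
>>>> (For reference, the general pattern looks like the following; the temp
>>>> table name and column list here are placeholders, not the real query:)
>>>>
>>>>   CREATE TEMPORARY TABLE tmp_reports AS
>>>>   SELECT student_id, assignment_id, max(score_value) AS spi
>>>>   FROM attempt_scores
>>>>   GROUP BY student_id, assignment_id;
>>>>
>>>>   ANALYZE tmp_reports;  -- give the planner fresh statistics on the temp table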
>>>>
>> The attempt_scores table is pretty big and it is referenced 3 times; try
>> to rewrite the query to reduce the number of passes over this table. For
>> example:
>>
>> EXPLAIN (ANALYZE, BUFFERS)
>> WITH reports AS (
>>   SELECT student_id,
>>          assignment_id,
>>          max(score_value) FILTER (WHERE score_name = 'student_performance_index'),
>>          max(attempt_report_id) maxid,
>>          max(score_value) spi
>>   FROM attempt_scores
>>   GROUP BY student_id, assignment_id
>>   HAVING max(score_value) > 0
>>      AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value)
>> )
>> SELECT avg(spi) spi,
>>        avg(CASE score_name WHEN 'digital_clinical_experience' THEN score_value END) dce,
>>        avg(CASE score_name WHEN 'tier1_subjective_data_collection' THEN score_value END) sdc
>> FROM attempt_scores
>> JOIN reports ON reports.maxid = attempt_scores.attempt_report_id;
>>
> Given: HAVING max(score_value) > 0 AND max(score_value) FILTER (WHERE
> score_name = 'student_performance_index') = max(score_value)
>
> Why is there a FILTER (WHERE score_name = 'student_performance_index')
> clause on max(score_value) but no FILTER clause on max(attempt_report_id)?
>

> Some context for my question. I'm new to aggregate expressions, so I don't
> have a strong mental model for what's happening. Let me tell you what I
> *think* is happening and you can correct me.
>
> The new HAVING clause you've added ensures that, for each student/assignment
> group, we are selecting the max spi value (i.e., score_name =
> 'student_performance_index'). Therefore, isn't the FILTER clause in the
> SELECT list redundant? And if it's *not* redundant, why isn't it necessary
> for max(attempt_report_id)?
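>
> For a concrete (made-up) example of how I currently read FILTER, the two
> aggregates below differ only in that the second one considers just the rows
> matching the FILTER condition:
>
>   SELECT student_id,
>          max(score_value) AS max_any_score,
>          max(score_value) FILTER (WHERE score_name = 'student_performance_index') AS max_spi
>   FROM attempt_scores
>   GROUP BY student_id;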
>


>
>> Also, I would keep increasing work_mem, up to 200MB, until the external
>> merge is no longer required.
>> SET work_mem = '200MB'; -- changes the setting only at the session level
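>>
>> If it should only apply to a single query, one option (a sketch) is to
>> scope the setting to the transaction rather than the session:
>>
>>   BEGIN;
>>   SET LOCAL work_mem = '200MB';  -- reverts automatically at COMMIT/ROLLBACK
>>   EXPLAIN (ANALYZE, BUFFERS) ... ;  -- the query under test
>>   COMMIT;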
>>
>>>
>>>> Longer conclusion:
>>>>
>>>> @Justin Pryzby <pry...@telsasoft.com>
>>>>
>>>>    - I experimented w/ materializing the CTEs and it helped at the
>>>>    margins but did not significantly contribute to a reduction in runtime.
>>>>    - No clustering was required because once I switched to temporary
>>>>    tables the new plan no longer used the for_upsert index.
>>>>
>>>> @Michael Lewis <mle...@entrata.com>
>>>>
>>>>    - Increasing work_mem to 100MB (up from 4MB) helped at the margins
>>>>    (i.e., an improvement of a few hundred milliseconds) but did not
>>>>    represent a significant reduction in the runtime.
>>>>    - It wasn't obvious to me which window function would be appropriate
>>>>    for the problem I was trying to solve, so I didn't experiment w/ that
>>>>    approach.
>>>>
>>>> I want to update/correct this statement:
>>>
>>>>
>>>>    - The selectivity of score_name='student_performance_index' was not
>>>>    enough for the planner to choose an index over doing a FTS.
>>>>
>>>> I added a partial index (WHERE
>>> score_name='student_performance_index'::citext) and that had a
>>> *dramatic* impact. That part of the query went from ~12 seconds to ~1
>>> second.
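>>>
>>> For reference, the definition was roughly along these lines (the index
>>> name and the indexed columns shown here are illustrative, not the exact
>>> DDL):
>>>
>>>   CREATE INDEX attempt_scores_spi_idx
>>>       ON attempt_scores (student_id, assignment_id, score_value)
>>>    WHERE score_name = 'student_performance_index'::citext;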
>>>
>>
>> Another way to generate perf. gains on this query:
>> CREATE INDEX ON attempt_scores USING hash (score_name);
>> -- since score_name doesn't seem to have high cardinality
>>
>> Finally, thank you both for helping me bring this poor performing query
>>>> to heel. Your insights were helpful and greatly appreciated.
>>>>
>>>> Sincerely,
>>>>
>>>> Dane
>>>>
>>>>
>>>> On Tue, Feb 16, 2021 at 10:25 AM Dane Foster <studdu...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On Tue, Feb 16, 2021 at 10:13 AM Michael Lewis <mle...@entrata.com>
>>>>> wrote:
>>>>>
>>>>>>                                                    Sort Method:
>>>>>>> external merge  Disk: 30760kB
>>>>>>>                                                    Worker 0:  Sort
>>>>>>> Method: external merge  Disk: 30760kB
>>>>>>>                                                    Worker 1:  Sort
>>>>>>> Method: external merge  Disk: 30760kB
>>>>>>>
>>>>>>
>>>>>> If you can increase work_mem, even setting it temporarily higher for
>>>>>> the session or transaction, that may dramatically change the plan.
>>>>>>
>>>>> I will try increasing work_mem for the session later today.
>>>>>
>>>>>> The advice given by Justin particularly about row estimates would be
>>>>>> wise to pursue.
>>>>>>
>>>>>
>>>>>
>>>>>> I'd wonder how selective that condition of score_name =
>>>>>> 'student_performance_index' is in filtering out many of the 9.3
>>>>>> million tuples in that table and if an index with that as the leading
>>>>>> column, or just an index on that column would be helpful.
>>>>>>
>>>>> There are 1,206,355 rows where score_name='student_performance_index'.
>>>>>
>>>>>> You'd need to look at pg_stats for the table and see how many distinct
>>>>>> values there are, and whether student_performance_index is relatively
>>>>>> high or low (or not present) in the MCVs list.
>>>>>>
>>>>> I will look into that.
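>>>>>
>>>>> Presumably with something along these lines (a sketch; the table and
>>>>> column names are taken from this thread):
>>>>>
>>>>>   SELECT n_distinct, most_common_vals, most_common_freqs
>>>>>   FROM pg_stats
>>>>>   WHERE tablename = 'attempt_scores'
>>>>>     AND attname = 'score_name';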
>>>>>
>>>>>
>>>>>> I am not sure if your query does what you want it to do as I admit I
>>>>>> didn't follow your explanation of the desired behavior. My hunch is that
>>>>>> you want to make use of a window function and get rid of one of the CTEs.
>>>>>>
>>>>> If you could tell me what part(s) are unclear I would appreciate it so
>>>>> that I can write a better comment.
>>>>>
>>>>> Thank you sooo much for all the feedback. It is *greatly* appreciated!
>>>>> Sincerely,
>>>>>
>>>>> Dane
>>>>>
>>>>>
>>
>> --
>> Regards,
>> Yo.
>>
> Again, thanks for joining the conversation. I look forward to hearing from
> you.
>
> Sincerely,
>
> Dane
>


-- 
Regards,
Yo.
