On Wed, Feb 17, 2021 at 1:37 PM Yoan SULTAN <yey...@gmail.com> wrote:
> *Hi all, * > > *This is my first post on this mailing list, I really enjoy it.* > *I wanted to add some details and answers to this disccusion.* > I'm happy you've decided to join the conversation and about the fact that you've opened up an entirely new avenue for me to investigate and learn from. I feel like I'm about to level up my SQL-fu! 😊 > 17 févr. 2021 à 17:52, Dane Foster <studdu...@gmail.com> a écrit : > >> >> A small update (see below/inline). >> >> >> On Tue, Feb 16, 2021 at 2:11 PM Dane Foster <studdu...@gmail.com> wrote: >> >>> Short conclusion: >>> Switching from CTEs to temporary tables and analyzing reduced the >>> runtime from 15 minutes to about 1.5 minutes. >>> >>> *The attempt_scores table is pretty big and it is called 3 times; try to > rewrite the query in order to reduce call to this table. for example :* > > > > > > > > > > > > > > > *EXPLAIN (ANALYZE, BUFFERS)WITH reports AS ( SELECT student_id, > assignment_id, max(score_value) FILTER (WHERE score_name = > 'student_performance_index'), > max(attempt_report_id) maxid, > max(score_value) spi FROM attempt_scores GROUP BY student_id, > assignment_id HAVING max(score_value) > 0 AND max(score_value) FILTER > (WHERE score_name = 'student_performance_index') = max(score_value))SELECT > avg(spi) spi, avg(CASE score_name WHEN 'digital_clinical_experience' THEN > score_value END) dce, avg(CASE score_name WHEN > 'tier1_subjective_data_collection' THEN score_value END) sdcFROM > attempt_scores JOIN reports ON > reports.maxid=attempt_scores.attempt_report_id;* > Given: HAVING max(score_value) > 0 AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value) Why: max(score_value) FILTER (WHERE score_name = 'student_performance_index') but no FILTER clause on: max(attempt_report_id)? Some context for my question. I'm new to aggregate expressions therefore I don't have a strong mental model for what's happening. So let me tell you what I *think* is happening and you can correct me. The new HAVING clause that you've added ensures that for each student/assignment pair/group that we are selecting the max spi value (i.e., score_name = 'student_performance_index'). Therefore, isn't the FILTER clause in the SELECT section redundant? And if it's *not* redundant then why isn't it necessary for: max(attempt_report_id)? *Also, I would continue to increase work_mem to 200MB until the external > merge is not required.* > *SET WORK_MEM='200MB'; -- to change only at session level* > >> >>> Longer conclusion: >>> >>> @Justin Pryzby <pry...@telsasoft.com> >>> >>> - I experimented w/ materializing the CTEs and it helped at the >>> margins but did not significantly contribute to a reduction in runtime. >>> - No clustering was required because once I switched to temporary >>> tables the new plan no longer used the for_upsert index. >>> >>> @Michael Lewis <mle...@entrata.com> >>> >>> - Increasing work_mem to 100MB (up from 4MB) helped at the margins >>> (i.e., some 100's of millisecond improvement) but did not represent a >>> significant reduction in the runtime. >>> - It wasn't obvious to me which window function would be appropriate >>> for the problem I was trying to solve therefore I didn't experiment w/ >>> that >>> approach. >>> >>> I want to update/correct this statement: >> >>> >>> - The selectivity of score_name='student_performance_index' was not >>> enough for the planner to choose an index over doing a FTS. >>> >>> I added a partial index (WHERE >> score_name='student_performance_index'::citext) and that had a *dramatic* >> impact. That part of the query went from ~12 seconds to ~1 second. >> > > *Another way to generate perf. gains on this query, CREATE HASH INDEX ON > attempt_scores(score_name); --since score_name doesn't seem to have a big > cardinality* > > Finally, thank you both for helping me bring this poor performing query to >>> heel. Your insights were helpful and greatly appreciated. >>> >>> Sincerely, >>> >>> Dane >>> >>> >>> On Tue, Feb 16, 2021 at 10:25 AM Dane Foster <studdu...@gmail.com> >>> wrote: >>> >>>> >>>> On Tue, Feb 16, 2021 at 10:13 AM Michael Lewis <mle...@entrata.com> >>>> wrote: >>>> >>>>> Sort Method: >>>>>> external merge Disk: 30760kB >>>>>> Worker 0: Sort >>>>>> Method: external merge Disk: 30760kB >>>>>> Worker 1: Sort >>>>>> Method: external merge Disk: 30760kB >>>>>> >>>>> >>>>> If you can increase work_mem, even setting it temporarily higher for >>>>> the session or transaction, that may dramatically change the plan. >>>>> >>>> I will try increasing work_mem for the session later today. >>>> >>>>> The advice given by Justin particularly about row estimates would be >>>>> wise to pursue. >>>>> >>>> >>>> >>>>> I'd wonder how selective that condition of score_name = >>>>> 'student_performance_index' is in filtering out many of the 9.3 >>>>> million tuples in that table and if an index with that as the leading >>>>> column, or just an index on that column would be helpful. >>>>> >>>> There are 1,206,355 rows where score_name='student_performance_idex'. >>>> >>>>> You'd need to look at pg_stats for the table and see how many distinct >>>>> values, and if student_performance_index is relatively high or low (or not >>>>> present) in the MCVs list. >>>>> >>>> I will look into that. >>>> >>>> >>>>> I am not sure if your query does what you want it to do as I admit I >>>>> didn't follow your explanation of the desired behavior. My hunch is that >>>>> you want to make use of a window function and get rid of one of the CTEs. >>>>> >>>> If you could tell me what part(s) are unclear I would appreciate it so >>>> that I can write a better comment. >>>> >>>> Thank you sooo much for all the feedback. It is *greatly* appreciated! >>>> Sincerely, >>>> >>>> Dane >>>> >>>> > > -- > Regards, > Yo. > Again, thanks for joining the conversation. I look forward to hearing from you. Sincerely, Dane