Hello, I'm seeking guidance on how to improve the performance of a slow query, and I'd like some other sets of eyes to confirm that what I wrote does what I intend.
According to the PostgreSQL wiki there is a set of metadata that I should provide to help you help me, so let's begin there.

PostgreSQL version: PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit

Full table and index schema:

CREATE TABLE attempt_scores (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    attempt_report_id bigint NOT NULL,
    score_value double precision NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    attempt_report_updated_at timestamptz NOT NULL,
    student_id int NOT NULL,
    course_id int NOT NULL,
    assignment_id int NOT NULL,
    score_name citext NOT NULL CHECK (length(trim(score_name)) > 0),
    attempted_by citext NOT NULL CHECK (length(trim(attempted_by)) > 0),
    CONSTRAINT for_upsert UNIQUE (attempt_report_id, score_name)
);

CREATE INDEX ON attempt_scores (attempt_report_updated_at);

COMMENT ON TABLE attempt_scores IS
    $$The collection of assignment scores extracted from the LMS database.$$;
COMMENT ON COLUMN attempt_scores.attempt_report_id IS
    $$Each assignment attempt has an associated attempt report (attempt_reports) where the scores of the attempt are recorded. This column is the pk value from that table.$$;
COMMENT ON COLUMN attempt_scores.score_value IS $$The score's value.$$;
COMMENT ON COLUMN attempt_scores.created_at IS $$The timestamp the record was created.$$;
COMMENT ON COLUMN attempt_scores.student_id IS $$The student's ID.$$;
COMMENT ON COLUMN attempt_scores.course_id IS $$The course's primary key in the LMS database.$$;
COMMENT ON COLUMN attempt_scores.assignment_id IS $$The assignment's primary key in the LMS database.$$;
COMMENT ON COLUMN attempt_scores.score_name IS $$The source/name of the score.$$;
COMMENT ON COLUMN attempt_scores.attempted_by IS 'The users.role column in LMS';
COMMENT ON COLUMN attempt_scores.attempt_report_updated_at IS
    $$The timestamp value of attempt_reports.updated_at on the LMS side. We use it to find new rows added since the last time we exported to Salesforce.$$;

Table metadata:

    relname     | relpages | reltuples | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size
----------------+----------+-----------+---------------+---------+----------+----------------+------------+---------------
 attempt_scores |   130235 |   9352640 |             0 | r       |       10 | f              | [NULL]     |    1067180032

Other context: the PostgreSQL database is an Amazon RDS instance.

Next up is the query and a description of what it's supposed to do.

-- What this query is supposed to do is compute averages for a set of scoring/learning metrics, but it's not so
-- straightforward. There is an umbrella metric that summarises the others, called the student performance index (SPI),
-- and the folks who want this info want the averages to be driven by the SPI. So the basic algorithm is: for each
-- student/assignment pair, find the attempt that has the highest SPI, then use that attempt's report to collect and
-- average the component metrics.
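To make that selection rule concrete before the real query, here is a toy, self-contained illustration. The values and the little CTE are invented for this example only; they are not from the real data:

-- Two attempts for one student/assignment pair; the intended pick is
-- attempt_report_id 102, the attempt whose SPI (0.71) is the highest.
WITH toy(attempt_report_id, score_name, score_value) AS (
    VALUES (101, 'student_performance_index',   0.62),
           (101, 'digital_clinical_experience', 70.0),
           (102, 'student_performance_index',   0.71),
           (102, 'digital_clinical_experience', 85.0)
)
SELECT attempt_report_id  -- returns 102
FROM toy
WHERE score_name = 'student_performance_index'
ORDER BY score_value DESC
LIMIT 1;

So attempt 102's component scores (the 85, etc.) are what should feed the averages, and attempt 101 is ignored entirely. If two attempts tie on SPI, the real query below is meant to keep the one with the largest attempt_report_id.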
EXPLAIN (ANALYZE, BUFFERS)
WITH max_spi AS (
    SELECT student_id, assignment_id, max(score_value) spi
    FROM attempt_scores
    WHERE score_name = 'student_performance_index'
    GROUP BY student_id, assignment_id
    HAVING max(score_value) > 0
), reports AS (
    SELECT max(attempt_report_id) attempt_report_id, max(score_value) spi
    FROM max_spi m NATURAL JOIN attempt_scores
    WHERE score_value = m.spi
    GROUP BY student_id, assignment_id
)
SELECT avg(spi) spi,
       avg(CASE score_name WHEN 'digital_clinical_experience' THEN score_value END) dce,
       avg(CASE score_name WHEN 'tier1_subjective_data_collection' THEN score_value END) sdc
FROM reports NATURAL JOIN attempt_scores;

Finally, the EXPLAIN output and a link to a visualization of it.

                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=672426.02..672426.03 rows=1 width=24) (actual time=903359.923..903368.957 rows=1 loops=1)
   Buffers: shared hit=6167172 read=4199539, temp read=99551 written=99678
   I/O Timings: read=839121.853
   ->  Nested Loop  (cost=672389.80..672425.91 rows=8 width=37) (actual time=36633.920..885232.956 rows=7034196 loops=1)
         Buffers: shared hit=6167172 read=4199539, temp read=99551 written=99678
         I/O Timings: read=839121.853
         ->  GroupAggregate  (cost=672389.37..672389.39 rows=1 width=24) (actual time=36628.945..41432.502 rows=938244 loops=1)
               Group Key: attempt_scores_2.student_id, attempt_scores_2.assignment_id
               Buffers: shared hit=191072 read=329960, temp read=99551 written=99678
               I/O Timings: read=18210.866
               ->  Sort  (cost=672389.37..672389.37 rows=1 width=24) (actual time=36628.111..39562.421 rows=2500309 loops=1)
                     Sort Key: attempt_scores_2.student_id, attempt_scores_2.assignment_id
                     Sort Method: external merge  Disk: 83232kB
                     Buffers: shared hit=191072 read=329960, temp read=99551 written=99678
                     I/O Timings: read=18210.866
                     ->  Gather  (cost=425676.58..672389.36 rows=1 width=24) (actual time=25405.650..34716.694 rows=2500309 loops=1)
                           Workers Planned: 2
                           Workers Launched: 2
                           Buffers: shared hit=191072 read=329960, temp read=78197 written=78260
                           I/O Timings: read=18210.866
                           ->  Hash Join  (cost=424676.58..671389.26 rows=1 width=24) (actual time=25169.930..34121.825 rows=833436 loops=3)
                                 Hash Cond: ((attempt_scores_1.student_id = attempt_scores_2.student_id) AND (attempt_scores_1.assignment_id = attempt_scores_2.assignment_id) AND (attempt_scores_1.score_value = (max(attempt_scores_2.score_value))))
                                 Buffers: shared hit=191072 read=329960, temp read=78197 written=78260
                                 I/O Timings: read=18210.866
                                 ->  Parallel Seq Scan on attempt_scores attempt_scores_1  (cost=0.00..169204.33 rows=3896933 width=24) (actual time=0.013..5775.887 rows=3118127 loops=3)
                                       Buffers: shared hit=41594 read=88641
                                       I/O Timings: read=14598.128
                                 ->  Hash  (cost=419397.94..419397.94 rows=235808 width=16) (actual time=25160.038..25160.555 rows=938244 loops=3)
                                       Buckets: 131072 (originally 131072)  Batches: 16 (originally 4)  Memory Usage: 3786kB
                                       Buffers: shared hit=149408 read=241311, temp read=15801 written=27438
                                       I/O Timings: read=3610.261
                                       ->  GroupAggregate  (cost=392800.28..417039.86 rows=235808 width=16) (actual time=23175.268..24589.121 rows=938244 loops=3)
                                             Group Key: attempt_scores_2.student_id, attempt_scores_2.assignment_id
                                             Filter: (max(attempt_scores_2.score_value) > '0'::double precision)
                                             Rows Removed by Filter: 2908
                                             Buffers: shared hit=149408 read=241311, temp read=15801 written=15864
                                             I/O Timings: read=3610.261
                                             ->  Sort  (cost=392800.28..395879.64 rows=1231743 width=16) (actual time=23174.917..23654.979 rows=1206355 loops=3)
                                                   Sort Key: attempt_scores_2.student_id, attempt_scores_2.assignment_id
                                                   Sort Method: external merge  Disk: 30760kB
                                                   Worker 0:  Sort Method: external merge  Disk: 30760kB
                                                   Worker 1:  Sort Method: external merge  Disk: 30760kB
                                                   Buffers: shared hit=149408 read=241311, temp read=15801 written=15864
                                                   I/O Timings: read=3610.261
                                                   ->  Seq Scan on attempt_scores attempt_scores_2  (cost=0.00..247143.00 rows=1231743 width=16) (actual time=16980.832..21585.313 rows=1206355 loops=3)
                                                         Filter: (score_name = 'student_performance_index'::citext)
                                                         Rows Removed by Filter: 8148027
                                                         Buffers: shared hit=149394 read=241311
                                                         I/O Timings: read=3610.261
         ->  Index Scan using for_upsert on attempt_scores  (cost=0.43..36.42 rows=8 width=37) (actual time=0.394..0.896 rows=7 loops=938244)
               Index Cond: (attempt_report_id = (max(attempt_scores_1.attempt_report_id)))
               Buffers: shared hit=5976100 read=3869579
               I/O Timings: read=820910.987
 Planning Time: 14.357 ms
 Execution Time: 903426.284 ms
(55 rows)

Time: 903571.689 ms (15:03.572)

The explain output visualized on explain.depesz.com: https://explain.depesz.com/s/onPZ

Thank you for your time and consideration,

Dane
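P.S. While composing this I sketched an alternative formulation using DISTINCT ON, in case it helps frame suggestions. I have not verified that it is equivalent or any faster, so please treat it as a sketch rather than a working replacement; in particular I'm not certain it breaks SPI ties exactly the way my original max(attempt_report_id) approach does:

WITH best AS (
    -- Per student/assignment pair, keep the single attempt with the highest
    -- SPI; ties broken by taking the largest attempt_report_id.
    SELECT DISTINCT ON (student_id, assignment_id)
           attempt_report_id, score_value AS spi
    FROM attempt_scores
    WHERE score_name = 'student_performance_index'
      AND score_value > 0
    ORDER BY student_id, assignment_id, score_value DESC, attempt_report_id DESC
)
SELECT avg(spi) spi,
       avg(CASE score_name WHEN 'digital_clinical_experience' THEN score_value END) dce,
       avg(CASE score_name WHEN 'tier1_subjective_data_collection' THEN score_value END) sdc
FROM best
JOIN attempt_scores USING (attempt_report_id);

I've also wondered whether a partial index along the lines of

CREATE INDEX ON attempt_scores (student_id, assignment_id, score_value DESC)
    WHERE score_name = 'student_performance_index';

would let the first step skip the large majority of rows that aren't SPI scores (the plan shows roughly 8.1M of 9.3M rows being filtered out), but I haven't tested that either.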