Hi,
> That's helpful, thanks!
>
> One thing to note is that postgres' work_mem is confusing - it applies to
> individual query execution nodes, not the whole query. Additionally, when you
> use parallel workers, each of the parallel workers can use the "full"
> work_mem, rather than work_mem being evenly distributed across workers.
>
> Within that, the memory usage in the EXPLAIN ANALYZE isn't entirely unexpected
> (I'd say it's unreasonable if you're not a postgres dev, but that's a
> different issue).
>
> We can see each of the Hash nodes use ~1GB, which is due to
> (1 leader + 4 workers) * work_mem = 5 * 200MB = 1GB.
>
> In this specific query we probably could free the memory in the "lower" hash
> join nodes once the node directly above has finished building, but we don't
> have the logic for that today.

I would understand 1 GB, or even 2 GB (considering hash_mem_multiplier), but
the server ran out of memory after using more than 64 GB.

> Of course that doesn't explain why the memory usage / temp file creation is so
> extreme with a lower limit / fewer workers. There aren't any bad statistics
> afaics, nor can I really see a potential for a significant skew, it looks to
> me that the hashtables are all built on a single text field and that nearly
> all the input rows are distinct.
>
> Could you post the table definition (\d+) and the database definition
> (\l). One random guess I have is that you ended up with a "non-deterministic"
> collation for that column and that we end up with a bad hash due to that.

I was able to eliminate the columns not involved in the query while still
retaining the problematic behavior (that's the reason for the new table names):

postgres=# \d inreq
                                 Table "public.inreq"
         Column           |            Type             | Collation | Nullable | Default
--------------------------+-----------------------------+-----------+----------+---------
 input_sequence           | bigint                      |           | not null |
 msg_type                 | character varying(8)        |           |          |
 originalrequest_id       | bigint                      |           |          |
 receive_time             | timestamp without time zone |           |          |
 related_output_sequence  | bigint                      |           |          |
 msg_status               | character varying(15)       |           |          |
Indexes:
    "inreq_pkey" PRIMARY KEY, btree (input_sequence)
    "inreq_originalrequest_id_idx" btree (originalrequest_id)

postgres=# \d outreq
                      Table "public.outreq"
         Column         |  Type  | Collation | Nullable | Default
------------------------+--------+-----------+----------+---------
 output_sequence        | bigint |           | not null |
 input_sequence         | bigint |           |          |
 reply_input_sequence   | bigint |           |          |
 related_input_sequence | bigint |           |          |
Indexes:
    "outreq_pkey" PRIMARY KEY, btree (output_sequence)
    "outreq_input_sequence_idx" btree (input_sequence)
    "outreq_reply_input_sequence_idx" btree (reply_input_sequence)

postgres=# SELECT datname, datcollate FROM pg_database WHERE datname = current_database();
 datname  | datcollate
----------+-------------
 postgres | en_US.UTF-8
(1 row)
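Since the \d output shows no per-column collation and en_US.UTF-8 is a libc
collation (libc collations are deterministic), I don't believe a
non-deterministic collation is involved here. For completeness, a query along
these lines (assuming PostgreSQL 12 or later, where pg_collation has the
collisdeterministic flag) should list any non-deterministic collations defined
in the database:

postgres=# SELECT collname, collprovider
postgres-#   FROM pg_collation
postgres-#  WHERE NOT collisdeterministic;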
A dump of the two tables above can be found at
https://drive.google.com/file/d/1ep1MYjNzlFaICL3GlPZaMdpOxRG6WCGz/view?usp=sharing
(compressed size 1 GB; size of the database after import 14 GB).

# prepare my_query (timestamp, bigint) as
SELECT t.input_sequence,
       rec_tro.output_sequence,
       r.input_sequence,
       rpl_rec_tro.output_sequence,
       rpl_snd_tro.output_sequence,
       snd_tro.output_sequence,
       t.msg_type
  FROM inreq t
  LEFT JOIN outreq rec_tro     ON rec_tro.input_sequence = t.input_sequence
  LEFT JOIN inreq  r           ON r.originalRequest_id = t.input_sequence
  LEFT JOIN outreq rpl_rec_tro ON rpl_rec_tro.input_sequence = r.input_sequence
  LEFT JOIN outreq rpl_snd_tro ON rpl_snd_tro.reply_input_sequence = r.input_sequence
  LEFT JOIN outreq snd_tro     ON snd_tro.reply_input_sequence = t.input_sequence
 WHERE t.receive_time < $1
   AND t.input_sequence < $2
   AND t.msg_status IN ('COMPLETED', 'REJECTED')
 ORDER BY t.msg_status DESC, t.input_sequence;

# EXPLAIN (ANALYZE, BUFFERS) EXECUTE my_query('2024-05-18 00:00:00', 202406020168279904);

Best Regards,
Radu
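P.S. For reference, the worst case I would expect per Hash node from the
explanation above, taking hash_mem_multiplier = 2.0 (the value implied by my
2 GB figure):

    (1 leader + 4 workers) * work_mem * hash_mem_multiplier
        = 5 * 200MB * 2.0
        = 2 GB

which is still far below the more than 64 GB the server actually consumed.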