Postgresql OOM

2024-06-06 Thread Radu Radutiu
Hello all,

I have a query that forces an out of memory error, where the OS will kill
the postgresql process.
The query plan (run immediately after a vacuum analyze) is at
https://explain.depesz.com/s/ITQI#html .


PostgreSQL version 16.3, running on RHEL 8.9, 16 vCPU, 64 GB RAM, 32 GB swap

shared_buffers=8G
effective_cache_size=24G
maintenance_work_mem=2G
work_mem=104857kB
default_statistics_target = 100
max_worker_processes = 16
max_parallel_workers_per_gather = 4
max_parallel_workers = 16
max_parallel_maintenance_workers = 4
jit=off


It looks like the excessive memory allocation is reported in
HashSpillContext. I've attached the dump of the memory contexts for the 5
processes (query leader + 4 parallel workers), taken some time after the
query started. I also see a huge number of temporary files being created.
For the time being I've set enable_parallel_hash = 'off' and the problem
went away.
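For anyone reproducing this, the workaround is just the planner setting
below; the ALTER SYSTEM form is shown only as an illustration of an
instance-wide alternative, not necessarily how it was applied here:

SET enable_parallel_hash = off;               -- current session only
ALTER SYSTEM SET enable_parallel_hash = off;  -- instance-wide (illustrative)
SELECT pg_reload_conf();                      -- reload so the change takes effect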

I've seen a potentially similar problem reported in
https://www.postgresql.org/message-id/flat/20230516200052.sbg6z4ghcmsas3wv%40liskov#f6059259c7c9251fb8c17f5793a2d427
.


Any idea how to identify the problem? I can reproduce it on demand.
Should I report it to pgsql-bugs?

Best regards,
Radu
2024-06-05 15:58:53.806 +08 [338795] LOG:  logging memory contexts of PID 338795
2024-06-05 15:58:53.807 +08 [338795] LOG:  level: 0; TopMemoryContext: 2269696 total in 8 blocks; 33432 free (11 chunks); 2236264 used
2024-06-05 15:58:53.807 +08 [338795] LOG:  level: 1; HandleParallelMessages: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
2024-06-05 15:58:53.807 +08 [338795] LOG:  level: 1; TableSpace cache: 8192 total in 1 blocks; 2128 free (0 chunks); 6064 used
2024-06-05 15:58:53.807 +08 [338795] LOG:  level: 1; TopTransactionContext: 8192 total in 1 blocks; 5736 free (0 chunks); 2456 used
2024-06-05 15:58:53.808 +08 [338795] LOG:  level: 1; Prepared Queries: 16384 total in 2 blocks; 6696 free (3 chunks); 9688 used
2024-06-05 15:58:53.808 +08 [338795] LOG:  level: 1; Type information cache: 24368 total in 2 blocks; 2648 free (0 chunks); 21720 used
2024-06-05 15:58:53.809 +08 [338795] LOG:  level: 1; Operator lookup cache: 24576 total in 2 blocks; 10792 free (3 chunks); 13784 used
2024-06-05 15:58:53.809 +08 [338795] LOG:  level: 1; RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
2024-06-05 15:58:53.809 +08 [338795] LOG:  level: 1; MessageContext: 8192 total in 1 blocks; 2256 free (0 chunks); 5936 used
2024-06-05 15:58:53.809 +08 [338795] LOG:  level: 1; Operator class cache: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
2024-06-05 15:58:53.810 +08 [338795] LOG:  level: 1; smgr relation table: 32768 total in 3 blocks; 16880 free (8 chunks); 15888 used
2024-06-05 15:58:53.810 +08 [338795] LOG:  level: 1; PgStat Shared Ref Hash: 7216 total in 2 blocks; 688 free (0 chunks); 6528 used
2024-06-05 15:58:53.810 +08 [338795] LOG:  level: 1; PgStat Shared Ref: 8192 total in 4 blocks; 3704 free (4 chunks); 4488 used
2024-06-05 15:58:53.811 +08 [338795] LOG:  level: 1; PgStat Pending: 16384 total in 5 blocks; 3288 free (10 chunks); 13096 used
2024-06-05 15:58:53.811 +08 [338795] LOG:  level: 1; TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 used
2024-06-05 15:58:53.811 +08 [338795] LOG:  level: 1; Portal hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
2024-06-05 15:58:53.812 +08 [338795] LOG:  level: 1; TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
2024-06-05 15:58:53.812 +08 [338795] LOG:  level: 2; PortalHoldContext: 24624 total in 2 blocks; 7400 free (0 chunks); 17224 used
2024-06-05 15:58:53.823 +08 [338795] LOG:  level: 2; PortalContext: 527776 total in 83 blocks; 3760 free (6 chunks); 524016 used:
2024-06-05 15:58:53.824 +08 [338795] LOG:  level: 3; ExecutorState: 58826880 total in 19 blocks; 3932992 free (11 chunks); 54893888 used
2024-06-05 15:58:53.824 +08 [338795] LOG:  level: 4; HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
2024-06-05 15:58:53.825 +08 [338795] LOG:  level: 5; HashSpillContext: 8192 total in 1 blocks; 4008 free (9 chunks); 4184 used
2024-06-05 15:58:53.825 +08 [338795] LOG:  level: 5; HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
2024-06-05 15:58:53.825 +08 [338795] LOG:  level: 4; HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
2024-06-05 15:58:53.840 +08 [338795] LOG:  level: 5; HashSpillContext: 339738672 total in 47 blocks; 3740880 free (163341 chunks); 335997792 used
2024-06-05 15:58:53.840 +08 [338795] LOG:  level: 5; HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
2024-06-05 15:58:53.841 +08 [338795] LOG:  level: 4; HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
2024-06-05 15:58:53.841 +08 [338795] LOG:  level: 5; HashSpillContext: 8192 total in 1 blocks; 4720 free (0 chunks); 3472 used
2024-06-05 15:58:53.841 +08 [338795] LOG:  level: 5; HashBatch
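For reference, a context dump like the one above can be requested from a
running backend with the built-in pg_log_backend_memory_contexts() function
(the PID below is the one from the log lines); the output goes to the server
log:

SELECT pg_log_backend_memory_contexts(338795);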

Re: Postgresql OOM

2024-06-06 Thread Radu Radutiu
> I am not qualified to answer on the OOM issue but why are you joining the
> same table (outputrequest) 4 times (using an identical join condition)?
> This essentially does a cross join: if an input_sequence value has, say,
> 1000 related rows in outputrequest, you will be getting 1000^4 rows in the
> result set.

The query itself runs fine in a reasonable time with enable_parallel_hash =
'off'. I see two problems: one is the wrong execution plan (chosen right
after running analyze); the second, and more important, is the huge memory
usage (far exceeding work_mem and shared_buffers) leading to OOM.
See https://explain.depesz.com/s/yAqS for the explain plan
with enable_parallel_hash = 'off'.


Re: Postgresql OOM

2024-06-07 Thread Radu Radutiu
>
>
>
>> The planner should recognize this situation and avoid use of hash
>> join in such cases, but maybe the statistics aren't reflecting the
>> problem, or maybe there's something wrong with the logic specific
>> to parallel hash join.  You've not really provided enough information
>> to diagnose why the poor choice of plan.
>>
>> regards, tom lane
>>
>
> Thanks for looking into this. I'm not sure what information would be
> needed to look at the choice of plan.
> The statistics for the join conditions in the query would be:
>  join_condition | min_count | max_count |       avg_count
> ----------------+-----------+-----------+------------------------
>  snd_tro        |         0 |         0 | 0.00000000000000000000
>  rpl_rec_tro    |         0 |         2 | 0.99869222814474470477
>  rec_tro        |         0 |         2 | 0.99869222814474470477
>  rpl_snd_tro    |         0 |         0 | 0.00000000000000000000
>  r              |         0 |         1 | 0.49850916663490161653
>
>
> The relevant columns for the tables are:
> postgres=# \d inputrequest
>                              Table "public.inputrequest"
>          Column           |            Type             | Collation | Nullable | Default
> --------------------------+-----------------------------+-----------+----------+---------
>  input_sequence           | bigint                      |           | not null |
>  msg_type                 | character varying(8)        |           | not null |
>  msg_content              | text                        |           | not null |
>  msg_reference            | character varying(35)       |           |          |
>  originalrequest_id       | bigint                      |           |          |
>  receive_time             | timestamp without time zone |           | not null |
>  related_output_sequence  | bigint                      |           |          |
>  msg_status               | character varying(15)       |           |          |
> Indexes:
>     "inputrequest_pkey" PRIMARY KEY, btree (input_sequence)
>     "inputrequest_originalrequest_id_idx" btree (originalrequest_id)
>
> postgres=# \d outputrequest
>                        Table "public.outputrequest"
>          Column          |          Type          | Collation | Nullable | Default
> -------------------------+------------------------+-----------+----------+---------
>  output_sequence         | bigint                 |           | not null |
>  input_sequence          | bigint                 |           |          |
>  msg_type                | character varying(8)   |           |          |
>  msg_content             | text                   |           | not null |
>  msg_reference           | character varying(35)  |           |          |
>  reply_input_sequence    | bigint                 |           |          |
>  status                  | integer                |           | not null |
>  related_input_sequence  | bigint                 |           |          |
> Indexes:
>     "outputrequest_pkey" PRIMARY KEY, btree (output_sequence)
>     "outputrequest_input_sequence_idx" btree (input_sequence)
>     "outputrequest_reply_input_sequence_idx" btree (reply_input_sequence)
>
>
I wonder if our choice of primary keys (input_sequence and output_sequence)
has something to do with the skew in the hash bucket distribution. We use
the following format: mmddxx, where the middle part is more or less a
sequence and xx is the node generating the id, i.e. 01, 02, etc. (with only
one or two distinct values in the dataset).
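A quick way to look for this kind of key skew would be to check the fan-out
of the join keys directly; the query below is only a sketch of that idea
(table and column names taken from the definitions above):

SELECT reply_input_sequence, count(*) AS cnt
FROM outputrequest
GROUP BY reply_input_sequence
ORDER BY cnt DESC
LIMIT 20;
-- and similarly for input_sequence / originalrequest_id on the other join conditions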

I wonder if it would be difficult to have an upper limit on the private
memory that can be allocated by one process (or by all processes, similar to
Oracle's pga_aggregate_limit). I would rather have one query fail with an
error message than have postgres eat up all memory and swap on the server.

Best regards,
Radu


Re: Postgresql OOM

2024-06-07 Thread Radu Radutiu
On Fri, Jun 7, 2024 at 7:59 PM Andres Freund wrote:

> Hi,
>
> On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote:
> > I have a query that forces an out of memory error, where the OS will kill
> > the postgresql process.
>
> FWIW, it can be useful to configure the OS with strict memory overcommit.
> That causes postgres to fail more gracefully, because the OOM killer won't
> be invoked.
>
>
> > The query plan (run immediately after a vacuum analyze) is at
> > https://explain.depesz.com/s/ITQI#html .
>
> Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the
> number of workers? It'd be useful to get some of the information about
> the actual numbers of tuples etc.
>

Hi,
I've tried first giving more memory to the OS and mounting a tmpfs
in pgsql_tmp. It didn't work; I got:
ERROR:  invalid DSA memory alloc request size 1140850688
CONTEXT:  parallel worker
I've seen around 2 million temporary files created before the crash.
With work_mem at 100 MB I was not able to get it to work with 2 parallel
workers. Next, I increased work_mem to 200 MB and now (with extra memory and
tmpfs) it finished: https://explain.depesz.com/s/NnRC

Radu


Re: Postgresql OOM

2024-06-10 Thread Radu Radutiu
> FWIW, it can be useful to configure the OS with strict memory overcommit.
> That causes postgres to fail more gracefully, because the OOM killer won't
> be invoked.
>

In the current setup the database is used as an embedded db, with the
application sharing the same host as the database. Setting strict memory
overcommit would affect the other application, but configuring LimitAS for
the postgresql systemd service should work (a sketch follows below).
Does this mean that the fact that this query uses more than 100x the
configured work_mem is not considered a bug? Should I create a bug report?
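Regarding LimitAS above, a systemd drop-in for the service would be one way
to cap the backend's address space. The unit name, file path, and the 48G
figure below are placeholders for illustration, not tested values:

# /etc/systemd/system/postgresql.service.d/memory-limit.conf  (hypothetical unit name/path)
[Service]
LimitAS=48G

# then: systemctl daemon-reload && systemctl restart postgresql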


Re: Postgresql OOM

2024-06-12 Thread Radu Radutiu
Hi,


> That's helpful, thanks!
>
> One thing to note is that postgres' work_mem is confusing - it applies to
> individual query execution nodes, not the whole query. Additionally, when
> you use parallel workers, each of the parallel workers can use the "full"
> work_mem, rather than work_mem being evenly distributed across workers.
>
> Within that, the memory usage in the EXPLAIN ANALYZE isn't entirely
> unexpected (I'd say it's unreasonable if you're not a postgres dev, but
> that's a different issue).
>
> We can see each of the Hash nodes use ~1GB, which is due to
> (1 leader + 4 workers) * work_mem = 5 * 200MB = 1GB.
>
> In this specific query we probably could free the memory in the "lower"
> hash join nodes once the node directly above has finished building, but
> we don't have the logic for that today.
>

I would understand 1 GB, or even 2 GB per Hash node (5 processes * 200 MB
work_mem * hash_mem_multiplier, which defaults to 2.0), but the server ran
out of memory with more than 64 GB.

>
> Of course that doesn't explain why the memory usage / temp file creation
> is so extreme with a lower limit / fewer workers.  There aren't any bad
> statistics afaics, nor can I really see a potential for a significant
> skew, it looks to me that the hashtables are all built on a single text
> field and that nearly all the input rows are distinct.
>
> Could you post the table definition (\d+) and the database definition
> (\l). One random guess I have is that you ended up with a
> "non-deterministic" collation for that column and that we end up with a
> bad hash due to that.
>

I was able to eliminate the columns not involved in the query while still
retaining the problematic behavior (that's the reason for the new table
names):
postgres=# \d inreq
                               Table "public.inreq"
          Column          |            Type             | Collation | Nullable | Default
--------------------------+-----------------------------+-----------+----------+---------
 input_sequence           | bigint                      |           | not null |
 msg_type                 | character varying(8)        |           |          |
 originalrequest_id       | bigint                      |           |          |
 receive_time             | timestamp without time zone |           |          |
 related_output_sequence  | bigint                      |           |          |
 msg_status               | character varying(15)       |           |          |
Indexes:
    "inreq_pkey" PRIMARY KEY, btree (input_sequence)
    "inreq_originalrequest_id_idx" btree (originalrequest_id)

postgres=# \d outreq
                        Table "public.outreq"
         Column          |  Type  | Collation | Nullable | Default
-------------------------+--------+-----------+----------+---------
 output_sequence         | bigint |           | not null |
 input_sequence          | bigint |           |          |
 reply_input_sequence    | bigint |           |          |
 related_input_sequence  | bigint |           |          |
Indexes:
    "outreq_pkey" PRIMARY KEY, btree (output_sequence)
    "outreq_input_sequence_idx" btree (input_sequence)
    "outreq_reply_input_sequence_idx" btree (reply_input_sequence)
postgres=# SELECT datname, datcollate FROM pg_database WHERE datname = current_database();
 datname  | datcollate
----------+-------------
 postgres | en_US.UTF-8
(1 row)
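
Since a non-deterministic column collation was one of the guesses above, a
per-column check could look like the sketch below (my own illustration, not
output that was posted in the thread):

SELECT a.attname, c.collname, c.collisdeterministic
FROM pg_attribute a
JOIN pg_collation c ON c.oid = a.attcollation
WHERE a.attrelid = 'inreq'::regclass
  AND a.attnum > 0
  AND a.attcollation <> 0;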

A dump of the two tables above can be found at
https://drive.google.com/file/d/1ep1MYjNzlFaICL3GlPZaMdpOxRG6WCGz/view?usp=sharing
(compressed size 1 GB; size of the database after import 14 GB).

# prepare my_query (timestamp, bigint) as
SELECT t.input_sequence, rec_tro.output_sequence, r.input_sequence,
       rpl_rec_tro.output_sequence, rpl_snd_tro.output_sequence,
       snd_tro.output_sequence, t.msg_type
FROM inreq t
LEFT JOIN outreq rec_tro     ON rec_tro.input_sequence = t.input_sequence
LEFT JOIN inreq r            ON r.originalRequest_id = t.input_sequence
LEFT JOIN outreq rpl_rec_tro ON rpl_rec_tro.input_sequence = r.input_sequence
LEFT JOIN outreq rpl_snd_tro ON rpl_snd_tro.reply_input_sequence = r.input_sequence
LEFT JOIN outreq snd_tro     ON snd_tro.reply_input_sequence = t.input_sequence
WHERE t.receive_time < $1 AND t.input_sequence < $2
  AND t.msg_status IN ('COMPLETED', 'REJECTED')
ORDER BY t.msg_status DESC, t.input_sequence;
# EXPLAIN (ANALYZE, BUFFERS) EXECUTE my_query('2024-05-18 00:00:00', 202406020168279904);

Best Regards,
Radu