Hi Kalle,

Strangely and luckily, in our case the memory explosion didn't recur
after that incident, so I can mostly only offer moral support.

But if this bug indeed appeared between 14.2.8 and 14.2.13, then I
think this is suspicious:

   b670715eb4 osd/PeeringState: do not trim pg log past last_update_ondisk

   https://github.com/ceph/ceph/commit/b670715eb4

Given that it adds a case where the pg_log is not trimmed, I wonder if
there could be an unforeseen condition where `last_update_ondisk`
isn't updated correctly, so that the OSD stops trimming the pg_log
altogether.
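
For what it's worth, my mental model of that change is roughly the clamp
below (just a sketch, not the actual Ceph code; the names and types are
approximations):

   #include <algorithm>

   // Stand-in for Ceph's eversion_t, just enough to compare versions.
   struct eversion_t {
     unsigned epoch = 0;
     unsigned long version = 0;
     bool operator<(const eversion_t &o) const {
       return epoch < o.epoch || (epoch == o.epoch && version < o.version);
     }
   };

   // The trim target is never allowed past last_update_ondisk.  If
   // last_update_ondisk stalls for whatever reason, the trim point
   // stalls with it and the pg_log can only grow from there.
   eversion_t clamp_trim_to(eversion_t desired, eversion_t last_update_ondisk) {
     return std::min(desired, last_update_ondisk);
   }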

Xie or Samuel: does that sound possible?

Cheers, Dan

On Tue, Nov 17, 2020 at 9:35 AM Kalle Happonen <kalle.happo...@csc.fi> wrote:
>
> Hello all,
> wrt: 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/
>
> Yesterday we hit a problem with osd_pglog memory, similar to the thread above.
>
> We have a 56-node object storage (S3+SWIFT) cluster with 25 OSD disks per 
> node. We run 8+3 EC for the data pool (metadata is on a replicated NVMe pool).
>
> The cluster has been running fine, and (as is relevant to the thread above) the 
> memory usage has been stable at ~100 GB per node. We've been running with the 
> default pg_log length of 3000. The user traffic doesn't seem to have been 
> exceptional lately.
>
> Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the 
> memory usage on the OSD nodes started to grow. On each node it grew steadily 
> by about 30 GB/day, until the servers started OOM-killing OSD processes.
>
> After a lot of debugging we found that the pg_logs were huge. The pg_log of 
> each OSD process had grown to ~22 GB, which we naturally didn't have the 
> memory for, and the cluster ended up in an unstable state. This is 
> significantly more than the 1.5 GB in the post above. We do have ~20k PGs, 
> which may directly affect the size.
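>
> (Rough back-of-the-envelope, assuming most of those ~20k PGs are in the 
> 8+3 pool: 20k PGs x 11 shards over 56 x 25 = 1400 OSDs is roughly 157 PG 
> shards per OSD, or ~470k log entries at the default 3000. 22 GB would 
> then mean tens of kilobytes per entry, far more per entry than we'd 
> expect, so it seems the logs grew far past 3000 entries rather than the 
> PG count alone explaining it.)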
>
> We've reduced the pg_log length to 500, started offline trimming it where we 
> can, and otherwise just waited. The pg_log size dropped to ~1.2 GB on at least 
> some nodes, but we're still recovering, and still have a lot of OSDs down and 
> out.
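>
> For reference, the settings and the offline trim we mean are roughly along 
> these lines (the OSD path and pgid below are just placeholders, and the OSD 
> has to be stopped for ceph-objectstore-tool):
>
>    ceph config set osd osd_max_pg_log_entries 500
>    ceph config set osd osd_min_pg_log_entries 500
>
>    # per stopped OSD, per PG:
>    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
>        --pgid 11.2a --op trim-pg-log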
>
> We're unsure whether version 14.2.13 triggered this, or whether the OSD 
> restarts triggered it (or something unrelated that we don't see).
>
> This mail is mostly to ask whether there are any good guesses as to why the 
> pg_log size per OSD process exploded. Any technical (and moral) support is 
> appreciated. Also, since we're currently not sure whether 14.2.13 triggered 
> this, this is also meant as a data point for other debuggers.
>
> Cheers,
> Kalle Happonen
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
