I have two distinct clusters configured in two different locations, within
a single zonegroup.

Cluster 1 currently holds ~11TB of data: S3/Swift backups written via the
duplicity backup tool. Each file is 25MB, and probably 20% of them are S3
multipart uploads (so 4MB stripes), for a total of 3217k objects. This
cluster has been running for months (without RGW replication) with no
issues. Each site has one RGW instance at the moment.
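
For reference, the object count above is an aggregate of per-bucket
counts; a rough sketch of how I pulled it (the grep pattern assumes the
Jewel-era JSON output of radosgw-admin):

    # list per-bucket object counts across all buckets
    radosgw-admin bucket stats | grep num_objects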

I recently set up the second cluster on identical hardware in a secondary
site and configured a multi-site setup with the two sites in an
active-active configuration. The second cluster has no data of its own, so
I would expect site 1 to start mirroring to site 2, and it does.
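
For context, the multi-site configuration followed the standard Jewel
procedure; a rough sketch of the commands, where the realm/zonegroup/zone
names, endpoints and keys are placeholders rather than my literal setup:

    # on site 1: create the realm, zonegroup and master zone
    radosgw-admin realm create --rgw-realm=backup --default
    radosgw-admin zonegroup create --rgw-zonegroup=main \
        --endpoints=http://site1:80 --master --default
    radosgw-admin zone create --rgw-zonegroup=main --rgw-zone=site1 \
        --endpoints=http://site1:80 --master --default
    radosgw-admin period update --commit

    # on site 2: pull the realm and create the secondary zone
    radosgw-admin realm pull --url=http://site1:80 \
        --access-key=<system-key> --secret=<system-secret>
    radosgw-admin zone create --rgw-zonegroup=main --rgw-zone=site2 \
        --endpoints=http://site2:80
    radosgw-admin period update --commit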

Unfortunately, as soon as RGW syncing starts to run, the resident memory
usage of the radosgw instances on both clusters balloons until the process
is OOM-killed. This isn't a slow leak: in testing I've found that the
radosgw processes on either side can gain up to 300MB of extra RSS per
*second*, completely OOMing a machine with 96GB of RAM in approximately 20
minutes.
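
For anyone trying to reproduce this, the growth rate above came from
sampling /proc once a second; a minimal sketch, assuming a single radosgw
process per host:

    # log radosgw RSS (kB) once per second
    while sleep 1; do
        echo "$(date +%s) $(awk '/VmRSS/ {print $2}' /proc/$(pidof radosgw)/status)"
    done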

If I stop the radosgw processes on one cluster (i.e. breaking replication),
the memory usage of the radosgw processes on the other cluster stays at
around 100-500MB and does not noticeably increase over time.
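
I've been checking the replication state on each side with:

    # reports metadata/data sync state against the peer zone
    radosgw-admin sync status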

Obviously this makes multi-site replication completely unusable, so I'm
wondering if anyone has a fix or workaround. I noticed that some pull
requests with RGW memory leak fixes have been merged into the master
branch, so I switched to v10.2.0-2453-g94fac96 from the autobuild packages;
this seems to slow the memory growth slightly, but not enough to make
replication usable yet.

I've tried running the radosgw process under valgrind, but it doesn't turn
up anything obviously leaking (I could be doing it wrong). An example of
the memory ballooning is captured by collectd:
http://i.imgur.com/jePYnwz.png - this memory usage is *all* radosgw
process RSS.
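
In case the valgrind approach itself is the problem, this is roughly the
invocation I used (the instance name is a placeholder for my actual RGW
instance):

    # run radosgw in the foreground under memcheck with full leak checking
    valgrind --leak-check=full --log-file=radosgw-valgrind.log \
        radosgw -f --cluster ceph --name client.rgw.site1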

Anyone else seen this?