Yep, those completions hold bufferlist references IIRC, so they're
definitely keeping the memory buffers in place!
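
For reference, a minimal sketch of the readobj() from the original post
with the missing cleanup added (rados_aio_release() for the completion,
plus rados_release_read_op() for the op), which should let librados drop
those bufferlist references; same includes as the original read.c:

    void readobj(rados_ioctx_t* io, char objname[]) {
        char data[1000000];
        unsigned long bytes_read;
        rados_completion_t completion;
        int retval;

        rados_read_op_t read_op = rados_create_read_op();
        rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
        retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
        assert(retval == 0);

        retval = rados_aio_read_op_operate(read_op, *io, completion,
                                           objname, 0);
        assert(retval == 0);

        rados_aio_wait_for_complete(completion);
        rados_aio_get_return_value(completion);

        /* release the completion and the op so librados can drop its
           bufferlist references and actually free the buffers */
        rados_aio_release(completion);
        rados_release_read_op(read_op);
    }
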
On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley <cbod...@redhat.com> wrote:

>
>
> On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
> > Hi all,
> >
> > We're reading from a Ceph Luminous pool using the librados asynchronous
> > I/O API. We're seeing some concerning memory usage patterns when we
> > read many objects in sequence.
> >
> > The expected behaviour is that our memory usage stabilises at a small
> > amount, since we're just fetching objects and ignoring their data.
> > What we instead find is that the memory usage of our program grows
> > linearly with the amount of data read for an interval of time, and
> > then continues to grow at a much slower but still consistent pace.
> > This memory is not freed until program termination. My guess is that
> > this is an issue with Ceph's memory allocator.
> >
> > To demonstrate, we create 20000 objects of size 10KB, and of size
> > 100KB, and of size 1MB:
> >
> >     #include <stdio.h>
> >     #include <stdlib.h>
> >     #include <string.h>
> >     #include <rados/librados.h>
> >
> >     int main() {
> >         rados_t cluster;
> >         rados_create(&cluster, "test");
> >         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> >         rados_connect(cluster);
> >
> >         rados_ioctx_t io;
> >         rados_ioctx_create(cluster, "test", &io);
> >
> >         char data[1000000];
> >         memset(data, 'a', 1000000);
> >
> >         char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
> >         int i;
> >         for (i = 0; i < 20000; i++) {
> >             sprintf(smallobj_name, "10kobj_%d", i);
> >             rados_write(io, smallobj_name, data, 10000, 0);
> >
> >             sprintf(mediumobj_name, "100kobj_%d", i);
> >             rados_write(io, mediumobj_name, data, 100000, 0);
> >
> >             sprintf(largeobj_name, "1mobj_%d", i);
> >             rados_write(io, largeobj_name, data, 1000000, 0);
> >
> >             printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
> >                    smallobj_name, mediumobj_name, largeobj_name);
> >         }
> >
> >         return 0;
> >     }
> >
> >     $ gcc create.c -lrados -o create
> >     $ ./create
> >     wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of size 1000000
> >     wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of size 1000000
> >     [...]
> >     wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000, 1mobj_19998 of size 1000000
> >     wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000, 1mobj_19999 of size 1000000
> >
> > Now we read each of these objects with the async API, into the same
> > buffer. First we read just the 10KB objects:
> >
> >     #include <assert.h>
> >     #include <stdio.h>
> >     #include <stdlib.h>
> >     #include <string.h>
> >     #include <rados/librados.h>
> >
> >     void readobj(rados_ioctx_t* io, char objname[]);
> >
> >     int main() {
> >         rados_t cluster;
> >         rados_create(&cluster, "test");
> >         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> >         rados_connect(cluster);
> >
> >         rados_ioctx_t io;
> >         rados_ioctx_create(cluster, "test", &io);
> >
> >         char smallobj_name[16];
> >         int i, total_bytes_read = 0;
> >
> >         for (i = 0; i < 20000; i++) {
> >             sprintf(smallobj_name, "10kobj_%d", i);
> >             readobj(&io, smallobj_name);
> >
> >             total_bytes_read += 10000;
> >             printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
> >         }
> >
> >         getchar();
> >         return 0;
> >     }
> >
> >     void readobj(rados_ioctx_t* io, char objname[]) {
> >         char data[1000000];
> >         unsigned long bytes_read;
> >         rados_completion_t completion;
> >         int retval;
> >
> >         rados_read_op_t read_op = rados_create_read_op();
> >         rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
> >         retval = rados_aio_create_completion(NULL, NULL, NULL,
> >                                              &completion);
> >         assert(retval == 0);
> >
> >         retval = rados_aio_read_op_operate(read_op, *io, completion,
> >                                            objname, 0);
> >         assert(retval == 0);
> >
> >         rados_aio_wait_for_complete(completion);
> >         rados_aio_get_return_value(completion);
> >     }
> >
> >     $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
> >     Read 10kobj_0 for total 10000
> >     Read 10kobj_1 for total 20000
> >     [...]
> >     Read 10kobj_19998 for total 199990000
> >     Read 10kobj_19999 for total 200000000
> >
> > We read 200MB. A graph of the resident set size of the program is
> > attached as mem-graph-10k.png, with seconds on the x axis and KB on the
> > y axis. You can see that the memory usage increases throughout, which
> > itself is unexpected since that memory should be freed over time and
> > we should only hold 10KB of object data in memory at a time. The rate
> > of growth decreases and eventually stabilises, and by the end we've
> > used 60MB of RAM.
> >
> > We repeat this experiment for the 100KB and 1MB objects and find that
> > after all reads they use 140MB and 500MB of RAM respectively, and memory
> > usage presumably would continue to grow if there were more objects. This
> > is orders of magnitude more memory than I would expect these programs
> > to use.
> >
> >   * We do not get this behaviour with the synchronous API, and the
> >     memory usage remains stable at just a few MB (a synchronous
> >     version is sketched after this list).
> >   * We've found that for some reason, this doesn't happen (or doesn't
> >     happen as severely) if we intersperse large reads with much
> >     smaller reads. In this case, the memory usage seems to stabilise
> >     at a reasonable number.
> >   * Valgrind only reports a trivial amount of unreachable memory.
> >   * Memory usage doesn't increase in this manner if we repeatedly read
> >     the same object over and over again. It hovers around 20MB.
> >   * In other experiments we've done, with different object data and
> >     distributions of object sizes, we've seen memory usage grow even
> >     larger in proportion to the amount of data read.
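> >
> > For comparison, the synchronous version of the read, using rados_read()
> > in place of the read op machinery (a minimal sketch):
> >
> >     void readobj_sync(rados_ioctx_t* io, char objname[]) {
> >         char data[1000000];
> >
> >         /* blocks until the read completes, so no completion object
> >            sticks around holding buffer references */
> >         int retval = rados_read(*io, objname, data, 10000, 0);
> >         assert(retval >= 0);
> >     }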
> >
> > We maintain long-running (order of weeks) services that read objects
> > from Ceph and send them elsewhere. Over time, the memory usage of some
> > of these services has grown to more than 6GB, which is unreasonable.
> >
> > --
> > Regards,
> > Dan G
> >
> >
>
> It looks like the async example is missing calls to rados_aio_release()
> to clean up the completions. I'm not sure that would account for all of
> the memory growth, but that's where I would start. Past that, running
> the client under valgrind massif should help with further investigation.
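>
> Something along these lines should do it (the massif output file name
> includes the client's pid, so adjust accordingly):
>
>     $ valgrind --tool=massif ./read_small
>     $ ms_print massif.out.<pid>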
>
> Casey