Yep, those completions are maintaining bufferlist references IIRC, so they’re definitely holding the memory buffers in place!
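For what it's worth, a minimal sketch of the cleanup (untested): this is the readobj() from the example quoted below, plus rados_release_read_op() and rados_aio_release() to free the op and the completion once the result has been consumed.

    /* Needs <assert.h> and <rados/librados.h>, as in the original read.c. */
    void readobj(rados_ioctx_t* io, char objname[]) {
        char data[1000000];
        unsigned long bytes_read;
        rados_completion_t completion;
        int retval;

        rados_read_op_t read_op = rados_create_read_op();
        rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
        retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
        assert(retval == 0);

        retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
        assert(retval == 0);

        rados_aio_wait_for_complete(completion);
        rados_aio_get_return_value(completion);

        /* Release the op and the completion; the completion is what
         * pins the reply buffers in memory. */
        rados_release_read_op(read_op);
        rados_aio_release(completion);
    }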
On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley <cbod...@redhat.com> wrote:
>
> On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
> > Hi all,
> >
> > We're reading from a Ceph Luminous pool using the librados
> > asynchronous I/O API. We're seeing some concerning memory usage
> > patterns when we read many objects in sequence.
> >
> > The expected behaviour is that our memory usage stabilises at a
> > small amount, since we're just fetching objects and ignoring their
> > data. What we instead find is that the memory usage of our program
> > grows linearly with the amount of data read for an interval of
> > time, and then continues to grow at a much slower but still
> > consistent pace. This memory is not freed until program
> > termination. My guess is that this is an issue with Ceph's memory
> > allocator.
> >
> > To demonstrate, we create 20000 objects of each of three sizes:
> > 10KB, 100KB and 1MB:
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <rados/librados.h>
> >
> > int main() {
> >     rados_t cluster;
> >     rados_create(&cluster, "test");
> >     rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> >     rados_connect(cluster);
> >
> >     rados_ioctx_t io;
> >     rados_ioctx_create(cluster, "test", &io);
> >
> >     char data[1000000];
> >     memset(data, 'a', 1000000);
> >
> >     char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
> >     int i;
> >     for (i = 0; i < 20000; i++) {
> >         sprintf(smallobj_name, "10kobj_%d", i);
> >         rados_write(io, smallobj_name, data, 10000, 0);
> >
> >         sprintf(mediumobj_name, "100kobj_%d", i);
> >         rados_write(io, mediumobj_name, data, 100000, 0);
> >
> >         sprintf(largeobj_name, "1mobj_%d", i);
> >         rados_write(io, largeobj_name, data, 1000000, 0);
> >
> >         printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
> >                smallobj_name, mediumobj_name, largeobj_name);
> >     }
> >
> >     return 0;
> > }
> >
> > $ gcc create.c -lrados -o create
> > $ ./create
> > wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of size 1000000
> > wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of size 1000000
> > [...]
> > wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000, 1mobj_19998 of size 1000000
> > wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000, 1mobj_19999 of size 1000000
> >
> > Now we read each of these objects back with the async API, into the
> > same buffer.
> > First we read just the 10KB objects:
> >
> > #include <assert.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <rados/librados.h>
> >
> > void readobj(rados_ioctx_t* io, char objname[]);
> >
> > int main() {
> >     rados_t cluster;
> >     rados_create(&cluster, "test");
> >     rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> >     rados_connect(cluster);
> >
> >     rados_ioctx_t io;
> >     rados_ioctx_create(cluster, "test", &io);
> >
> >     char smallobj_name[16];
> >     int i, total_bytes_read = 0;
> >
> >     for (i = 0; i < 20000; i++) {
> >         sprintf(smallobj_name, "10kobj_%d", i);
> >         readobj(&io, smallobj_name);
> >
> >         total_bytes_read += 10000;
> >         printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
> >     }
> >
> >     getchar();
> >     return 0;
> > }
> >
> > void readobj(rados_ioctx_t* io, char objname[]) {
> >     char data[1000000];
> >     unsigned long bytes_read;
> >     rados_completion_t completion;
> >     int retval;
> >
> >     rados_read_op_t read_op = rados_create_read_op();
> >     rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
> >     retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
> >     assert(retval == 0);
> >
> >     retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
> >     assert(retval == 0);
> >
> >     rados_aio_wait_for_complete(completion);
> >     rados_aio_get_return_value(completion);
> > }
> >
> > $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
> > Read 10kobj_0 for total 10000
> > Read 10kobj_1 for total 20000
> > [...]
> > Read 10kobj_19998 for total 199990000
> > Read 10kobj_19999 for total 200000000
> >
> > We read 200MB in total. A graph of the resident set size of the
> > program is attached as mem-graph-10k.png, with seconds on the x axis
> > and KB on the y axis. You can see that the memory usage increases
> > throughout, which is itself unexpected, since that memory should be
> > freed over time and we should only hold 10KB of object data in
> > memory at a time. The rate of growth decreases and eventually
> > stabilises, and by the end we've used 60MB of RAM.
> >
> > We repeat this experiment for the 100KB and 1MB objects and find
> > that after all reads they use 140MB and 500MB of RAM respectively,
> > and memory usage presumably would continue to grow if there were
> > more objects. This is orders of magnitude more memory than I would
> > expect these programs to use.
> >
> > * We do not get this behaviour with the synchronous API, where
> >   memory usage remains stable at just a few MB (a sketch of the
> >   synchronous variant is at the end of this message).
> > * We've found that for some reason this doesn't happen (or doesn't
> >   happen as severely) if we intersperse large reads with much
> >   smaller reads. In that case, memory usage seems to stabilise at a
> >   reasonable level.
> > * Valgrind only reports a trivial amount of unreachable memory.
> > * Memory usage doesn't increase in this manner if we repeatedly read
> >   the same object over and over again; it hovers around 20MB.
> > * In other experiments we've done, with different object data and
> >   distributions of object sizes, we've seen memory usage grow even
> >   larger in proportion to the amount of data read.
> >
> > We maintain long-running (order of weeks) services that read objects
> > from Ceph and send them elsewhere. Over time, the memory usage of
> > some of these services has grown to more than 6GB, which is
> > unreasonable.
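> > For reference, the synchronous variant is essentially just a
> > rados_read() call per object, along these lines (a sketch;
> > readobj_sync is an illustrative name):
> >
> > void readobj_sync(rados_ioctx_t* io, char objname[]) {
> >     char data[1000000];
> >     /* rados_read() blocks and copies straight into our buffer,
> >      * so no completion object is left holding the data. */
> >     int retval = rados_read(*io, objname, data, 10000, 0);
> >     assert(retval >= 0);
> > }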
> > --
> > Regards,
> > Dan G
>
> It looks like the async example is missing calls to rados_aio_release()
> to clean up the completions. I'm not sure that would account for all of
> the memory growth, but that's where I would start. Past that, running
> the client under valgrind massif should help with further investigation.
>
> Casey
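For anyone following along, a massif run against the reproducer would look something like this (a sketch; massif writes a massif.out.<pid> file, which ms_print then summarizes):

    $ valgrind --tool=massif ./read_small
    $ ms_print massif.out.<pid>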
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com