Hi Weston, thank you for the input! I was watching memory usage with "top -p `pidof test`", and the size of the resident memory is not reduced.
With the new counter I saw the memory is freed immediately on the Arrow side, so this is related to my allocator. I actually disabled jemalloc/mimalloc during the Arrow build, but didn't realize the glibc allocator would also behave this way. I'll do more debugging on the allocator side then. Thanks again!

thanks, -yuan

On Fri, Jun 18, 2021 at 10:21 AM Weston Pace <weston.p...@gmail.com> wrote:
> The only owner of input_batch that I can see here is the shared_ptr
> that you are resetting, so I would expect the memory to be freed.
>
> How are you measuring memory usage? The dynamic allocators (mimalloc
> / jemalloc) don't always release memory as soon as they possibly can.
> Even malloc will sometimes be forced to hang onto memory due to
> fragmentation issues, etc. Can you try measuring memory usage with
> arrow::default_memory_pool()->bytes_allocated(); ?
>
> On Thu, Jun 17, 2021 at 3:48 PM ZHOU Yuan <dunk...@gmail.com> wrote:
> >
> > Hi Arrow developers,
> >
> > I ran into a memory footprint issue after releasing a record batch
> > manually. The logic of my program is:
> > 0. read many record batches
> > 1. process these batches
> > 2. dump the intermediate results to disk
> > 3. close the batches
> > 4. logic for other operations
> >
> > I expect the memory footprint to drop after stage #3, but it looks
> > like the memory is not released. I then wrote a small test program to
> > check the behavior. Running under GDB, the destructor of the record
> > batch is indeed called in "input_batch.reset()", but the memory is not
> > released until I kill the whole program.
> >
> > I understand the lifetime of the record batch is controlled by the
> > number of owners of the shared_ptr, so it will be released eventually,
> > but are there any APIs or ways to release it manually in the middle of
> > my program?
> >
> > Attached is the testing code snippet. Thanks!
> >
> > =======
> > auto f0 = field("f0", float64());
> > auto f1 = field("f1", uint32());
> > auto sch = arrow::schema({f0, f1});
> >
> > std::vector<std::string> input_data_string = {
> >     "[10, NaN, 4, 50, 52, 32, 11]",
> >     "[11, 13, 5, 51, null, 33, 12]"};
> >
> > // prepare input record batch
> > std::vector<std::shared_ptr<Array>> array_list;
> > int length = -1;
> > int i = 0;
> > for (auto data : input_data_string) {
> >   std::shared_ptr<Array> a0;
> >   ASSERT_NOT_OK(arrow::ipc::internal::json::ArrayFromJSON(
> >       sch->field(i++)->type(), data.c_str(), &a0));
> >   if (length == -1) {
> >     length = a0->length();
> >   }
> >   assert(length == a0->length());
> >   array_list.push_back(a0);
> > }
> >
> > auto input_batch = RecordBatch::Make(sch, length, std::move(array_list));
> >
> > input_batch.reset();  // should be freed here?
> > std::this_thread::sleep_for(std::chrono::seconds(20));
> >
> > thanks, -yuan