Hi Arrow developers,

I ran into a memory footprint issue after releasing a record batch
manually. The logic of my program is:
0. read many record batches
1. process these batches
2. dump the intermediate results to disk
3. close the batches
4. logic for other operations

I expected the memory footprint to drop after stage #3; however, it
looks like the memory is not released.
I then wrote a small test program to check the behavior. Running it
under GDB, I can see that the RecordBatch destructor is indeed called
inside input_batch.reset(), but the memory is not released until I
terminate the whole program.
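
To separate the two possibilities, that is, Arrow still holding the
buffers versus the allocator caching pages that were already freed, I
can print the pool statistics around the reset() call. Below is a
minimal sketch, assuming the batch is allocated from the default memory
pool; ReportPool is just a helper name I made up for illustration:

  #include <iostream>
  #include <arrow/memory_pool.h>

  // Print how many bytes Arrow's default pool currently has live.
  // If this drops to ~0 after input_batch.reset(), Arrow has freed the
  // buffers, and any remaining process footprint is the allocator
  // (e.g. jemalloc) keeping freed pages cached rather than returning
  // them to the OS.
  void ReportPool(const char* label) {
    std::cout << label << ": "
              << arrow::default_memory_pool()->bytes_allocated()
              << " bytes" << std::endl;
  }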

I understand the lifetime of a RecordBatch is controlled by the number
of owners of its shared_ptr, so the memory will be released eventually,
but are there any APIs or other ways to release it manually in the
middle of my program?
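
I also noticed arrow::jemalloc_set_decay_ms() in arrow/memory_pool.h.
If my build is using the jemalloc-backed pool (an assumption on my
part), would asking jemalloc to return dirty pages immediately help,
along these lines?

  // Hypothetical tweak: set jemalloc's decay time to 0 so unused pages
  // are returned to the OS right away. The call returns an error
  // Status if Arrow was built without jemalloc.
  arrow::Status st = arrow::jemalloc_set_decay_ms(0);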

Attached is the test code snippet. Thanks!

=======
  auto f0 = field("f0", float64());
  auto f1 = field("f1", uint32());
  auto sch = arrow::schema({f0, f1});

  std::vector<std::string> input_data_string = {
      "[10, NaN, 4, 50, 52, 32, 11]",
      "[11, 13, 5, 51, null, 33, 12]"};


  // prepare the input record batch (assumes <arrow/api.h>, <cassert>,
  // the Arrow JSON test helper, and a status-checking ASSERT_NOT_OK
  // macro are available)
  std::vector<std::shared_ptr<arrow::Array>> array_list;
  int64_t length = -1;  // Array::length() returns int64_t
  int i = 0;
  for (const auto& data : input_data_string) {
    std::shared_ptr<arrow::Array> a0;
    ASSERT_NOT_OK(arrow::ipc::internal::json::ArrayFromJSON(
        sch->field(i++)->type(), data.c_str(), &a0));
    if (length == -1) {
      length = a0->length();
    }
    assert(length == a0->length());
    array_list.push_back(a0);
  }

  auto input_batch =
      arrow::RecordBatch::Make(sch, length, std::move(array_list));

  input_batch.reset();  // last shared_ptr owner is gone; should be freed here?
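  // (sketch) with the hypothetical ReportPool helper from above:
  // ReportPool("after reset");  // expect ~0 bytes if Arrow freed the buffers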
  std::this_thread::sleep_for(std::chrono::seconds(20));  // window to inspect RSS
=======

thanks, -yuan
