Hi Arrow developers,

I ran into a memory footprint issue after releasing a record batch manually. The logic of my program is:

0. read many record batches
1. process these batches
2. dump the intermediate results to disk
3. close the batches
4. run the logic for other operations
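In outline, the flow looks like the minimal sketch below (MakeBatches() and ProcessAndDump() are just placeholder stubs standing in for stages #0-#2; the bytes_allocated() calls are only there to read Arrow's own accounting, assuming the default memory pool is used):

#include <arrow/api.h>
#include <iostream>
#include <memory>
#include <vector>

// Placeholder stubs standing in for the real stages #0-#2.
std::vector<std::shared_ptr<arrow::RecordBatch>> MakeBatches() { return {}; }
void ProcessAndDump(const std::vector<std::shared_ptr<arrow::RecordBatch>>&) {}

int main() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  auto batches = MakeBatches();  // stage #0: read many record batches
  ProcessAndDump(batches);       // stages #1-#2: process, dump results to disk
  std::cout << "before close: " << pool->bytes_allocated() << " bytes\n";

  batches.clear();               // stage #3: drop all shared_ptr owners
  std::cout << "after close:  " << pool->bytes_allocated() << " bytes\n";

  // stage #4: other operations; I expect the footprint to have dropped by now
  return 0;
}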
I expected the memory footprint to drop after stage #3, but it looks like the memory is not released. I then wrote a small test program to check the behavior. Running under GDB, the RecordBatch destructor is indeed called inside input_batch.reset(), yet the memory is not released until I kill the whole program. I understand that the lifetime of a RecordBatch is controlled by the number of owners of its shared_ptr, so it will be released eventually, but are there any APIs or other ways to release the memory manually in the middle of my program? The test code snippet is below. Thanks!

=======
#include <arrow/api.h>
#include <arrow/ipc/json_simple.h>  // arrow::ipc::internal::json::ArrayFromJSON
#include <cassert>
#include <chrono>
#include <memory>
#include <string>
#include <thread>
#include <vector>

using arrow::Array;
using arrow::RecordBatch;
using arrow::field;
using arrow::float64;
using arrow::uint32;

auto f0 = field("f0", float64());
auto f1 = field("f1", uint32());
auto sch = arrow::schema({f0, f1});
std::vector<std::string> input_data_string = {
    "[10, NaN, 4, 50, 52, 32, 11]",
    "[11, 13, 5, 51, null, 33, 12]"};

// Prepare the input RecordBatch: build one Array per schema field from JSON.
std::vector<std::shared_ptr<Array>> array_list;
int length = -1;
int i = 0;
for (auto data : input_data_string) {
  std::shared_ptr<Array> a0;
  ASSERT_NOT_OK(arrow::ipc::internal::json::ArrayFromJSON(
      sch->field(i++)->type(), data.c_str(), &a0));
  if (length == -1) {
    length = a0->length();
  }
  assert(length == a0->length());
  array_list.push_back(a0);
}

auto input_batch = RecordBatch::Make(sch, length, std::move(array_list));
input_batch.reset();  // should be freed here?

// Keep the process alive so the footprint can be observed (e.g. with top).
std::this_thread::sleep_for(std::chrono::seconds(20));
=======

thanks,
-yuan