Hi Arrow developers,
I ran into a memory footprint issue after releasing a record batch
manually. The logic of my program is (a sketch follows the list):
0. read many record batches
1. process these batches
2. dump the intermediate results to disk
3. close the batches
4. logic for other operations
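In code, the shape of the pipeline is roughly as follows. This is a sketch
only; ReadBatches, ProcessAndDump, and OtherOperations are hypothetical
stand-ins for my real code:

#include <memory>
#include <vector>
#include <arrow/record_batch.h>

// Hypothetical stand-ins, shown only to illustrate the flow:
std::vector<std::shared_ptr<arrow::RecordBatch>> ReadBatches();         // step 0
void ProcessAndDump(const std::shared_ptr<arrow::RecordBatch>& batch);  // steps 1-2
void OtherOperations();                                                 // step 4

void Run() {
  auto batches = ReadBatches();
  for (auto& batch : batches) {
    ProcessAndDump(batch);
    batch.reset();  // step 3: drop my reference to the batch
  }
  // Memory footprint should drop here, before step 4.
  OtherOperations();
}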
I expected the memory footprint to drop after step #3; however, it looks
like the memory is not released.
I then wrote a small test program to check the behavior. Running under GDB,
the destructor of RecordBatch
is indeed called inside "input_batch.reset()", but the memory is not
released until I kill the whole program.
I understand that the lifetime of a RecordBatch is controlled by the number
of owners of its shared_ptr, so it will be released eventually,
but are there any APIs or other ways to release the memory manually in the
middle of my program?
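For what it's worth, one way to separate "Arrow still holds the buffers"
from "the allocator has not returned pages to the OS" would be to print
Arrow's own accounting around the reset. This is only a sketch, and it
assumes the buffers were allocated from the default memory pool:

#include <cstdio>
#include <arrow/memory_pool.h>

// Print how many bytes Arrow's default pool currently has allocated.
void PrintPoolUsage(const char* label) {
  std::printf("%s: bytes_allocated = %lld\n", label,
              static_cast<long long>(
                  arrow::default_memory_pool()->bytes_allocated()));
}

// e.g. PrintPoolUsage("before reset"); input_batch.reset();
//      PrintPoolUsage("after reset");

If bytes_allocated drops after the reset while the process RSS stays high,
Arrow has freed the memory and the allocator is simply holding on to it.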
Attached is the test code snippet. Thanks!
=======
// Includes are a best guess; exact header paths vary across Arrow versions.
#include <cassert>
#include <chrono>
#include <memory>
#include <string>
#include <thread>
#include <vector>

#include <arrow/api.h>
#include <arrow/ipc/json_simple.h>     // arrow::ipc::internal::json::ArrayFromJSON
#include <arrow/testing/gtest_util.h>  // ASSERT_NOT_OK

using namespace arrow;  // for field(), float64(), uint32(), Array, RecordBatch

// Two-column schema matching the JSON columns below.
auto f0 = field("f0", float64());
auto f1 = field("f1", uint32());
auto sch = arrow::schema({f0, f1});
std::vector<std::string> input_data_string = {
    "[10, NaN, 4, 50, 52, 32, 11]",
    "[11, 13, 5, 51, null, 33, 12]"};

// Prepare the input record batch: build one array per field from JSON.
std::vector<std::shared_ptr<Array>> array_list;
int length = -1;
int i = 0;
for (const auto& data : input_data_string) {
  std::shared_ptr<Array> a0;
  ASSERT_NOT_OK(arrow::ipc::internal::json::ArrayFromJSON(
      sch->field(i++)->type(), data.c_str(), &a0));
  if (length == -1) {
    length = a0->length();
  }
  assert(length == a0->length());  // all columns must be the same length
  array_list.push_back(a0);
}

auto input_batch = RecordBatch::Make(sch, length, std::move(array_list));
input_batch.reset();  // last owner dropped -- the batch should be freed here?
// Sleep so the drop in memory footprint can be observed externally.
std::this_thread::sleep_for(std::chrono::seconds(20));
thanks, -yuan