This is an automated email from the ASF dual-hosted git repository.

arawat pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git


The following commit(s) were added to refs/heads/master by this push:
     new e142e48f3 IMPALA-11631 Fix impala crashes in impala::TopNNode::Heap::Close()
e142e48f3 is described below

commit e142e48f30f223e514101a8318503063862f691b
Author: Yida Wu <[email protected]>
AuthorDate: Fri Sep 30 10:18:17 2022 -0700

    IMPALA-11631 Fix impala crashes in impala::TopNNode::Heap::Close()
    
    The bug was introduced by IMPALA-11631: if RematerializeTuples()
    fails in ReclaimTuplePool(), the function returns immediately
    with an error. However, some Heap unique_ptrs in partition_heaps_
    may already have been moved into rematerialized_heaps and set to
    nullptr, while Close() of the TopNNode still calls Close() on
    every Heap unique_ptr in partition_heaps_, which leads to a
    crash.
    
    There are two issues: the call on a nullptr described above, and
    a potential memory leak. Because Heap does not call Close() in
    its destructor, the unique_ptr may not release all of its
    resources properly unless Close() is called explicitly on the
    Heap.
    
    The patch changes the logic: instead of moving each Heap object
    as soon as its rematerialization succeeds, all Heap objects in
    partition_heaps_ are moved only after every rematerialization
    has succeeded. As a result there is never a half-released
    partition_heaps_, and Close() is called on every Heap while the
    TopNNode is closing. Also added a nullptr check for Heaps in
    TopNNode::Close().
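
    The two-pass move described above can be sketched in isolation as
    follows. Heap, ReclaimBuggy, and ReclaimFixed are simplified,
    hypothetical stand-ins for illustration, not the real
    impala::TopNNode code:

    ```cpp
    #include <map>
    #include <memory>
    #include <utility>
    #include <vector>

    // Stand-in for impala::TopNNode::Heap: resources are released only by
    // an explicit Close(), not by the destructor (as in the real Heap).
    struct Heap {
      bool closed = false;
      void Close() { closed = true; }
      // Returns false to simulate a RematerializeTuples() failure.
      bool Rematerialize(bool fail) { return !fail; }
    };

    // Buggy pattern: move each heap out as soon as it rematerializes. An
    // error on a later heap leaves earlier entries as nullptr in
    // partition_heaps, so a subsequent Close() loop dereferences nullptr.
    bool ReclaimBuggy(std::map<int, std::unique_ptr<Heap>>& partition_heaps,
                      std::vector<std::unique_ptr<Heap>>& out, int fail_at) {
      for (auto& entry : partition_heaps) {
        if (!entry.second->Rematerialize(entry.first == fail_at)) return false;
        out.push_back(std::move(entry.second));  // entry.second is now nullptr
      }
      return true;
    }

    // Patched pattern: rematerialize everything first, then move in a
    // second loop, so an early error leaves partition_heaps fully intact.
    bool ReclaimFixed(std::map<int, std::unique_ptr<Heap>>& partition_heaps,
                      std::vector<std::unique_ptr<Heap>>& out, int fail_at) {
      for (auto& entry : partition_heaps) {
        if (!entry.second->Rematerialize(entry.first == fail_at)) return false;
      }
      for (auto& entry : partition_heaps) out.push_back(std::move(entry.second));
      return true;
    }
    ```

    With the fixed version, an error on any heap leaves every pointer in
    the map non-null, so the caller's Close() loop remains safe.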
    
    Because it is difficult for a test case to inject an error that
    triggers this crash, I hacked the code to inject a memory
    allocation failure in certain cases, reproduced the issue, and
    manually verified that the patch resolves it.
    
    Tests:
    Ran core tests.
    Manual test.
    
    Change-Id: Iaf45b6ef777f68e1843c076a935e4189acc6990b
    Reviewed-on: http://gerrit.cloudera.org:8080/19087
    Reviewed-by: Abhishek Rawat <[email protected]>
    Tested-by: Impala Public Jenkins <[email protected]>
---
 be/src/exec/topn-node.cc | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/be/src/exec/topn-node.cc b/be/src/exec/topn-node.cc
index 2e331a6a8..72ceb885f 100644
--- a/be/src/exec/topn-node.cc
+++ b/be/src/exec/topn-node.cc
@@ -519,7 +519,8 @@ void TopNNode::Close(RuntimeState* state) {
   if (is_closed()) return;
   if (heap_ != nullptr) heap_->Close();
   for (auto& entry : partition_heaps_) {
-    entry.second->Close();
+    DCHECK(entry.second != nullptr);
+    if (entry.second != nullptr) entry.second->Close();
   }
   if (tuple_pool_.get() != nullptr) tuple_pool_->FreeAll();
   if (order_cmp_.get() != nullptr) order_cmp_->Close(state);
@@ -691,6 +692,13 @@ Status TopNNode::ReclaimTuplePool(RuntimeState* state) {
     for (auto& entry : partition_heaps_) {
      RETURN_IF_ERROR(entry.second->RematerializeTuples(this, state, temp_pool.get()));
       DCHECK(entry.second->DCheckConsistency());
+    }
+    // The second loop is needed for IMPALA-11631. We only move heaps from partition_heap_
+    // to rematerialized_heaps once all have been rematerialized. Otherwise, in case of
+    // an error, we may call Close() on a nullptr or leak the memory by not explicitly
+    // calling Close() on the heap pointer. Maybe better to add Close() in the Heap
+    // destructor later.
+    for (auto& entry : partition_heaps_) {
       // The key references memory in 'tuple_pool_'. Replace it with a rematerialized
       // tuple.
       rematerialized_heaps.push_back(move(entry.second));
