(impala) 03/03: IMPALA-12306: (Part 2) Make codegen cache tests with symbol emitter more robust

michaelsmith Wed, 15 Nov 2023 09:05:34 -0800

This is an automated email from the ASF dual-hosted git repository.

michaelsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git


commit db5f3b18e403865b9ce592b020b99f88669c4ac6
Author: Daniel Becker <[email protected]>
AuthorDate: Fri Jul 28 16:03:07 2023 +0200

    IMPALA-12306: (Part 2) Make codegen cache tests with symbol emitter more 
robust
    
    The codegen cache tests that include having a symbol emitter (previously
    
TestCodegenCache.{test_codegen_cache_with_asm_module_dir,test_codegen_cache_with_perf_map})
    introduced by IMPALA-12260 were added to ensure we don't produce a
    use-after-free.
    
    There are two problems with these tests:
      1. Setting the codegen cache size correctly in the tests has proved to
         be difficult because new commits and different build types (debug
         vs. release) have a huge effect on what sizes are appropriate. We
         have had many build failures because of this.
    
      2. Use-after-free is undefined behaviour and does not guarantee a
         crash but the tests rely on the crash to catch the bug described in
         IMPALA-12260.
    
    This change solves the second problem. The tests added by IMPALA-12260
    relied on a crash in the situation described there:
    'LlvmCodeGen::symbol_emitter_' is registered as an event listener with
    the current 'llvm::ExecutionEngine', then the engine is cached but the
    'LlvmCodeGen' object, which owns the symbol emitter, is destroyed at the
    end of the query. When the cached execution engine is destroyed later,
    it frees any remaining object files and notifies the symbol emitter
    about this, but the symbol emitter has already been destroyed so its
    pointer is invalid (use-after-free).
    
    However, we can't rely on the crash to detect the use-after-free because
    1) the crash is not guaranteed to happen, use-after-free is undefined
       behaviour
    2) the crash may happen well after the query has finished returning
       results.
    
    This change solves the problem in the following way:
    In 'CodegenSymbolEmitter' we introduce a counter that is incremented in
    NotifyObjectEmitted() and decremented in NotifyFreeingObject(). At the
    time of the destruction of the 'CodegenSymbolEmitter', this counter
    should be zero - if it is greater than zero, the LLVM execution engine
    to which the 'CodegenSymbolEmitter' is subscribed is still alive and it
    will try to notify the symbol emitter when the object file is freed
    (most likely when the execution engine itself is destroyed), leading to
    use-after-free
    
    We also add a hidden startup flag,
    '--codegen_symbol_emitter_log_successful_destruction_test_only'. When it
    is set to true, 'CodegenSymbolEmitter' will log a message when it is
    being destroyed correctly (i.e. when the counter is zero and
    use-after-free will not happen). We use it in the tests - if we don't
    have the expected message in the logs (after some timeout), the test
    fails.
    
    Testing:
     - modified the tests
       
TestCodegenCache.{test_codegen_cache_with_asm_module_dir,test_codegen_cache_with_perf_map}
       so they reliably detect use-after-free.
    
    Change-Id: I61b9b0de9c896f3de7eb1be7de33d822b1ab70d0
    Reviewed-on: http://gerrit.cloudera.org:8080/20318
    Reviewed-by: Impala Public Jenkins <[email protected]>
    Tested-by: Impala Public Jenkins <[email protected]>
---
 be/src/codegen/codegen-symbol-emitter.cc   | 19 ++++++++++++++++
 be/src/codegen/codegen-symbol-emitter.h    | 15 +++++++++++--
 tests/custom_cluster/test_codegen_cache.py | 36 +++++++++++++++++++++++++++---
 3 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/be/src/codegen/codegen-symbol-emitter.cc 
b/be/src/codegen/codegen-symbol-emitter.cc
index ce8d3c547..cad9108dc 100644
--- a/be/src/codegen/codegen-symbol-emitter.cc
+++ b/be/src/codegen/codegen-symbol-emitter.cc
@@ -43,14 +43,32 @@
 using std::hex;
 using std::rename;
 
+DEFINE_bool_hidden(codegen_symbol_emitter_log_successful_destruction_test_only,
 false,
+    "Log when a 'CodegenSymbolCache' object is destroyed correctly, i.e. "
+    "'non_freed_objects_' is zero. Only used for testing.");
+
 namespace impala {
 
 SpinLock CodegenSymbolEmitter::perf_map_lock_;
 unordered_map<const void*, vector<CodegenSymbolEmitter::PerfMapEntry>>
     CodegenSymbolEmitter::perf_map_;
 
+CodegenSymbolEmitter::~CodegenSymbolEmitter() {
+  if (non_freed_objects_ != 0) {
+    stringstream s;
+    s << "The difference between the number of emitted and freed object files "
+        << "should be zero, but was " << non_freed_objects_ << ".";
+    const string& msg = s.str();
+    DCHECK(false) << msg;
+    LOG(WARNING) << msg;
+  } else if 
(FLAGS_codegen_symbol_emitter_log_successful_destruction_test_only) {
+    LOG(INFO) << "Successful destruction of CodegenSymbolEmitter object.";
+  }
+}
+
 void CodegenSymbolEmitter::NotifyObjectEmitted(const llvm::object::ObjectFile& 
obj,
     const llvm::RuntimeDyld::LoadedObjectInfo& loaded_obj) {
+  non_freed_objects_++;
   vector<PerfMapEntry> perf_map_entries;
 
   ofstream asm_file;
@@ -89,6 +107,7 @@ void CodegenSymbolEmitter::NotifyObjectEmitted(const 
llvm::object::ObjectFile& o
 }
 
 void CodegenSymbolEmitter::NotifyFreeingObject(const llvm::object::ObjectFile& 
obj) {
+  non_freed_objects_--;
   if (emit_perf_map_) {
     lock_guard<SpinLock> perf_map_lock(perf_map_lock_);
     DCHECK(perf_map_.find(obj.getData().data()) != perf_map_.end());
diff --git a/be/src/codegen/codegen-symbol-emitter.h 
b/be/src/codegen/codegen-symbol-emitter.h
index bad30f52b..939010398 100644
--- a/be/src/codegen/codegen-symbol-emitter.h
+++ b/be/src/codegen/codegen-symbol-emitter.h
@@ -41,9 +41,11 @@ namespace impala {
 /// and symbols for perf profiling.
 class CodegenSymbolEmitter : public llvm::JITEventListener {
  public:
-  CodegenSymbolEmitter(std::string id) : id_(id), emit_perf_map_(false) { }
+  CodegenSymbolEmitter(std::string id)
+      : id_(id), emit_perf_map_(false), non_freed_objects_(0)
+  { }
 
-  ~CodegenSymbolEmitter() { }
+  ~CodegenSymbolEmitter();
 
   /// Write the current contents of 'perf_map_' to /tmp/perf-<pid>.map
   /// Atomically updates the current map by writing to a temporary file then 
moving it.
@@ -90,6 +92,15 @@ class CodegenSymbolEmitter : public llvm::JITEventListener {
   /// If true, emit perf map info to /tmp/perf-<pid>.map.
   bool emit_perf_map_;
 
+  /// The number of object files that have been emitted but not freed yet. The 
counter is
+  /// incremented in NotifyObjectEmitted() and decremented in 
NotifyFreeingObject(). At
+  /// the time of the destruction of this object, this should be 0 - if it is 
greater than
+  /// 0, the LLVM execution engine to which this object is subscribed is still 
alive and
+  /// it will try to notify this object when the object file is freed (most 
likely when
+  /// the execution engine itself is destroyed), leading to use-after-free. See
+  /// IMPALA-12306 for more.
+  int non_freed_objects_;
+
   /// File to emit disassembly to. If empty string, don't emit.
   std::string asm_path_;
 
diff --git a/tests/custom_cluster/test_codegen_cache.py 
b/tests/custom_cluster/test_codegen_cache.py
index a10685ec9..560f847fc 100644
--- a/tests/custom_cluster/test_codegen_cache.py
+++ b/tests/custom_cluster/test_codegen_cache.py
@@ -141,6 +141,7 @@ class TestCodegenCache(CustomClusterTestSuite):
       "select sum(identity(bigint_col)) from functional.alltypes", False)
 
   SYMBOL_EMITTER_TESTS_IMPALAD_ARGS = "--cache_force_single_shard=1 \
+      --codegen_symbol_emitter_log_successful_destruction_test_only=1 \
       --codegen_cache_entry_bytes_charge_overhead=10000000 
--codegen_cache_capacity=25MB "
 
   @pytest.mark.execute_serially
@@ -183,12 +184,28 @@ class TestCodegenCache(CustomClusterTestSuite):
     the '--codegen_cache_entry_bytes_charge_overhead' startup flag to
     artificially assign a higher size to the cache entries, compared to which
     the real size, and therefore also changes in the real size, are
-    insignificant."""
+    insignificant.
+
+    This test verifies that the use-after-free scenario doesn't happen. We 
can't rely on
+    the crash to detect it because
+    1) the crash is not guaranteed to happen, use-after-free is undefined 
behaviour
+    2) the crash may happen well after the query has finished returning 
results.
+
+    Therefore in 'CodegenSymbolEmitter' we count how many object files have 
been emitted
+    and freed. If the difference is greater than zero at the time of the 
destruction of
+    the 'CodegenSymbolEmitter', the LLVM execution engine to which the symbol 
emitter is
+    subscribed is still alive and will attempt to notify the symbol emitter 
when it will
+    have already been destroyed, leading to use-after-free.
+
+    When the --codegen_symbol_emitter_log_successful_destruction_test_only 
flag is set to
+    true, 'CodegenSymbolEmitter' will log a message when it is being destroyed 
correctly
+    (i.e. when use-after-free will not happen). If we don't have the expected 
message in
+    the logs (after some timeout), the test fails."""
 
     exec_options = copy(vector.get_value('exec_option'))
     exec_options['exec_single_node_rows_threshold'] = 0
 
-    q1 = """select int_col, tinyint_col from functional_parquet.alltypessmall
+    q1 = """select int_col from functional_parquet.alltypessmall
         order by int_col desc limit 20"""
     q2 = """select t1.bool_col, t1.year, t1.month
          from functional_parquet.alltypes t1
@@ -201,14 +218,27 @@ class TestCodegenCache(CustomClusterTestSuite):
              t1.month"""
 
     self._check_metric_expect_init()
+
+    symbol_emitter_ok_msg = "Successful destruction of CodegenSymbolEmitter 
object."
+
+    # ## First query
     self.execute_query_expect_success(self.client, q1, exec_options)
     cache_entries_in_use = 
self.get_metric('impala.codegen-cache.entries-in-use')
+    cache_entries_evicted = 
self.get_metric('impala.codegen-cache.entries-evicted')
     assert cache_entries_in_use > 0
     assert self.get_metric('impala.codegen-cache.hits') == 0
+    # Initialising the cross-compiled modules also consumes an LLVM executor 
engine.
+    expected_num_msg = cache_entries_evicted + 1
+    self.assert_impalad_log_contains("INFO", symbol_emitter_ok_msg, 
expected_num_msg)
 
+    # ## Second query
     self.execute_query_expect_success(self.client, q2, exec_options)
-    assert self.get_metric('impala.codegen-cache.entries-evicted') >= 
cache_entries_in_use
     assert self.get_metric('impala.codegen-cache.hits') == 0
+    cache_entries_evicted = 
self.get_metric('impala.codegen-cache.entries-evicted')
+    assert cache_entries_evicted >= cache_entries_in_use
+    # Initialising the cross-compiled modules also consumes an LLVM executor 
engine.
+    expected_num_msg = cache_entries_evicted + 1
+    self.assert_impalad_log_contains("INFO", symbol_emitter_ok_msg, 
expected_num_msg)
 
   @pytest.mark.execute_serially
   @CustomClusterTestSuite.with_args(cluster_size=1,

(impala) 03/03: IMPALA-12306: (Part 2) Make codegen cache tests with symbol emitter more robust

Reply via email to