Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20770 )

Change subject: IMPALA-12426: Query History Table
......................................................................


Patch Set 41:

(24 comments)

http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.h
File be/src/service/workload-management.h:

http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.h@40
PS41, Line 40: wm
Can this renamed with something more explanatory? Perhaps "management" or 
"history"?


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.h@102
PS41, Line 102: std::string
Looks like most columns are primitive types. So this can be replaced with 
TPrimitiveType.
Also clarify in comment if these column types are nullable or not.


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.h@248
PS41, Line 248: if (ctx.record->base_state->stmt_type == TStmtType::SET) {
              :             ctx.sql << "SET";
              :           } else if (ctx.record->base_state->stmt_type == 
TStmtType::QUERY) {
              :             ctx.sql << "QUERY";
              :           } else if (ctx.record->base_state->stmt_type == 
TStmtType::DML) {
              :             ctx.sql << "DML";
              :           } else if (ctx.record->base_state->stmt_type == 
TStmtType::DDL) {
              :             ctx.sql << "DDL";
              :           } else if (ctx.record->base_state->stmt_type == 
TStmtType::EXPLAIN) {
              :             ctx.sql << "EXPLAIN";
              :           } else if (ctx.record->base_state->stmt_type == 
TStmtType::ADMIN_FN) {
              :             ctx.sql << "ADMIN";
              :           } else if (ctx.record->base_state->stmt_type == 
TStmtType::CONVERT) {
              :             ctx.sql << "CONVERT";
              :           } else if (ctx.record->base_state->stmt_type == 
TStmtType::LOAD) {
              :             ctx.sql << "LOAD";
              :           } else if (ctx.record->base_state->stmt_type == 
TStmtType::TESTCASE) {
              :             ctx.sql << "TESTCASE";
              :           } else {
              :             ctx.sql << "N/A";
              :           }
This can be feed directly to StringStream sql. This the how generated 
Types_types.cpp looks like:

const char* _kTStmtTypeNames[] = {                                              
                                                                                
                                                                                
                            
  "QUERY",                                                                      
                                                                                
                                                  
  "DDL",                                                                        
                                                                                
                                                                   
  "DML",                                                                        
                                                                                
                                                                                
    
  "EXPLAIN",                                                                    
                                                                                
                                                                                
                     
  "LOAD",                                                                       
                                                                                
                                                                                
                                      
  "SET",                                                                        
                                                                                
                                                            
  "ADMIN_FN",                                                                   
                                                                                
                                                                             
  "TESTCASE",                                                                   
                                                                                
                                                                                
              
  "CONVERT",                                                                    
                                                                                
                                                                                
                               
  "UNKNOWN"                                                                     
                                                                                
                                                     
};                                                                              
                                                                                
                                                                      
const std::map<int, const char*> 
_TStmtType_VALUES_TO_NAMES(::apache::thrift::TEnumIterator(10, 
_kTStmtTypeValues, _kTStmtTypeNames), ::apache::thrift::TEnumIterator(-1, 
nullptr, nullptr));                                                          
                                                                                
                                                                                
                                                                                
                        
std::ostream& operator<<(std::ostream& out, const TStmtType::type& val) {       
                                                                                
                                                                                
                                         
  std::map<int, const char*>::const_iterator it = 
_TStmtType_VALUES_TO_NAMES.find(val);                                           
                                                                                
         
  if (it != _TStmtType_VALUES_TO_NAMES.end()) {                                 
                                                                                
                                                                       
    out << it->second;                                                          
                                                                                
                                                                                
        
  } else {                                                                      
                                                                                
                                                                                
                         
    out << static_cast<int>(val);                                               
                                                                                
                                                                                
                                          
  }                                                                             
                                                                                
                                                                
  return out;                                                                   
                                                                                
                                                                                
 
}                                                                               
                                                                                
                                                                                
                  
                                                                                
                                                                                
                                                                                
                                   
std::string to_string(const TStmtType::type& val) {                             
                                                                                
                                                         
  std::map<int, const char*>::const_iterator it = 
_TStmtType_VALUES_TO_NAMES.find(val);                                           
                                                                                
                    
  if (it != _TStmtType_VALUES_TO_NAMES.end()) {                                 
                                                                                
                                                                                
   
    return std::string(it->second);                                             
                                                                                
                                                                                
                    
  } else {                                                                      
                                                                                
                                                                                
                                     
    return std::to_string(static_cast<int>(val));                               
                                                                                
                                                           
  }                                                                             
                                                                                
                                                                            
}


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc
File be/src/service/workload-management.cc:

http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc@63
PS41, Line 63: if (boost::algorithm::any_of(val.cbegin(), val.cend(), 
boost::is_any_of("\"';\n"))) {
             :     LOG(ERROR) << "Invalid value for --" << name << ": must not 
contain quotes or "
             :         "semicolons.";
             :     return false;
             :   }
             :
             :   if (!val.empty()) return true;
             :
             :   LOG(ERROR) << "Invalid value for --" << name << ": must not be 
empty.";
             :   return false;
What if this constrained further to only contain alphanumeric and underscore 
chars?
ie., pass regex "^[a-zA-Z0-9_]*$".

This validator might be reusable by other backend flag like cluster_id.


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc@79
PS41, Line 79:   if (val.find_first_of('\'') != string::npos) {
             :     LOG(ERROR) << "Invalid value for --" << name << ": must not 
contain single quotes.";
             :     return false;
             :   }
             :
             :   if (val.find_first_of('"') != string::npos) {
             :     LOG(ERROR) << "Invalid value for --" << name << ": must not 
contain double quotes.";
             :     return false;
             :   }
             :
             :   if (val.find_first_of('\n') != string::npos) {
             :     LOG(ERROR) << "Invalid value for --" << name << ": must not 
contain newlines.";
             :     return false;
             :   }
             :
             :   return true;
Is this mainly validate if val is a legal POSIX path?
Can this simplified by creating boost::path and check if is_absolute() == True?
https://www.boost.org/doc/libs/1_84_0/libs/filesystem/doc/reference.html#path-is_absolute


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc@140
PS41, Line 140: DEFINE_int32_hidden(query_log_shutdown_deadline_s, 30, "Number 
of seconds to wait for "
              :     "the queue of completed queries to be drained to the query 
log table before timing "
              :     "out and continuing the shutdown process. The completed 
queries drain process runs "
              :     "after the shutdown process completes, thus the max 
shutdown time is extended by the "
              :     "value specified in this flag.");
Should this be validated against query_log_write_interval_s?
query_log_write_interval_s < query_log_shutdown_deadline_s.


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc@382
PS41, Line 382:  completed_queries_ticker_ = make_unique<TickerSecondsBool>(
              :         FLAGS_query_log_write_interval_s, 
completed_queries_cv_, completed_queries_lock_);
              :     
ABORT_IF_ERROR(completed_queries_ticker_->Start("impala-server",
              :         "completed-queries-ticker"));
Should this moved inside if, right after L379?


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc@391
PS41, Line 391: lock_guard<mutex> l(completed_queries_threadstate_mu_);
Add DCHECK after this statement:

DCHECK(completed_queries_thread_state_ != SHUTDOWN);


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc@452
PS41, Line 452:       }
After this loop, DCHECK that COMPLETED_QUERIES_QUEUED value >= size of 
queries_to_insert.


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc@474
PS41, Line 474:           ImpaladMetrics::COMPLETED_QUERIES_QUEUED->Increment(
              :               queries_to_insert.size() * -1);
DCHECK that COMPLETED_QUERIES_QUEUED >= 0 after this.


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc@479
PS41, Line 479: LOG(WARNING) << "failed to write completed queries table=\"" << 
table_name <<
              :               "\" record_count=\"" << queries_to_insert.size() 
<< "\"";
COMPLETED_QUERIES_QUEUED is not decremented here. Are they enqueued again and 
retried later?

If yes, please mention that fact in this WARNING log. And what prevent 
completed_queries_ from being retried indefinitely?


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/service/workload-management.cc@481
PS41, Line 481: LOG(WARNING) << ret_status.GetDetail();
Unnecessary extra indentation.


http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/util/thread.h
File be/src/util/thread.h:

http://gerrit.cloudera.org:8080/#/c/20770/41/be/src/util/thread.h@229
PS41, Line 229: /// Can be used to track the lifecycle of a thread.
              : enum ThreadState {
There are some specific *ThreadState defined already such as 
ScannerThreadState, LIRSThreadState.
Maybe this should be kept specific to workload-management (ie., renamed to 
ManagementTreadState) and moved to workload-management.h


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py
File tests/custom_cluster/test_query_log.py:

http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@33
PS41, Line 33: class TestQueryLogTable(CustomClusterTestSuite):
             :   """Tests to assert the query log table is correctly 
populated."""
             :
             :   WM_DB = "sys"
             :   QUERY_TBL = "{0}.impala_query_log".format(WM_DB)
             :   PROTOCOL_BEESWAX = ["beeswax"]
             :   PROTOCOL_HS2 = ["hs2"]
             :   PROTOCOL_ALL = [PROTOCOL_BEESWAX[0], PROTOCOL_HS2[0]]
             :
             :   
@CustomClusterTestSuite.with_args(impalad_args="--enable_workload_mgmt "
             :                                                  
"--query_log_write_interval_s=1 "
             :                                                  
"--cluster_id=test_max_select "
             :                                                  
"--shutdown_grace_period_s=10 "
             :                                                  
"--shutdown_deadline_s=60",
             :                                     
catalogd_args="--enable_workload_mgmt",
             :                                     
impalad_graceful_shutdown=True)
             :   @pytest.mark.parametrize("client_protocol", PROTOCOL_BEESWAX)
             :   def test_query_log_table_almost_max_select(self, 
client_protocol):
Impala test framework use ImpalaTestMatrix to parameterize its test function 
parameters generate permutation of them depending on selected exploration 
strategy (core vs pairwise vs exhaustive). For example, this can be written as 
follow:

class TestQueryLogTableBeeswaxClient(ImpalaTestSuite):                          
                                                                                
                                                                                
     
                                                                                
                                                                                
                                                                                
        
  @classmethod                                                                  
                                                                                
                                                                                
                            
  def add_test_dimensions(cls):                                                 
                                                                                
                                                  
    super(TestQueryLogTableBeeswaxClient, cls).add_test_dimensions()            
                                                                                
                                                                 
    cls.ImpalaTestMatrix.add_dimension(                                         
                                                                                
                                                                    
        ImpalaTestDimension('client_protocol', 'beeswax')


  @CustomClusterTestSuite.with_args(impalad_args="--enable_workload_mgmt "
                                                 
"--query_log_write_interval_s=1 "
                                                 "--cluster_id=test_max_select "
                                                 "--shutdown_grace_period_s=10 "
                                                 "--shutdown_deadline_s=60",
                                    catalogd_args="--enable_workload_mgmt",
                                    impalad_graceful_shutdown=True)
  def test_query_log_table_almost_max_select(self, vector):
    client_protocol = vector.get_value('client_protocol')
    ...

If testing against all client, then the test class can be expressed like this:

class TestQueryLogTableAllClients(ImpalaTestSuite):                             
                                                                                
                                                                            
                                                                                
                                                                                
                                                                                
  
  @classmethod                                                                  
                                                                                
                                                                                
                      
  def add_test_dimensions(cls):                                                 
                                                                                
                                                                                
                                       
    super(TestQueryLogTableAllClients, cls).add_test_dimensions()
    cls.ImpalaTestMatrix.add_dimension(                                         
                                                                                
                                                                
        ImpalaTestDimension('client_protocol', *PROTOCOL_ALL)

That being said, this TestQueryLogTable can be split into 3 test classes based 
on client protocol they are testing: TestQueryLogTableBeeswaxClient, 
TestQueryLogTableHS2Client, and TestQueryLogTableAllClients.

test_query_log_table_query_select can also be contained into its own test class 
because it has buffer_pool_limit dimension.


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@80
PS41, Line 80: assert data[0] == "16774084"
             :       assert data[1] == "935"
Add assert message, mention what indices 0 and 1  are about.


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@168
PS41, Line 168: @pytest.mark.parametrize("client_protocol", PROTOCOL_ALL)
              :   def test_query_log_table_dml(self, client_protocol):
              :     """Asserts the values written to the query log table match 
the values from the
              :        query profile."""
Can this merged with test_query_log_table_ddl? This looks like a superset of 
test_query_log_table_ddl.


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@199
PS41, Line 199: SCRATCH_DIR
Move this constant above, near class name.


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@411
PS41, Line 411: CACHE_DIR
Move this constant above, near class name.


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@431
PS41, Line 431: # Create the test table.
              :       create_tbl_sql = "create table {0} (id INT, product_name 
STRING) " \
              :         "partitioned by (category INT)".format(tbl_name)
              :       create_tbl_results = client.execute(create_tbl_sql)
              :       assert create_tbl_results.success
              :
              :       # Insert some rows into the test table.
              :       insert_sql = "insert into {0} (id,category,product_name) 
VALUES ".format(tbl_name)
              :       for i in range(1, 11):
              :         for j in range(1, 11):
              :           if i * j > 1:
              :             insert_sql += ","
              :
              :           random_product_name = 
"".join(choice(string.ascii_letters)
              :             for _ in range(10))
              :           insert_sql += "({0},{1},'{2}')".format((i * j), i, 
random_product_name)
              :
              :       insert_results = client.execute(insert_sql)
              :       assert insert_results.success
Can this test use existing table such as functional.alltypestiny?


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@473
PS41, Line 473: data[42]
Explain what data[42] is.


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@835
PS41, Line 835: 7
nit: "timeout=7", for clarity.


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@839
PS41, Line 839: FLUSH_INTERNAL_CLUSTER_ID = "test_query_log_max_records_" + 
str(int(time()))
              :   QUERY_COUNT = 2
Move these constant above, near class name. Maybe name them specific to test 
method using it.

Also, why use "str(int(time()))" for cluster_id? Can this and other test use 
fixed cluster_id?


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/custom_cluster/test_query_log.py@908
PS41, Line 908: OTHER_TBL = "completed_queries_table_{0}".format(int(time()))
Move constant above, near class name. Maybe name them specific to test method 
using it.


http://gerrit.cloudera.org:8080/#/c/20770/41/tests/util/workload_management.py
File tests/util/workload_management.py:

http://gerrit.cloudera.org:8080/#/c/20770/41/tests/util/workload_management.py@74
PS41, Line 74: data[index]
Array data is returned to caller of assert_query() function.
Can data changed from array to map of {column_label: value} for easier 
inspection by caller?



--
To view, visit http://gerrit.cloudera.org:8080/20770
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2d2da9d450fba4e789400cfa62927fc25d34f844
Gerrit-Change-Number: 20770
Gerrit-PatchSet: 41
Gerrit-Owner: Jason Fehr <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Comment-Date: Mon, 11 Mar 2024 21:33:55 +0000
Gerrit-HasComments: Yes

Reply via email to