[
https://issues.apache.org/jira/browse/IMPALA-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930463#comment-17930463
]
ASF subversion and git services commented on IMPALA-10338:
----------------------------------------------------------
Commit d682e226d5220e828ce473f7f9a7aa8898c20f32 in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d682e226d ]
IMPALA-10338: Deflake test_queue_reasons_slots in ASAN build.
test_queue_reasons_slots has been flaky in ASAN build for not completing
all of its test queries within 60 seconds from submission time. In the
latest occurrence, the 4th out of 5 queries submitted in parallel does
not get admitted after 60 seconds. It then ends in ERROR state, while
the tests expect it to ends in FINISHED state.
This patch attempt to deflake the tests by increasing slots and MT_DOP
option from 4 to 8 in test_queue_reasons_slots. The test query is a
simple GROUP BY query that is expected to run faster with increased
degree of parallelism.
Testing:
- Loop and pass the test 100x in ASAN build.
Change-Id: I4546aa3ce66c480504685e842a4b610a9a7e01ee
Reviewed-on: http://gerrit.cloudera.org:8080/22530
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> TestAdmissionController.test_queue_reasons_slots flaky because of
> slow/hanging fragment
> ---------------------------------------------------------------------------------------
>
> Key: IMPALA-10338
> URL: https://issues.apache.org/jira/browse/IMPALA-10338
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.0.0
> Reporter: Tim Armstrong
> Assignee: Riza Suminto
> Priority: Major
> Labels: broken-build, flaky, hang
> Attachments: failure-output.txt, impalad.ERROR, impalad.INFO
>
>
> This is on an s3 debug build, commit 5a00a4c06f8ec40a8867dcbc036cf5bb47b8a3be
> {noformat}
> custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots
> (from pytest)
> Failing for the past 1 build (Since Failed#673 )
> Took 1 min 58 sec.
> add description
> Error Message
> Timeout: query 504b3a2511f3cd0e:e27bec6b00000000 did not reach one of the
> expected states [4], last known state 3
> Stacktrace
> custom_cluster/test_admission_controller.py:967: in test_queue_reasons_slots
> TIMEOUT_S, config_options={"mt_dop": 4})
> custom_cluster/test_admission_controller.py:277: in
> _execute_and_collect_profiles
> state = self.wait_for_any_state(handle, expected_states, timeout_s)
> common/impala_test_suite.py:1081: in wait_for_any_state
> actual_state))
> E Timeout: query 504b3a2511f3cd0e:e27bec6b00000000 did not reach one of the
> expected states [4], last known state 3
> {noformat}
> Those numbers are beeswax QUeryStates:
> {noformat}
> enum QueryState {
> CREATED = 0
> INITIALIZED = 1
> COMPILED = 2
> RUNNING = 3
> FINISHED = 4
> EXCEPTION = 5
> }
> {noformat}
> I.e. it appears to have run for > 60 seconds.
> The timing of that query:
> {noformat}
> I1116 13:25:08.449323 32665 impala-server.cc:1242]
> 504b3a2511f3cd0e:e27bec6b00000000] Registered query
> query_id=504b3a2511f3cd0e:e27bec6b00000000
> session_id=874c2c9eaf2ad730:0004c1bb0ba7b4a7
> I1116 13:25:08.449626 32665 Frontend.java:1532]
> 504b3a2511f3cd0e:e27bec6b00000000] Analyzing query: select
> min(ss_wholesale_cost) from tpcds_parquet.store_sales db: default
> ...
> I1116 13:25:08.567667 367 admission-controller.cc:1532]
> 504b3a2511f3cd0e:e27bec6b00000000] Scheduling query
> 504b3a2511f3cd0e:e27bec6b00000000 with membership version 2
> I1116 13:25:08.567767 367 admission-controller.cc:1590]
> 504b3a2511f3cd0e:e27bec6b00000000] Scheduling for executor group:
> default-pool-group1 with 3 executors
> I1116 13:25:08.643026 367 admission-controller.cc:1640]
> 504b3a2511f3cd0e:e27bec6b00000000] Trying to admit query to pool default-pool
> in executor group default-pool-group1 (3 executors)
> ...
> I1116 13:25:49.184185 32432 admission-controller.cc:1811] Admitting from
> queue: query=504b3a2511f3cd0e:e27bec6b00000000
> I1116 13:25:49.184196 32432 admission-controller.cc:1903] For Query
> 504b3a2511f3cd0e:e27bec6b00000000 per_backend_mem_limit set to: -1.00 B
> per_backend_m
> em_to_admit set to: 114.02 MB coord_backend_mem_limit set to: -1.00 B
> coord_backend_mem_to_admit set to: 114.02 MB
> I1116 13:25:49.184350 367 admission-controller.cc:1288]
> 504b3a2511f3cd0e:e27bec6b00000000] Admitted queued query
> id=504b3a2511f3cd0e:e27bec6b00000000
> I1116 13:25:49.184370 367 admission-controller.cc:1289]
> 504b3a2511f3cd0e:e27bec6b00000000] Final: agg_num_running=1,
> agg_num_queued=0, agg_mem_reserved
> =17.25 KB, local_host(local_mem_admitted=342.05 MB, num_admitted_running=1,
> num_queued=0, backend_mem_reserved=17.25 KB, topN_query_stats: queries=[0a42
> 32658ea48bc5:c015646900000000, 5b470ebea7782154:ea22bdb000000000,
> 554d88d8f812e22d:efbaf75200000000, aa4a301189a4d144:fdc17ff700000000],
> total_mem_consum
> ed=17.25 KB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=4,
> min=0, max=17.25 KB, pool_total_mem=17.25 KB, average_per_query=4.31 KB)
> I1116 13:25:49.185214 367 impala-server.cc:2062]
> 504b3a2511f3cd0e:e27bec6b00000000] Registering query locations
> I1116 13:25:49.185261 367 coordinator.cc:149]
> 504b3a2511f3cd0e:e27bec6b00000000] Exec()
> query_id=504b3a2511f3cd0e:e27bec6b00000000 stmt=select min(ss_w
> holesale_cost) from tpcds_parquet.store_sales
> I1116 13:25:49.186172 367 coordinator.cc:473]
> 504b3a2511f3cd0e:e27bec6b00000000] starting execution on 3 backends for
> query_id=504b3a2511f3cd0e:e27bec6
> b00000000
> I1116 13:25:49.189028 32071 control-service.cc:142]
> 504b3a2511f3cd0e:e27bec6b00000000] ExecQueryFInstances():
> query_id=504b3a2511f3cd0e:e27bec6b00000000
> coord=impala-ec2-centos74-m5-4xlarge-ondemand-018e.vpc.cloudera.com:27000
> #instances=5
> I1116 13:25:49.192133 367 coordinator.cc:532]
> 504b3a2511f3cd0e:e27bec6b00000000] started execution on 3 backends for
> query_id=504b3a2511f3cd0e:e27bec6b
> 00000000
> ...
> I1116 13:26:49.278456 32665 impala-server.cc:1320] UnregisterQuery():
> query_id=504b3a2511f3cd0e:e27bec6b00000000
> I1116 13:26:49.278478 32665 coordinator.cc:672] ExecState: query
> id=504b3a2511f3cd0e:e27bec6b00000000 execution cancelled
> I1116 13:26:49.278493 32665 coordinator-backend-state.cc:921]
> query_id=504b3a2511f3cd0e:e27bec6b00000000 target backend=127.0.0.1:27000:
> Sending CancelQueryFInstances rpc
> ...
> {noformat}
> As far as when the instances finished - 13:25:55
> {noformat}
> tarmstrong@tarmstrong-box:~/tmplogs8/impala-asf-master-core-s3-DEBUG-673/logs/custom_cluster_tests/relevant$
> grep -h -R 504b3a2511f3cd0e.*'Instance' *
> I1116 13:25:49.189028 32071 control-service.cc:142]
> 504b3a2511f3cd0e:e27bec6b00000000] ExecQueryFInstances():
> query_id=504b3a2511f3cd0e:e27bec6b00000000
> coord=impala-ec2-centos74-m5-4xlarge-ondemand-018e.vpc.cloudera.com:27000
> #instances=5
> I1116 13:25:55.125044 640 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000002] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000002 #in-flight=4 status=OK
> I1116 13:25:55.136926 642 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000004] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000004 #in-flight=3 status=OK
> I1116 13:25:55.139536 639 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000001] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000001 #in-flight=2 status=OK
> I1116 13:25:55.144346 641 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000003] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000003 #in-flight=1 status=OK
> I1116 13:26:49.278493 32665 coordinator-backend-state.cc:921]
> query_id=504b3a2511f3cd0e:e27bec6b00000000 target backend=127.0.0.1:27000:
> Sending CancelQueryFInstances rpc
> I1116 13:26:49.278884 32665 coordinator-backend-state.cc:921]
> query_id=504b3a2511f3cd0e:e27bec6b00000000 target backend=127.0.0.1:27002:
> Sending CancelQueryFInstances rpc
> I1116 13:26:49.278980 643 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000000] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000000 #in-flight=0 status=CANCELLED:
> Cancelled
> I1116 13:25:49.190251 32005 control-service.cc:142]
> 504b3a2511f3cd0e:e27bec6b00000000] ExecQueryFInstances():
> query_id=504b3a2511f3cd0e:e27bec6b00000000
> coord=impala-ec2-centos74-m5-4xlarge-ondemand-018e.vpc.cloudera.com:27000
> #instances=4
> I1116 13:25:55.015192 645 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000006] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000006 #in-flight=3 status=OK
> I1116 13:25:55.015666 646 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000007] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000007 #in-flight=2 status=OK
> I1116 13:25:55.026870 644 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000005] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000005 #in-flight=1 status=OK
> I1116 13:25:55.034817 647 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000008] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000008 #in-flight=0 status=OK
> I1116 13:25:49.191587 31720 control-service.cc:142]
> 504b3a2511f3cd0e:e27bec6b00000000] ExecQueryFInstances():
> query_id=504b3a2511f3cd0e:e27bec6b00000000
> coord=impala-ec2-centos74-m5-4xlarge-ondemand-018e.vpc.cloudera.com:27000
> #instances=4
> I1116 13:25:55.641626 648 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b00000009] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b00000009 #in-flight=3 status=OK
> I1116 13:25:55.646078 649 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b0000000a] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b0000000a #in-flight=2 status=OK
> I1116 13:25:55.650281 652 query-state.cc:906]
> 504b3a2511f3cd0e:e27bec6b0000000b] Instance completed.
> instance_id=504b3a2511f3cd0e:e27bec6b0000000b #in-flight=1 status=OK
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]