[ 
https://issues.apache.org/jira/browse/IMPALA-12057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004955#comment-18004955
 ] 

ASF subversion and git services commented on IMPALA-12057:
----------------------------------------------------------

Commit 8d56eea72518aa11a36aa086dc8961bc8cdbd1fd in impala's branch 
refs/heads/master from Yida Wu
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8d56eea72 ]

IMPALA-12057: Track removed coordinators to reject queued queries early

Queries in global admission control can remain queued for a long time
if they are assigned to a coordinator that has already left the
cluster. Admissiond can't distinguish between a coordinator that
hasn’t yet been propagated via the statestore and one that has
already been removed, resulting in unnecessary waiting until timeout.
This timeout is determined by either FLAGS_queue_wait_timeout_ms or
the queue_timeout_ms in the pool config. By default,
FLAGS_queue_wait_timeout_ms is 1 minute, but in production it's
normally configured to 10 to 15 minutes.

This change tracks recently removed coordinators and immediately
rejects queued queries assigned to them, using
REASON_COORDINATOR_REMOVED. To keep the removed coordinator list
simple and bounded, it skips duplicate entries and evicts the oldest
entry (FIFO) once the list reaches the smaller of
MAX_REMOVED_COORD_SIZE (1000) and
FLAGS_cluster_membership_retained_removed_coords.
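
The bounded list described above could be sketched roughly as follows.
This is an illustration only, not the code from the commit: the class
name RemovedCoordList, the helper EffectiveCapacity, the use of
std::string for backend ids, and the int flag type are all assumptions
made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_set>

// Upper bound named in the commit message (MAX_REMOVED_COORD_SIZE).
constexpr std::size_t kMaxRemovedCoordSize = 1000;

// Hypothetical helper: capacity is the smaller of the hard cap and the
// flag value. Treating non-positive flag values as 0 is an assumption.
inline std::size_t EffectiveCapacity(int retained_removed_coords_flag) {
  if (retained_removed_coords_flag <= 0) return 0;
  return std::min(kMaxRemovedCoordSize,
                  static_cast<std::size_t>(retained_removed_coords_flag));
}

// Hypothetical bounded list: no duplicates, oldest entry evicted first.
class RemovedCoordList {
 public:
  explicit RemovedCoordList(std::size_t capacity) : capacity_(capacity) {}

  // Records a removed coordinator. Returns false if nothing was added
  // (duplicate id, or capacity is 0).
  bool Add(const std::string& backend_id) {
    if (capacity_ == 0) return false;
    if (ids_.count(backend_id) > 0) return false;  // avoid duplicates
    if (order_.size() >= capacity_) {
      // FIFO eviction: drop the entry that was added first.
      ids_.erase(order_.front());
      order_.pop_front();
    }
    order_.push_back(backend_id);
    ids_.insert(backend_id);
    return true;
  }

  bool Contains(const std::string& backend_id) const {
    return ids_.count(backend_id) > 0;
  }

  std::size_t size() const { return order_.size(); }

 private:
  std::size_t capacity_;
  std::deque<std::string> order_;        // insertion order for eviction
  std::unordered_set<std::string> ids_;  // fast membership and dedup check
};
```

With a capacity of 2, adding "a", "b", and then "c" evicts "a", while
adding "a" a second time before the eviction is a no-op.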

It's possible that a coordinator marked as removed comes back
with the same backend id. In that case, admissiond will see it in
current_backends and won't need to check the removed list. Even
if a coordinator briefly flaps and a request is rejected, that is
not critical because the coordinator can retry. So, to keep the
design simple and safe, we keep the removed coord entry as-is.
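
The lookup order described above (current_backends first, then the
removed list) can be sketched as below. The enum, the function name
CheckCoordinator, and the set-of-strings representation are
hypothetical and chosen only for illustration; the real admissiond
code is structured differently.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical outcome of checking a queued query's coordinator.
enum class DequeueDecision {
  kProceed,                   // coordinator is a current cluster member
  kRejectCoordinatorRemoved,  // corresponds to REASON_COORDINATOR_REMOVED
  kKeepQueued                 // not seen yet; wait until the queue timeout
};

// Checks current membership before the removed list, so a coordinator
// that came back with the same backend id is not rejected because of a
// stale removed entry.
inline DequeueDecision CheckCoordinator(
    const std::string& coord_id,
    const std::unordered_set<std::string>& current_backends,
    const std::unordered_set<std::string>& removed_coords) {
  if (current_backends.count(coord_id) > 0) return DequeueDecision::kProceed;
  if (removed_coords.count(coord_id) > 0) {
    return DequeueDecision::kRejectCoordinatorRemoved;
  }
  // Membership may simply not have propagated via the statestore yet.
  return DequeueDecision::kKeepQueued;
}
```

Because the membership check comes first, the stale removed entry never
needs to be cleaned up when a coordinator returns.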

Added a parameter is_admissiond to the ClusterMembershipMgr
constructor to indicate whether it is running within the admissiond.

Tests:
Passed exhaustive tests.
Added unit tests to verify the eviction logic and the duplicate
case.
Added regression test test_coord_not_registered_in_ac.

Change-Id: I1e0f270299f8c20975d7895c17f4e2791c3360e0
Reviewed-on: http://gerrit.cloudera.org:8080/23094
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> admissiond fails to admit queued queries if coordinator's membership id 
> changes
> -------------------------------------------------------------------------------
>
>                 Key: IMPALA-12057
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12057
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Abhishek Rawat
>            Priority: Critical
>
> If a coordinator's subscription id changes (due to a restart or a
> reconnection with statestored), admissiond has no way of knowing whether
> the coordinator was only briefly disconnected, has rejoined the cluster,
> and still has the query state, or whether it was restarted and knows
> nothing about the queued query.
> Ideally, in such cases admissiond should learn from the coordinator and
> statestored that the queued queries are still valid and that the
> subscription id has changed, so that the admission controller can submit
> the queued queries.
> Until we support that, we should at least fail these queries immediately.
> The current behavior is that the admission controller goes into an
> infinite loop waiting on these queued queries:
> {code:java}
> I0411 13:52:22.694419    67 admission-controller.cc:2206] Could not dequeue 
> query id=c748095c589ccfb6:3819937100000000 reason: Coordinator not registered 
> with the statestore.
> I0411 13:52:22.795398    67 admission-controller.cc:2206] Could not dequeue 
> query id=c748095c589ccfb6:3819937100000000 reason: Coordinator not registered 
> with the statestore.
> ....
> I0411 15:14:11.063143    67 admission-controller.cc:2206] Could not dequeue 
> query id=c748095c589ccfb6:3819937100000000 reason: Coordinator not registered 
> with the statestore.
> I0411 15:14:11.164698    67 admission-controller.cc:2206] Could not dequeue 
> query id=c748095c589ccfb6:3819937100000000 reason: Coordinator not registered 
> with the statestore. {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

