[ 
https://issues.apache.org/jira/browse/IMPALA-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061146#comment-18061146
 ] 

ASF subversion and git services commented on IMPALA-14771:
----------------------------------------------------------

Commit 2e00e6c839fcf2d2cd814eac4192480d9fa3d265 in impala's branch 
refs/heads/master from Yida Wu
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2e00e6c83 ]

IMPALA-14771: Fix DCHECK hit due to dangling reference in admission queue

This patch fixes two related issues. First, tests using
admission_control_rpc_compress_threshold_bytes were not applying the
flag correctly at cluster startup, so the compressed path was never
exercised. This is fixed by adding a helper in the tests to properly
inject the flag into impalad arguments.

Second, once compression was correctly enabled in the tests, a
DCHECK was triggered in DequeueLoop when evaluating queued queries
with compressed execution requests. This happened because
SubmitForAdmission() calls ClearDecompressedCache() to free the
decompressed TQueryExecRequest while the query is queued, but
the group states (ScheduleState objects) still held references to
that freed request. When TryDequeue() later evaluated the query,
it accessed these dangling references and hit the DCHECK.

The fix clears group states immediately after clearing the
decompression cache and updates the schedule recompute logic so
group states are rebuilt when the query is dequeued.
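
The ordering described above can be sketched in a few lines of C++. This is
only an illustrative model, not the actual Impala code: ExecRequest,
GroupState, QueueNode, BuildGroupStates() and EvaluateOnDequeue() are
hypothetical stand-ins for TQueryExecRequest, ScheduleState and the admission
queue node, and the "compressed" payload is just a decimal string. It shows
the dependents being dropped before the decompressed copy is freed, and then
rebuilt lazily at dequeue time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the decompressed TQueryExecRequest.
struct ExecRequest {
  int64_t per_host_mem_estimate = 0;
};

// Hypothetical stand-in for ScheduleState: it does not own the request,
// it only points at it, so it must not outlive the decompressed copy.
class GroupState {
 public:
  explicit GroupState(const ExecRequest* request) : request_(request) {}
  int64_t MemEstimate() const {
    assert(request_ != nullptr);
    return request_->per_host_mem_estimate;
  }

 private:
  const ExecRequest* request_;
};

// Hypothetical stand-in for a queued query.
class QueueNode {
 public:
  explicit QueueNode(std::string compressed)
      : compressed_(std::move(compressed)) {}

  // Builds one GroupState per executor group from the decompressed request.
  void BuildGroupStates(int num_groups) {
    const ExecRequest& request = GetOrDecompress();
    group_states_.clear();
    for (int i = 0; i < num_groups; ++i) group_states_.emplace_back(&request);
  }

  // Frees the decompressed copy while the query waits in the queue.
  // Dependents are dropped first so nothing is left pointing at freed memory.
  void ClearDecompressedCache() {
    group_states_.clear();
    cache_.reset();
  }

  // Dequeue-time evaluation: rebuild the group states if they were cleared.
  int64_t EvaluateOnDequeue(int num_groups) {
    if (group_states_.empty()) BuildGroupStates(num_groups);
    int64_t total = 0;
    for (const GroupState& gs : group_states_) total += gs.MemEstimate();
    return total;
  }

  bool HasCache() const { return cache_ != nullptr; }
  std::size_t NumGroupStates() const { return group_states_.size(); }

 private:
  // "Decompresses" on demand; here that is just parsing a decimal string.
  const ExecRequest& GetOrDecompress() {
    if (!cache_) {
      cache_ = std::make_unique<ExecRequest>();
      cache_->per_host_mem_estimate = std::stoll(compressed_);
    }
    return *cache_;
  }

  std::string compressed_;
  std::unique_ptr<ExecRequest> cache_;
  std::vector<GroupState> group_states_;
};
```

In this model, if ClearDecompressedCache() only reset the cache and left
group_states_ untouched, the next EvaluateOnDequeue() would read through
pointers to a freed ExecRequest, which is the same class of bug as the one
fixed here.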

Tests:
Passed test_admission_controller.py exhaustive tests.

Change-Id: I969e4f32b6838d305c317d0a75f17211f75eed57
Reviewed-on: http://gerrit.cloudera.org:8080/24024
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Admissiond crash in DequeueLoop caused by dangling ScheduleState
> ----------------------------------------------------------------
>
>                 Key: IMPALA-14771
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14771
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 5.0.0
>            Reporter: Yida Wu
>            Assignee: Yida Wu
>            Priority: Major
>             Fix For: Impala 5.0.0
>
>
> IMPALA-14661 adds support for compressing admission requests, but the test 
> TestAdmissionControllerWithACService was not correctly applying the start 
> flag matrix, so compression was not actually enabled during testing.
> After fixing the test, we found that admissiond can hit a
> [DCHECK|https://github.com/apache/impala/blob/master/be/src/scheduling/schedule-state.cc#L215C3-L215C50]
> in DequeueLoop due to a dangling ScheduleState after ClearDecompressedCache().
> {code:java}
> #3  0x000000000187b3fe in impala::ScheduleState::GetPerExecutorMemoryEstimate (this=this@entry=0xc03f800) at /impala/Impala/be/src/scheduling/schedule-state.cc:215
> #4  0x000000000187bce5 in impala::ScheduleState::UpdateMemoryRequirements (this=this@entry=0xc03f800, pool_cfg=..., coord_mem_limit_admission=12884901888, executor_mem_limit_admission=12884901888) at /impala/Impala/be/src/scheduling/schedule-state.cc:329
> #5  0x0000000001812328 in impala::AdmissionController::FindGroupToAdmitOrReject (this=this@entry=0x951fc00, membership_snapshot=..., pool_config=..., root_cfg=..., admit_from_queue=admit_from_queue@entry=true, pool_stats=pool_stats@entry=0xc5c1bb0, queue_node=0xc622730, coordinator_resource_limited=@0x7f3991e415fe: false, is_trivial=0x7f3991e415ff) at /impala/Impala/be/src/scheduling/admission-controller.cc:2501
> #6  0x0000000001812cf4 in impala::AdmissionController::TryDequeue (this=this@entry=0x951fc00) at /impala/Impala/be/src/scheduling/admission-controller.cc:2686
> #7  0x000000000181497a in impala::AdmissionController::DequeueLoop (this=0x951fc00) at /impala/Impala/be/src/scheduling/admission-controller.cc:2646
> #8  0x0000000001816f83 in boost::_mfi::mf0<void, impala::AdmissionController>::operator() (p=<optimized out>, this=<optimized out>) at /impala/Impala/toolchain/toolchain-packages-gcc10.4.0/boost-1.74.0-p1/include/boost/bind/mem_fn_template.hpp:49
> {code}
> The reason is that ScheduleState
> [depends on the decompressed TQueryExecRequest|https://github.com/apache/impala/blob/master/be/src/scheduling/admission-controller.cc#L2424].
> When a query is enqueued,
> [ClearDecompressedCache() is called|https://github.com/apache/impala/blob/master/be/src/scheduling/admission-controller.cc#L1793]
> to save memory, which frees the decompressed exec request. However,
> queue_node->group_states still holds ScheduleState objects that reference
> this freed request.
> When dequeuing, these stale objects are reused and cause a DCHECK.


