LubyRuffy opened a new issue, #63824: URL: https://github.com/apache/doris/issues/63824
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

apache/doris:fe-4.1.1 / apache/doris:be-4.1.1

Runtime version reported by FE/BE:

```text
doris-4.1.1-rc01-b10073ad9ca
```

Official source tag/commit matched from Docker BE log:

```text
4.1.1 b10073ad9ca17cd5685c4dd3b3ef650f256376d0
```

Test environment:

```text
Host architecture: arm64
Docker client: 28.5.2
Docker server: 29.5.2
FE image: apache/doris:fe-4.1.1, linux/arm64, sha256:318ab41551d884ded601366193d6115ffdf6471e78e28c572c3dfcfa99d2255e
BE image: apache/doris:be-4.1.1, linux/arm64, sha256:4905607a194641fb47284b616836766aca283ae85575491c799351a41deec60d
```

### What's Wrong?

A minimal query using nested lambda expressions crashes Doris BE with SIGSEGV.

The query does not require any physical table or inserted data. It only builds a one-element array with `array_agg('a')`, then evaluates `array_map(x -> array_count(y -> y = x, ids), ids)`.

Minimal crashing query:

```sql
WITH base AS (
  SELECT array_agg('a') AS ids
)
SELECT array_map(
  x -> array_count(y -> y = x, ids),
  ids
) AS result
FROM base;
```

Expected result:

```text
[1]
```

Actual client error:

```text
ERROR 1105 (HY000) at line 1: RpcException, msg: send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, host: 172.30.82.3
```

The BE process exits.
`docker ps -a` shows:

```text
doris-minrepro-be Exited (0) 9 seconds ago apache/doris:be-4.1.1
```

BE log shows the failing query and SIGSEGV:

```text
*** Query id: f944e19b5ee14d86-bfd49529a73f9a6a ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1779956843 (unix time) try "date -d @1779956843" if you are using GNU date ***
*** Current BE git commitID: b10073ad9ca ***
*** SIGSEGV address not mapped to object (@0x0) received by PID 762 (TID 1215 OR 0xfffde7f865c0) from PID 0; stack trace: ***
```

Relevant stack frames:

```text
doris::signal::(anonymous namespace)::FailureSignalHandler
doris::is_column_const
doris::PreparedFunctionImpl::default_implementation_for_constant_arguments
doris::VectorizedFnCall::_do_execute
doris::VLambdaFunctionExpr::execute_column
doris::ArrayMapFunction::execute
doris::VectorizedFnCall::_do_execute
doris::VLambdaFunctionExpr::execute_column
doris::ArrayMapFunction::execute
doris::PipelineTask::execute
```

This is not only a crash. Nested lambda binding also appears semantically wrong even when the query does not crash:

```sql
SELECT array_map(x -> array_count(y -> y = x, ['a']), ['b']) AS should_be_zero;
SELECT array_map(x -> array_count(y -> y = x, ['a']), ['a']) AS should_be_one;
```

Expected:

```text
should_be_zero
[0]
should_be_one
[1]
```

Actual on 4.1.1:

```text
should_be_zero
[1]
should_be_one
[1]
```

So the root issue seems to be nested lambda scoping/capture, and the `array_agg` version turns the same scoping bug into an invalid column access and BE crash.

### What You Expected?

The minimal crashing query should return `[1]` and should never crash BE.

Nested lambdas should resolve variables according to lexical lambda scope. For example:

```sql
SELECT array_map(x -> array_count(y -> y = x, ['a']), ['b']);
```

should return `[0]`, because the outer lambda variable `x` is `'b'`, while the inner array contains only `'a'`.

BE should also fail safely with a query error if an expression binding is invalid.
A SQL expression should not be able to terminate the BE process.

### How to Reproduce?

Run a minimal FE/BE Docker cluster:

```bash
docker rm -f doris-minrepro-fe doris-minrepro-be >/dev/null 2>&1 || true
docker network rm doris-minrepro >/dev/null 2>&1 || true
docker network create --subnet 172.30.82.0/24 doris-minrepro

docker run -d \
  --name doris-minrepro-fe \
  --network doris-minrepro \
  --ip 172.30.82.2 \
  -p 39030:9030 \
  -p 38030:8030 \
  -e FE_SERVERS=fe1:172.30.82.2:9010 \
  -e FE_ID=1 \
  apache/doris:fe-4.1.1

for i in $(seq 1 90); do
  if docker exec doris-minrepro-fe mysql -uroot -h127.0.0.1 -P9030 -e 'SHOW FRONTENDS'; then
    break
  fi
  sleep 2
done

docker run -d \
  --name doris-minrepro-be \
  --network doris-minrepro \
  --ip 172.30.82.3 \
  -p 38040:8040 \
  -e FE_MASTER_IP=172.30.82.2 \
  -e BE_IP=172.30.82.3 \
  -e BE_PORT=9050 \
  apache/doris:be-4.1.1

for i in $(seq 1 120); do
  docker exec doris-minrepro-fe mysql -uroot -h127.0.0.1 -P9030 -e 'SHOW BACKENDS' > /tmp/doris-minrepro-backends.out 2>&1 || true
  if grep -q 'true' /tmp/doris-minrepro-backends.out; then
    cat /tmp/doris-minrepro-backends.out
    break
  fi
  sleep 2
done
```

Run normal control queries first:

```bash
docker exec -i doris-minrepro-fe mysql -uroot -h127.0.0.1 -P9030 <<'SQL'
WITH base AS (SELECT array_agg('a') AS ids)
SELECT ids FROM base;

WITH base AS (SELECT array_agg('a') AS ids)
SELECT array_count(y -> y = 'a', ids) AS result FROM base;

WITH base AS (SELECT array_agg('a') AS ids)
SELECT array_map(x -> x, ids) AS result FROM base;

SELECT array_map(x -> array_count(y -> y = x, ['a']), ['a']) AS expected_result;
SQL
```

Expected and observed output:

```text
ids
["a"]
result
1
result
["a"]
expected_result
[1]
```

Run the crashing query:

```bash
docker exec -i doris-minrepro-fe mysql -uroot -h127.0.0.1 -P9030 <<'SQL'
WITH base AS (
  SELECT array_agg('a') AS ids
)
SELECT array_map(
  x -> array_count(y -> y = x, ids),
  ids
) AS result
FROM base;
SQL
```

Observed:

```text
ERROR 1105 (HY000) at line 1: RpcException, msg: send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, host: 172.30.82.3
```

Collect evidence:

```bash
docker ps -a --filter name=doris-minrepro-be --format '{{.Names}}\t{{.Status}}\t{{.Image}}'

docker logs doris-minrepro-be 2>&1 | \
  grep -En 'Query id|SIGSEGV|FailureSignalHandler|is_column_const|VectorizedFnCall|VLambdaFunctionExpr|ArrayMapFunction|PipelineTask'

rm -rf /tmp/doris-minrepro-log
docker cp doris-minrepro-be:/opt/apache-doris/be/log /tmp/doris-minrepro-log
```

Optional physical-table reproduction with one row:

```sql
CREATE DATABASE IF NOT EXISTS repro;
USE repro;

DROP TABLE IF EXISTS t_array_lambda;
CREATE TABLE t_array_lambda (
  es_id VARCHAR(8),
  fid VARCHAR(8)
)
DISTRIBUTED BY HASH(es_id) BUCKETS 1
PROPERTIES (
  "replication_num" = "1"
);

INSERT INTO t_array_lambda VALUES ('g', 'a');

WITH base AS (
  SELECT es_id, array_agg(fid) AS ids
  FROM t_array_lambda
  GROUP BY es_id
)
SELECT array_map(
  x -> array_count(y -> y = x, ids),
  ids
) AS result
FROM base;
```

### Anything Else?

### Source-level investigation

The Docker BE log reports commit `b10073ad9ca`, which matches the official `4.1.1` tag target:

```bash
git ls-remote https://github.com/apache/doris.git | grep 'refs/tags/4.1.1'
# 73552015e7587a04f857cb0257fc6e178958d389 refs/tags/4.1.1
# b10073ad9ca17cd5685c4dd3b3ef650f256376d0 refs/tags/4.1.1^{}
```

From source inspection, the likely root cause is in `ArrayMapFunction`'s handling of `VColumnRef` gaps for nested lambdas.
`ArrayMapFunction::execute` collects slot refs from the lambda body, computes `gap`, then recursively applies that gap to all `VColumnRef` nodes under the body:

https://github.com/apache/doris/blob/b10073ad9ca17cd5685c4dd3b3ef650f256376d0/be/src/exprs/lambda_function/varray_map_function.cpp#L82-L93

```cpp
_collect_slot_ref_column_id(children[0], output_slot_ref_indexs);

if (!output_slot_ref_indexs.empty()) {
    auto max_id = std::max_element(output_slot_ref_indexs.begin(),
                                   output_slot_ref_indexs.end());
    gap = *max_id + 1;
    _set_column_ref_column_id(children[0], gap);
}
```

The recursive setter does not appear to stop at nested lambda boundaries:

https://github.com/apache/doris/blob/b10073ad9ca17cd5685c4dd3b3ef650f256376d0/be/src/exprs/lambda_function/varray_map_function.cpp#L339-L347

```cpp
void _set_column_ref_column_id(VExprSPtr expr, int gap) const {
    for (const auto& child : expr->children()) {
        if (child->is_column_ref()) {
            auto* ref = static_cast<VColumnRef*>(child.get());
            ref->set_gap(gap);
        } else {
            _set_column_ref_column_id(child, gap);
        }
    }
}
```

`VColumnRef::set_gap` only sets the gap once, so an inner lambda variable can inherit the outer lambda gap and cannot be corrected by the inner `ArrayMapFunction` execution:

https://github.com/apache/doris/blob/b10073ad9ca17cd5685c4dd3b3ef650f256376d0/be/src/exprs/vcolumn_ref.h#L78-L82

At execution time, `VColumnRef` reads the column by `column_id + gap`:

https://github.com/apache/doris/blob/b10073ad9ca17cd5685c4dd3b3ef650f256376d0/be/src/exprs/vcolumn_ref.h#L59-L69

The const overload of `Block::get_by_position` has no runtime boundary check:

https://github.com/apache/doris/blob/b10073ad9ca17cd5685c4dd3b3ef650f256376d0/be/src/core/block/block.h#L127-L132

The wrong/out-of-range `ColumnPtr` is then dereferenced in the constant-argument path:

- `PreparedFunctionImpl::default_implementation_for_constant_arguments`: https://github.com/apache/doris/blob/b10073ad9ca17cd5685c4dd3b3ef650f256376d0/be/src/exprs/function/function.cpp#L122-L141
- `VectorizedUtils::all_arguments_are_constant`: https://github.com/apache/doris/blob/b10073ad9ca17cd5685c4dd3b3ef650f256376d0/be/src/exec/common/util.hpp#L156-L163
- `is_column_const`: https://github.com/apache/doris/blob/b10073ad9ca17cd5685c4dd3b3ef650f256376d0/be/src/core/column/column.cpp#L80-L82

This matches the observed crash stack.

### Suggested fix direction

I think the fix should be in lambda variable scoping, not just in the crash site.

1. Make `_set_column_ref_column_id` and `_collect_slot_ref_column_id` scope-aware.
   - When traversing the current lambda body, do not recurse into nested `LAMBDA_FUNCTION_EXPR` / nested lambda bodies as if they belonged to the same lambda scope.
   - Inner lambda parameters should be assigned/resolved by the inner lambda execution context only.
2. Avoid storing mutable execution-specific gap state directly on shared `VColumnRef` nodes when nested lambda expressions can reuse the same expression tree.
   - Passing the gap through an execution context, or cloning/rebinding the relevant lambda body per scope, would be safer than mutating `VColumnRef::_gap` globally.
3. Add a runtime guard in `VColumnRef::execute_column` or use `safe_get_by_position` before dereferencing.
   - This would turn the current process crash into a query error.
   - However, this is only a safety net. It would not fix the wrong-result case shown above.
4. Add regression tests for both cases:

   ```sql
   -- Should not crash; should return [1]
   WITH base AS (
     SELECT array_agg('a') AS ids
   )
   SELECT array_map(
     x -> array_count(y -> y = x, ids),
     ids
   ) AS result
   FROM base;

   -- Should return [0], not [1]
   SELECT array_map(x -> array_count(y -> y = x, ['a']), ['b']) AS should_be_zero;
   ```

### Production context

The production query that first exposed the bug was an account-behavior aggregation using this pattern:

```sql
WITH base AS (
  SELECT es_id, ARRAY_AGG(fid) AS ids
  FROM query_log
  WHERE created_at >= '2026-05-18 15:26:00'
    AND created_at < '2026-05-18 15:36:00'
  GROUP BY es_id
)
SELECT
  COUNT(*) AS groups_count,
  SUM(ARRAY_SIZE(array_map(
    x -> array_count(y -> y = x, ids),
    array_distinct(ids)
  ))) AS bucket_count
FROM base;
```

The minimal reproduction above removes the production table, timestamp filter, grouping cardinality, and data volume from the equation. A single `array_agg('a')` is enough to reproduce the BE crash.

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
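
### Illustrative sketch (not Doris code)

The gap propagation described under "Source-level investigation" can be illustrated with a small toy model. Everything here is hypothetical: `Node`, `owner`, `lambda_id`, `set_gap_naive`, `set_gap_scoped`, and the gap values 1 and 3 are made up for illustration and are not Doris classes, fields, or real block offsets. The model only assumes what this report describes: a set-once gap on column refs, and a setter that recurses through nested lambdas. The `owner` tag is one possible way to make the traversal scope-aware, not necessarily how a real fix would look.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy model only: hypothetical stand-ins, not the real Doris expression classes.
enum class Kind { ColumnRef, Lambda, Call };

struct Node;
using NodePtr = std::shared_ptr<Node>;

struct Node {
    Kind kind = Kind::Call;
    int column_id = 0;   // ColumnRef: position of the lambda argument
    int owner = -1;      // ColumnRef: id of the lambda declaring it (assumed metadata)
    int lambda_id = -1;  // Lambda: id of this lambda
    int gap = 0;
    bool gap_set = false;
    std::vector<NodePtr> children;

    // Models the set-once behaviour this report attributes to VColumnRef::set_gap.
    void set_gap(int g) {
        if (!gap_set) {
            gap = g;
            gap_set = true;
        }
    }
    int index() const { return column_id + gap; }
};

NodePtr make_ref(int column_id, int owner) {
    auto n = std::make_shared<Node>();
    n->kind = Kind::ColumnRef;
    n->column_id = column_id;
    n->owner = owner;
    return n;
}

NodePtr make_node(Kind kind, std::vector<NodePtr> children, int lambda_id = -1) {
    auto n = std::make_shared<Node>();
    n->kind = kind;
    n->children = std::move(children);
    n->lambda_id = lambda_id;
    return n;
}

// Same shape as the quoted setter: every column ref below expr gets the gap.
void set_gap_naive(const NodePtr& expr, int gap) {
    for (const auto& child : expr->children) {
        if (child->kind == Kind::ColumnRef) {
            child->set_gap(gap);
        } else {
            set_gap_naive(child, gap);
        }
    }
}

// Scope-aware variant: only refs declared by lambda_id receive this gap.
void set_gap_scoped(const NodePtr& expr, int lambda_id, int gap) {
    for (const auto& child : expr->children) {
        if (child->kind == Kind::ColumnRef) {
            if (child->owner == lambda_id) {
                child->set_gap(gap);
            }
        } else {
            set_gap_scoped(child, lambda_id, gap);
        }
    }
}

struct Tree {
    NodePtr outer, inner, x, y;
};

// Models x -> array_count(y -> y = x, ...): outer lambda 0 declares x, inner lambda 1 declares y.
Tree build_tree() {
    Tree t;
    t.x = make_ref(0, 0);
    t.y = make_ref(0, 1);
    auto eq = make_node(Kind::Call, {t.y, t.x});
    t.inner = make_node(Kind::Lambda, {eq}, 1);
    auto count = make_node(Kind::Call, {t.inner});
    t.outer = make_node(Kind::Lambda, {count}, 0);
    return t;
}
```

In this toy model, applying `set_gap_naive(outer, 1)` followed by `set_gap_naive(inner, 3)` leaves both `x` and `y` at index 1, because the outer pass reaches `y` first and the set-once gap ignores the inner pass. If both variables read the same column, `y = x` is always true, which would be consistent with the `[1]` wrong result reported above, although the real block layout may differ. With `set_gap_scoped(outer, 0, 1)` and `set_gap_scoped(inner, 1, 3)`, `x` resolves to index 1 and `y` to index 3.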
