[
https://issues.apache.org/jira/browse/CALCITE-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028519#comment-18028519
]
Viggo Chen commented on CALCITE-7213:
-------------------------------------
[~julianhyde] I completely agree with the core of your argument: converting all
recursive algorithms in Calcite to be iterative is a massive undertaking and
not something to be taken lightly. As you rightly pointed out, this isn't an
isolated problem in one specific method. Your prediction is spot on. When I
reduced the number of `OR` conditions from 5000 to 2000, the initial
`StackOverflowError` in `performUnconditionalRewrites` disappeared, but another
one surfaced in a different part of the validation process, this time within
`SqlValidatorImpl$Expander` and `SqlShuttle`.
New Stack Trace (with 2000 `OR` conditions):
{code:java}
java.lang.StackOverflowError
at
org.apache.calcite.sql.util.SqlShuttle$CallCopyingArgHandler.<init>(SqlShuttle.java:110)
at
org.apache.calcite.sql.validate.SqlValidatorImpl$Expander.visitScoped(SqlValidatorImpl.java:7257)
at
org.apache.calcite.sql.validate.SqlScopedShuttle.visit(SqlScopedShuttle.java:54)
... (and so on) {code}
I am not expecting a complete, immediate overhaul of all recursive logic.
However, this brings us to a critical question from a user's perspective: How
can Calcite users protect their production systems from crashing today? Do you
have any suggestion?
> StackOverflowError in SqlValidatorImpl with long OR/AND predicate chains
> ------------------------------------------------------------------------
>
> Key: CALCITE-7213
> URL: https://issues.apache.org/jira/browse/CALCITE-7213
> Project: Calcite
> Issue Type: Bug
> Reporter: Viggo Chen
> Priority: Major
>
> h1. Problem Description:
> Apache Calcite's SQL validation stage is vulnerable to
> `java.lang.StackOverflowError` when processing queries that contain a large
> number of chained logical operators (`AND`/`OR`). This occurs because the
> parser generates a deeply nested, unbalanced Abstract Syntax Tree (AST) for
> these predicates. Subsequent recursive traversal during the validation phase
> exhausts the JVM's call stack, causing the process to fail.
> This issue prevents Calcite from handling complex, often machine-generated,
> SQL queries, which are prevalent in modern data systems.
> Steps to Reproduce:
> The issue can be reliably reproduced with a test case that generates a long
> chain of `OR` conditions. The following Java code creates a query with 5000
> `OR` predicates, which is sufficient to trigger the stack overflow during the
> validation step.
> {code:java}
> // Test case to generate the problematic SQL
> final String whereClause = java.util.stream.IntStream.range(0, 5000)
> .mapToObj(i -> "sal = " + i)
> .collect(java.util.stream.Collectors.joining(" OR "));
> // This will throw StackOverflowError during the validation phase
> sql("SELECT * FROM emp WHERE " + whereClause).ok(); {code}
> Generated SQL:
> {code:java}
> SELECT * FROM emp WHERE sal = 0 OR sal = 1 OR sal = 2 OR ... OR sal = 4999
> {code}
> h1. Stack Trace Analysis:
> A typical stack trace points to a deep recursion within the
> `SqlValidatorImpl` component, demonstrating that the failure happens during
> the validation and semantic analysis of the expression tree.
> {code:java}
> java.lang.StackOverflowError
> at org.apache.calcite.sql.SqlBasicCall.getKind(SqlBasicCall.java:83)
> at
> org.apache.calcite.sql.validate.SqlValidatorImpl.performUnconditionalRewrites(SqlValidatorImpl.java:1458)
> at
> org.apache.calcite.sql.validate.SqlValidatorImpl.performUnconditionalRewrites(SqlValidatorImpl.java:1473)
> at
> org.apache.calcite.sql.validate.SqlValidatorImpl.performUnconditionalRewrites(SqlValidatorImpl.java:1473)
> // ... repeats thousands of times {code}
> h1. Impact on System Stability:
> While a query with thousands of chained predicates may be considered
> suboptimal, the framework's response should be a graceful error (e.g., a
> validation exception), not a catastrophic failure. A
> `java.lang.StackOverflowError` is a critical, non-recoverable error that can
> terminate the executing thread and potentially bring down the entire service.
> This transforms a query-level issue into a severe system stability and
> reliability problem.
>
> h1. Suggested Solutions:
> I propose a two-part approach: an immediate defensive measure to improve
> stability, and long-term solutions to fundamentally fix the issue.
> h2. Part 1: Immediate Defensive Measure (Mitigation)
> Introduce a configurable limit on the maximum depth of the AST. This acts as
> a "safety fuse" to prevent stack overflows.
> - Mechanism: After parsing, or at the very beginning of the validation
> phase, perform a quick, non-recursive depth check of the `SqlNode` tree.
> - Configuration: This depth limit should be configurable (e.g., via
> `SqlParser.Config` or `SqlValidator.Config`). The default value should be set
> to a safe but reasonable number (e.g., 1000).
> Action: If the AST depth exceeds the configured threshold, Calcite should
> immediately throw a standard `SqlParseException` or `SqlValidatorException`
> with a clear error message, such as "Query exceeds maximum expression depth
> limit of [N]".
> This provides users with an immediate way to protect their systems from
> problematic queries without waiting for a full architectural fix.
> h2. Part 2: Fundamental Fixes (Long-Term Solutions)
> To properly support these complex queries, the underlying cause of the deep
> recursion must be addressed.
> 1. Adopt Iterative Traversal (Recommended): The most robust solution is to
> refactor the recursive algorithms within `SqlValidatorImpl` and other core
> components to use an explicit, heap-allocated stack (e.g.,
> `java.util.Deque`). This converts the recursive traversal into an iterative
> one, which is not constrained by the call stack depth and can handle
> expression trees of any complexity. This holistically solves this class of
> problems.
> 2. Balanced Tree Construction: An alternative is to modify the parser's
> behavior. Instead of generating a left-deep tree for chained operators like
> `a OP b OP c OP d`, it could produce a balanced tree: `OP(OP(a, b), OP(c,
> d))`. A balanced tree's depth grows logarithmically (`log2(N)`) with the
> number of operators, which would naturally prevent stack overflows for even a
> very large number of predicates. This could be a configurable parser feature.
> By implementing the defensive measure from Part 1, we can provide immediate
> stability. Following up with one of the fundamental fixes from Part 2 will
> ensure Calcite's long-term resilience and scalability.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)