Gonzalo Ortiz created CALCITE-6467:
--------------------------------------

             Summary: Performance of RelMdUtil.checkInputForCollationAndLimit 
when using `where col in (large literal set)`
                 Key: CALCITE-6467
                 URL: https://issues.apache.org/jira/browse/CALCITE-6467
             Project: Calcite
          Issue Type: Improvement
    Affects Versions: 1.37.0, 1.36.0, 1.35.0, 1.34.0, 1.33.0, 1.32.0, 1.38.0
            Reporter: Gonzalo Ortiz
         Attachments: MultistageEngineQuickStart_2024_07_12_111558.jfr, 
image-2024-07-12-15-58-33-504.png

Recently we have updated Pinot to use Calcite 1.37. Previously we were using 
1.31.

After the upgrade, we have found issues when executing some queries that 
include large IN clauses. Queries like:
{code:java}
explain plan for
SELECT DestCityName
FROM (
         SELECT DestCityName
         FROM airlineStats
         WHERE DestCityName IN (
                        'a1', 'a2', 'a3', ... 'a300'
             )
         GROUP BY DestCityName
     ) as a
{code}
After some debug, we have found that the issue is in one of our custom rules (
PinotSortExchangeCopyRule) when we call 
`RelMdUtil.checkInputForCollationAndLimit` 
([link|https://github.com/apache/pinot/blob/dacc6d06907c44e83721454f1090e5f00c824f15/pinot-query-planner/src/main/java/org/apache/pinot/calcite/rel/rules/PinotSortExchangeCopyRule.java#L64]).
 
Comparing different Pinot versions, I've found out that the change that is 
causing problems in our scenario is #CALCITE-5036 (see
 
[https://github.com/apache/calcite/pull/2743)|https://github.com/apache/calcite/pull/2743/files).].
 Specifically, it looks like the problem appears when `RelMdPredicates` tries 
to simplify the expression, which at the time is basically an `OR(DestCityName 
= a1, DestCityName = a2, ...)`. `RexSimplify.simplifyOrTerms` negates previous 
terms in order to apply possible optimizations, but in cases like this where we 
have hundreds of literals, that is very expensive. Going more in detail, it 
looks like most of the time is invested in creating new ranges:
!image-2024-07-12-15-58-33-504.png!
 
You may find more insights in the attached JFR file
[^MultistageEngineQuickStart_2024_07_12_111558.jfr][^MultistageEngineQuickStart_2024_07_12_111558.jfr]
 
I've tried to reproduce the problem with `sqline` but I wasn't able to do so. 
As far as I can see in the code, RelMdUtil.checkInputForCollationAndLimit is 
only called in SortJoinCopyRule, SortJoinTransposeRule and 
SortUnionTransposeRule. I've tried to create a test or JMH benchmark in Calcite 
to try to reproduce the issue, but I don't know codebase well enough.
 
I don't consider myself an expert on Apache Calcite and I know we are not using 
Calcite in the most standard way (we are slowly migrating from our own engine 
to Calcite), but I'm pretty confident this issue may also affect other Calcite 
usages. At least in the trace I cannot see anything Pinot specific.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to