[ 
https://issues.apache.org/jira/browse/CALCITE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruben Q L updated CALCITE-3820:
-------------------------------
    Fix Version/s: 1.23.0

> EnumerableDefaults#orderBy should be lazily computed + support enumerator 
> re-initialization
> -------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-3820
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3820
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.21.0
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.23.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, {{EnumerableDefaults#orderBy}} (which is the actual implementation 
> behind EnumerableSort) looks like this:
> {code:java}
> public static <TSource, TKey> Enumerable<TSource> orderBy(
>     Enumerable<TSource> source, Function1<TSource, TKey> keySelector,
>     Comparator<TKey> comparator) {
>   final Map<TKey, List<TSource>> map = new TreeMap<>(comparator);
>   LookupImpl<TKey, TSource> lookup = toLookup_(map, source, keySelector,
>       Functions.identitySelector());
>   return lookup.valuesEnumerable();
> }
> {code}
> This implementation has the following disadvantages:
> - 1) The TreeMap that performs the actual sorting is eagerly computed. This 
> operation can be quite heavy if we have a very big source. The fact that it 
> is eagerly computed might lead to bad performance in situations where the 
> sort is executed, but it is actually not needed. For example, in a 
> NestedLoopJoin, if the outer enumerator returns no results, the inner one is 
> not even accessed. If we had a huge orderBy as inner, today it would be 
> computed in that scenario, even though it is not actually required. For this 
> reason, in terms of performance it seems clearly preferable to delay the sort 
> operation as much as possible.
> - 2) The Enumerable / Enumerator returned by {{EnumerableDefaults#orderBy}} 
> cannot be re-evaluated. Since the map and the subsequent {{LookupImpl}} that 
> the Enumerable relies on are eagerly computed, every called to {{enumerator}} 
> will return the same values, *even if the source Enumerable has changed*. 
> This is a corner case, but we can see it if, for example, we have a 
> MergeJoin, with EnumerableSort (i.e. using {{EnumerableDefaults#orderBy}}) 
> inside a RepeatUnion (a.k.a. recursive union). In this situation, the 
> RepeatUnion re-evaluates its right child, so that the output of the iteration 
> N-1 is used as input for the iteration N. In this scenario, the 
> {{EnumerableDefaults#orderBy}} will not work as expected, since it will not 
> be actually re-computed (see {{EnumerableMethods#repeatUnion}})



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to