[ 
https://issues.apache.org/jira/browse/CALCITE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated CALCITE-3820:
------------------------------------
    Labels: pull-request-available  (was: )

> EnumerableDefaults#orderBy should be lazily computed + support enumerator 
> re-initialization
> -------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-3820
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3820
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.21.0
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, {{EnumerableDefaults#orderBy}} (which is the actual implementation 
> behind EnumerableSort) looks like this:
> {code:java}
> public static <TSource, TKey> Enumerable<TSource> orderBy(
>     Enumerable<TSource> source, Function1<TSource, TKey> keySelector,
>     Comparator<TKey> comparator) {
>   final Map<TKey, List<TSource>> map = new TreeMap<>(comparator);
>   LookupImpl<TKey, TSource> lookup = toLookup_(map, source, keySelector,
>       Functions.identitySelector());
>   return lookup.valuesEnumerable();
> }
> {code}
> This implementation has the following disadvantages:
> - 1) The TreeMap that performs the actual sorting is eagerly computed. This 
> operation can be quite heavy if we have a very big source. The fact that it 
> is eagerly computed might lead to bad performance in situations where the 
> sort is executed, but it is actually not needed. For example, in a 
> NestedLoopJoin, if the outer enumerator returns no results, the inner one is 
> not even accessed. If we had a huge orderBy as inner, today it would be 
> computed in that scenario, even though it is not actually required. For this 
> reason, in terms of performance it seems clearly preferable to delay the sort 
> operation as much as possible.
> - 2) The Enumerable / Enumerator returned by {{EnumerableDefaults#orderBy}} 
> cannot be re-evaluated. Since the map, and the subsequent LookupImpl, the 
> Enumerable relies on is eagerly computed, every called to {{enumerator}} will 
> return the same values, *even if the source Enumerable has changed*. This is 
> a corner case, but we can see it if, for example, we have a MergeJoin, with 
> EnumerableSort (i.e. using {{EnumerableDefaults#orderBy}}) inside a 
> RepeatUnion (a.k.a. recursive union). In this situation, the RepeatUnion 
> re-evaluates its right child, so that the output of the iteration N-1 is used 
> as input for the iteration N. In this scenario, the 
> {{EnumerableDefaults#orderBy}} will not work as expected, since it will not 
> be actually re-computed (see {{EnumerableMethods#repeatUnion}})



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to