[
https://issues.apache.org/jira/browse/CALCITE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated CALCITE-3820:
------------------------------------
Labels: pull-request-available (was: )
> EnumerableDefaults#orderBy should be lazily computed + support enumerator
> re-initialization
> -------------------------------------------------------------------------------------------
>
> Key: CALCITE-3820
> URL: https://issues.apache.org/jira/browse/CALCITE-3820
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.21.0
> Reporter: Ruben Q L
> Assignee: Ruben Q L
> Priority: Major
> Labels: pull-request-available
>
> Currently, {{EnumerableDefaults#orderBy}} (which is the actual implementation
> behind EnumerableSort) looks like this:
> {code:java}
> public static <TSource, TKey> Enumerable<TSource> orderBy(
> Enumerable<TSource> source, Function1<TSource, TKey> keySelector,
> Comparator<TKey> comparator) {
> final Map<TKey, List<TSource>> map = new TreeMap<>(comparator);
> LookupImpl<TKey, TSource> lookup = toLookup_(map, source, keySelector,
> Functions.identitySelector());
> return lookup.valuesEnumerable();
> }
> {code}
> This implementation has the following disadvantages:
> - 1) The TreeMap that performs the actual sorting is eagerly computed. This
> operation can be quite heavy if we have a very big source. The fact that it
> is eagerly computed might lead to bad performance in situations where the
> sort is executed, but it is actually not needed. For example, in a
> NestedLoopJoin, if the outer enumerator returns no results, the inner one is
> not even accessed. If we had a huge orderBy as inner, today it would be
> computed in that scenario, even though it is not actually required. For this
> reason, in terms of performance it seems clearly preferable to delay the sort
> operation as much as possible.
> - 2) The Enumerable / Enumerator returned by {{EnumerableDefaults#orderBy}}
> cannot be re-evaluated. Since the map, and the subsequent LookupImpl, the
> Enumerable relies on is eagerly computed, every called to {{enumerator}} will
> return the same values, *even if the source Enumerable has changed*. This is
> a corner case, but we can see it if, for example, we have a MergeJoin, with
> EnumerableSort (i.e. using {{EnumerableDefaults#orderBy}}) inside a
> RepeatUnion (a.k.a. recursive union). In this situation, the RepeatUnion
> re-evaluates its right child, so that the output of the iteration N-1 is used
> as input for the iteration N. In this scenario, the
> {{EnumerableDefaults#orderBy}} will not work as expected, since it will not
> be actually re-computed (see {{EnumerableMethods#repeatUnion}})
--
This message was sent by Atlassian Jira
(v8.3.4#803005)