[jira] [Updated] (CALCITE-3820) EnumerableDefaults#orderBy should be lazily computed + support enumerator re-initialization

Ruben Q L (Jira) Tue, 25 Feb 2020 07:10:34 -0800


     [ 
https://issues.apache.org/jira/browse/CALCITE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ruben Q L updated CALCITE-3820:
-------------------------------
    Description: 
Currently, {{EnumerableDefaults#orderBy}} (which is the actual implementation 
behind EnumerableSort) looks like this:
{code:java}
public static <TSource, TKey> Enumerable<TSource> orderBy(
    Enumerable<TSource> source, Function1<TSource, TKey> keySelector,
    Comparator<TKey> comparator) {
  final Map<TKey, List<TSource>> map = new TreeMap<>(comparator);
  LookupImpl<TKey, TSource> lookup = toLookup_(map, source, keySelector,
      Functions.identitySelector());
  return lookup.valuesEnumerable();
}
{code}
This implementation has the following disadvantages:
- 1) The TreeMap that performs the actual sorting is eagerly computed. This 
operation can be quite heavy if we have a very big source. The fact that it is 
eagerly computed might lead to bad performance in situations where the sort is 
executed, but it is actually not needed. For example, in a NestedLoopJoin, if 
the outer enumerator returns no results, the inner one is not even accessed. If 
we had a huge orderBy as inner, today it would be computed in that scenario, 
even though it is not actually required. For this reason, in terms of 
performance it seems clearly preferable to delay the sort operation as much as 
possible.
- 2) The Enumerable / Enumerator returned by {{EnumerableDefaults#orderBy}} 
cannot be re-evaluated. Since the map, and the subsequent LookupImpl, the 
Enumerable relies on is eagerly computed, every called to {{enumerator}} will 
return the same values, *even if the source Enumerable has changed*. This is a 
corner case, but we can see it if, for example, we have a MergeJoin, with 
EnumerableSort (i.e. using {{EnumerableDefaults#orderBy}}) inside a RepeatUnion 
(a.k.a. recursive union). In this situation, the RepeatUnion re-evaluates its 
right child, so that the output of the iteration N-1 is used as input for the 
iteration N. In this scenario, the {{EnumerableDefaults#orderBy}} will not work 
as expected, since it will not be actually re-computed (see 
{{EnumerableMethods#repeatUnion}})

  was:
Currently, {{EnumerableDefaults#orderBy}} (which is the actual implementation 
behind EnumerableSort) looks like this:
{code:java}
public static <TSource, TKey> Enumerable<TSource> orderBy(
    Enumerable<TSource> source, Function1<TSource, TKey> keySelector,
    Comparator<TKey> comparator) {
  final Map<TKey, List<TSource>> map = new TreeMap<>(comparator);
  LookupImpl<TKey, TSource> lookup = toLookup_(map, source, keySelector,
      Functions.identitySelector());
  return lookup.valuesEnumerable();
}
{code}
This implementation has the following disadvantages:
- 1) The TreeMap that performs the actual sorting is eagerly computed. This 
operation can be quite heavy if we have a very big source. The fact that it is 
eagerly computed might lead to bad performance in situations where the sort is 
executed, but it is actually not needed. For example, in a NestedLoopJoin, if 
the outer enumerator returns no results, the inner one is not even accessed. If 
we had a huge orderBy as inner, today it would be computed in that scenario, 
even though it is not actually required. For this reason, in terms of 
performance it seems clearly preferable to delay the sort operation as much as 
possible.
- 2) The Enumerable / Enumerator returned by {{EnumerableDefaults#orderBy}} 
cannot be re-evaluated. Since the map, and the subsequent LookupImpl, the 
Enumerable relies on is eagerly computed, every called to {{enumerator}} will 
return the same values, *even if the source Enumerable has changed*. This is a 
corner case, but we can see it if, for example, we have a MergeJoin, with 
EnumerableSort (i.e. using {{EnumerableDefaults#orderBy}}) inside a RepeatUnion 
(a.k.a. recursive union). In this situation, the RepeatUnion re-evaluates its 
right child, so that the output of the iteration N-1 is used as input for the 
iteration N. In this scenario, the {{EnumerableDefaults#orderBy}} will not work 
as expected, since it will not be actually re-computed.


> EnumerableDefaults#orderBy should be lazily computed + support enumerator 
> re-initialization
> -------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-3820
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3820
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.21.0
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>
> Currently, {{EnumerableDefaults#orderBy}} (which is the actual implementation 
> behind EnumerableSort) looks like this:
> {code:java}
> public static <TSource, TKey> Enumerable<TSource> orderBy(
>     Enumerable<TSource> source, Function1<TSource, TKey> keySelector,
>     Comparator<TKey> comparator) {
>   final Map<TKey, List<TSource>> map = new TreeMap<>(comparator);
>   LookupImpl<TKey, TSource> lookup = toLookup_(map, source, keySelector,
>       Functions.identitySelector());
>   return lookup.valuesEnumerable();
> }
> {code}
> This implementation has the following disadvantages:
> - 1) The TreeMap that performs the actual sorting is eagerly computed. This 
> operation can be quite heavy if we have a very big source. The fact that it 
> is eagerly computed might lead to bad performance in situations where the 
> sort is executed, but it is actually not needed. For example, in a 
> NestedLoopJoin, if the outer enumerator returns no results, the inner one is 
> not even accessed. If we had a huge orderBy as inner, today it would be 
> computed in that scenario, even though it is not actually required. For this 
> reason, in terms of performance it seems clearly preferable to delay the sort 
> operation as much as possible.
> - 2) The Enumerable / Enumerator returned by {{EnumerableDefaults#orderBy}} 
> cannot be re-evaluated. Since the map, and the subsequent LookupImpl, the 
> Enumerable relies on is eagerly computed, every called to {{enumerator}} will 
> return the same values, *even if the source Enumerable has changed*. This is 
> a corner case, but we can see it if, for example, we have a MergeJoin, with 
> EnumerableSort (i.e. using {{EnumerableDefaults#orderBy}}) inside a 
> RepeatUnion (a.k.a. recursive union). In this situation, the RepeatUnion 
> re-evaluates its right child, so that the output of the iteration N-1 is used 
> as input for the iteration N. In this scenario, the 
> {{EnumerableDefaults#orderBy}} will not work as expected, since it will not 
> be actually re-computed (see {{EnumerableMethods#repeatUnion}})



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (CALCITE-3820) EnumerableDefaults#orderBy should be lazily computed + support enumerator re-initialization

Reply via email to