Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21541 )

Change subject: IMPALA-12906: Incorporate scan range information into the tuple 
cache key
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/exec/tuple-cache-node.h
File be/src/exec/tuple-cache-node.h:

http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/exec/tuple-cache-node.h@21
PS4, Line 21: #include <vector>
> Why include vector in this header?
Good point, that must be leftover from earlier iterations. Removed.


http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/scheduling/scheduler.cc@600
PS4, Line 600:   bool operator<(InstanceAssignment& other) const { return 
weight > other.weight; }
> Make this deterministic for equal weight case by comparing instance_idx.
Done


http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/scheduling/scheduler.cc@816
PS4, Line 816:             return ScanRangeWeight(a) > ScanRangeWeight(b);
> Make this deterministic for equal weight
Done


http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/scheduling/scheduler.cc@821
PS4, Line 821:         pop_heap(instance_heap.begin(), instance_heap.end());
> Kind of hard to follow why pop_heap/push_heap is being used here instead of
This is the pre-IMPALA-9655 code that I brought back. All the instances start 
with zero work assigned and so the initial vector is already sorted, because it 
only cared about the weight and all the weights are the same. make_heap() is 
not needed for a sorted vector.

With the new code to make it deterministic, the comparison now cares about the 
instance idx. This still constructs it in a sorted order, so it doesn't require 
make_heap(). To make this clearer, I added a DCHECK to verify it is a heap via 
std::is_heap() and a comment about why it is already a heap.

This uses a heap, because we only care about finding the instance with the 
least work assigned. We don't need the vector to be fully sorted. A heap is 
O(log(n)) for each push/pop. Fully sorting the vector would be more expensive.


http://gerrit.cloudera.org:8080/#/c/21541/4/fe/src/main/java/org/apache/impala/planner/TupleCacheNode.java
File fe/src/main/java/org/apache/impala/planner/TupleCacheNode.java:

http://gerrit.cloudera.org:8080/#/c/21541/4/fe/src/main/java/org/apache/impala/planner/TupleCacheNode.java@128
PS4, Line 128:     output.append(detailPrefix + "input scan node ids: " +
> If this list will typically be long, it should be split at keyFormatWidth
Right now, this will always be a single entry. When we add support for unions, 
it can have an entry for each branch of the union. It gets reset at an exchange 
boundary, so it doesn't accumulate up the plan tree. In future, replicated 
dimension tables could also cause multiple entries.

Since these are one-three digit integers, I think we will be fine without 
wrapping for now.



--
To view, visit http://gerrit.cloudera.org:8080/21541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe298fff0f644ce931a2aa934ebb98f69aab9d34
Gerrit-Change-Number: 21541
Gerrit-PatchSet: 4
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Tue, 20 Aug 2024 19:38:43 +0000
Gerrit-HasComments: Yes

Reply via email to