Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/21541 )
Change subject: IMPALA-12906: Incorporate scan range information into the tuple cache key ...................................................................... Patch Set 4: (5 comments) http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/exec/tuple-cache-node.h File be/src/exec/tuple-cache-node.h: http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/exec/tuple-cache-node.h@21 PS4, Line 21: #include <vector> > Why include vector in this header? Good point, that must be leftover from earlier iterations. Removed. http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/scheduling/scheduler.cc@600 PS4, Line 600: bool operator<(InstanceAssignment& other) const { return weight > other.weight; } > Make this deterministic for equal weight case by comparing instance_idx. Done http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/scheduling/scheduler.cc@816 PS4, Line 816: return ScanRangeWeight(a) > ScanRangeWeight(b); > Make this deterministic for equal weight Done http://gerrit.cloudera.org:8080/#/c/21541/4/be/src/scheduling/scheduler.cc@821 PS4, Line 821: pop_heap(instance_heap.begin(), instance_heap.end()); > Kind of hard to follow why pop_heap/push_heap is being used here instead of This is the pre-IMPALA-9655 code that I brought back. All the instances start with zero work assigned and so the initial vector is already sorted, because it only cared about the weight and all the weights are the same. make_heap() is not needed for a sorted vector. With the new code to make it deterministic, the comparison now cares about the instance idx. This still constructs it in a sorted order, so it doesn't require make_heap(). To make this clearer, I added a DCHECK to verify it is a heap via std::is_heap() and a comment about why it is already a heap. This uses a heap, because we only care about finding the instance with the least work assigned. We don't need the vector to be fully sorted. A heap is O(log(n)) for each push/pop. Fully sorting the vector would be more expensive. http://gerrit.cloudera.org:8080/#/c/21541/4/fe/src/main/java/org/apache/impala/planner/TupleCacheNode.java File fe/src/main/java/org/apache/impala/planner/TupleCacheNode.java: http://gerrit.cloudera.org:8080/#/c/21541/4/fe/src/main/java/org/apache/impala/planner/TupleCacheNode.java@128 PS4, Line 128: output.append(detailPrefix + "input scan node ids: " + > If this list will typically be long, it should be split at keyFormatWidth Right now, this will always be a single entry. When we add support for unions, it can have an entry for each branch of the union. It gets reset at an exchange boundary, so it doesn't accumulate up the plan tree. In future, replicated dimension tables could also cause multiple entries. Since these are one-three digit integers, I think we will be fine without wrapping for now. -- To view, visit http://gerrit.cloudera.org:8080/21541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibe298fff0f644ce931a2aa934ebb98f69aab9d34 Gerrit-Change-Number: 21541 Gerrit-PatchSet: 4 Gerrit-Owner: Joe McDonnell <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Tue, 20 Aug 2024 19:38:43 +0000 Gerrit-HasComments: Yes
