Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/23613 )
Change subject: IMPALA-14092 Part2: Support querying of paimon data table via JNI ...................................................................... Patch Set 9: (53 comments) http://gerrit.cloudera.org:8080/#/c/23613/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/23613/9//COMMIT_MSG@39 PS9, Line 39: 2.x Did the use of static_cast instead of dynamic_cast improve it further? http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.h File be/src/exec/paimon/paimon-jni-row-reader.h: http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.h@49 PS9, Line 49: a arrow batchs nit: "a row denoted by 'row_idx' from an arrow batch" http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.h@54 PS9, Line 54: /// Templates to cast the value of a arrow batch simple primitive to the : /// matching Impala type and writes it into the target slot. nit: Template method that writes a value from 'arraw_array' to 'slot'. 'T' is the type of the slot, 'AT' is the proper subtype of the arrow::Array. http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc File be/src/exec/paimon/paimon-jni-row-reader.cc: http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc@270 PS8, Line 270: > TIMESTAMP WITH LOCAL TIME ZONE have tested locally, the result is compared I see. It's possible to set the local timezone via query option 'TIMEZONE'. We do it in a few tests, you can just "git grep 'set TIMEZONE'" http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc File be/src/exec/paimon/paimon-jni-row-reader.cc: http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc@50 PS9, Line 50: nit: redundant space http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc@188 PS9, Line 188: ( nit: no need for this parenthesis http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc@220 PS9, Line 220: nit: redundant space http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc@233 PS9, Line 233: std::string &tz_name = type.timezone(); : const Timezone * nit: we put */& next to the type (without a space), e.g.: const std::string& tz_name = ... const Timezone* tz = ... http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc File be/src/exec/paimon/paimon-jni-scan-node.cc: http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@54 PS9, Line 54: ScanPrepareTime It measures the time spent in Open(). Shouldn't it be called ScanOpenTime? http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@86 PS9, Line 86: nit: needs +4 indent http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@143 PS9, Line 143: /// update allocated offheap memory reported from Jni. : if (offheap_consumed_bytes > paimon_last_arrow_record_batch_consumed_bytes_) { : OffheapTrackFree(); : OffheapTrackAllocation(offheap_consumed_bytes); : } Any reason why we only update the tracked offheap memory usage when the new arrow record batch consumes more memory? http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@151 PS9, Line 151: .ValueOrDie(); >From the arrow documentation: * This method should only be called if this Result object’s status is OK (i.e. a call to ok() returns true), otherwise this call will abort. http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@153 PS9, Line 153: DCHECK nit: DCHECK_EQ should be used instead which outputs the actual values. http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@163 PS9, Line 163: ( nit: no need for the parenthesis http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.h File be/src/exec/paimon/paimon-jni-scanner.h: http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.h@56 PS9, Line 56: inline nit: inline keywords are redundant http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.cc File be/src/exec/paimon/paimon-jni-scanner.cc: http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.cc@76 PS9, Line 76: nit: needs +4 indent http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.cc@76 PS9, Line 76: long * rows, long *offheap_used nit: long* rows, long* offheap_used http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.cc@77 PS9, Line 77: . nit: ", and the offheap usage in bytes" http://gerrit.cloudera.org:8080/#/c/23613/9/bin/bootstrap_toolchain.py File bin/bootstrap_toolchain.py: http://gerrit.cloudera.org:8080/#/c/23613/9/bin/bootstrap_toolchain.py@480 PS9, Line 480: , "arrow" nit: the packages are in lexicographic order http://gerrit.cloudera.org:8080/#/c/23613/9/bin/impala-config.sh File bin/impala-config.sh: http://gerrit.cloudera.org:8080/#/c/23613/9/bin/impala-config.sh@210 PS9, Line 210: 13 Seems like the newest stable version is 22. Any reason we use version 13? Even Paimon uses Arrow 15 and 18 based on their pom.xml. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/Table.java File fe/src/main/java/org/apache/impala/catalog/Table.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/Table.java@673 PS9, Line 673: nit: indentation is off http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/Table.java@676 PS9, Line 676: e nit: missing space http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalPaimonTable.java File fe/src/main/java/org/apache/impala/catalog/local/LocalPaimonTable.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalPaimonTable.java@49 PS9, Line 49: nit: too much indentation here and below http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalPaimonTable.java@68 PS9, Line 68: addVirtualColumns(ref.getVirtualColumns()); Why do we add virtual columns if they cannot be queried? http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java File fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java@60 PS9, Line 60: import org.apache.impala.thrift.TColumnType; Unused import http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/FePaimonTable.java File fe/src/main/java/org/apache/impala/catalog/paimon/FePaimonTable.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/FePaimonTable.java@45 PS9, Line 45: The imports are unused. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonImpalaTypeUtils.java File fe/src/main/java/org/apache/impala/catalog/paimon/PaimonImpalaTypeUtils.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonImpalaTypeUtils.java@212 PS9, Line 212: for (int i = 0; i < fields.size(); i++) { : builder.field(fields.get(i).getName(), visit(fields.get(i).getType(), visitor)); : } Any reason to change this? The original is more concise and readable. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonStructField.java File fe/src/main/java/org/apache/impala/catalog/paimon/PaimonStructField.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonStructField.java@78 PS9, Line 78: ( nit: no need for the parentheses, same goes for next line http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonTable.java File fe/src/main/java/org/apache/impala/catalog/paimon/PaimonTable.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonTable.java@31 PS9, Line 31: import org.apache.impala.catalog.IcebergStructField; Unused import http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonUtil.java File fe/src/main/java/org/apache/impala/catalog/paimon/PaimonUtil.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonUtil.java@886 PS9, Line 886: nit: needs +4 indent http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java File fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@50 PS9, Line 50: * nit: we prefer non-wildcard imports. Same for L25 & L28. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@130 PS9, Line 130: if (s instanceof DataSplit) { : DataSplit dataSplit = (DataSplit) s; : if (dataSplit.mergedRowCountAvailable()) { : return dataSplit.mergedRowCount(); : } else { : return dataSplit.rowCount(); : } : } else { : return s.rowCount(); : } Could be just: if (s instanceof DataSplit) { DataSplit dataSplit = (DataSplit) s; if (dataSplit.mergedRowCountAvailable()) { return dataSplit.mergedRowCount(); } return s.rowCount(); http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@143 PS9, Line 143: Optional<Long> result = splits_.parallelStream().map(this::estimateSplitRowCount) : .reduce(new BinaryOperator<Long>() : { : @Override : public Long apply(Long a, Long b) { : return a+b; : } : }); : if (result.isPresent()) { : return result.get(); : } else { : return -1; : } Could by written as return splits_.parallelStream() .mapToLong(this::estimateSplitRowCount) .reduce(Long::sum) .orElse(-1); Also, I think in this case plain stream() can be good enough, or even more efficient if threading overhead dominates the computation. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@161 PS9, Line 161: 4 nit: magic constant. Please introduce a variable with a descriptive name for this. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@209 PS9, Line 209: Iterables.concat(conjuncts_)); nit: fits earlier line http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@216 PS9, Line 216: * Assume that scan ranges that can be scheduled locally will be, and that scan : * ranges that cannot will be round-robined across the cluster. As of now, scan ranges are scheduled round-robin, since they have no location information. This also means that the method can be simplified significantly, right? I also noticed that a single Paimon split can contain multiple small data files, which is quite different from regular Impala splits. I.e. in some cases it won't be possible to do local reads for the whole Paimon split, because some data files will be local and some data files won't. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java File fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java@33 PS9, Line 33: Iceberg Paimon. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java@35 PS9, Line 35: Iceberg Paimon http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java@34 PS9, Line 34: handling delete files and time travel. We also want to push : * down Impala conjuncts to Iceberg to filter out partitions and data files. This : * class deals with such complexities. This class currently doesn't deal with any of this. They are taken care by the Paimon scanner in the backend. Time travel and predicate pushdown during planning are also coming later. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1915 PS9, Line 1915: nit: +4 spaces is enough, see code above and below http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/paimon/PaimonSplit.java File fe/src/main/java/org/apache/impala/planner/paimon/PaimonSplit.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/paimon/PaimonSplit.java@39 PS9, Line 39: _ nit: no need for he '_' for public methods. Same in L43. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatNativeWriter.java File fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatNativeWriter.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatNativeWriter.java@34 PS9, Line 34: relevant PR Can you please link the PR? http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java File fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java@38 PS9, Line 38: relevant PR please link PR http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java@92 PS9, Line 92: } nit: please add empty line after this http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java@132 PS9, Line 132: _ nit: no need for the '_' for public methods. Same at L136. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java File fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java@37 PS9, Line 37: * nit: we prefer non-wildcard imports http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java@52 PS9, Line 52: relevant PR please link relevant PR http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java@179 PS9, Line 179: serializeToIpc This method is unused http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java File fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java: http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@71 PS9, Line 71: wrtier nit: writer http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@97 PS9, Line 97: nit: needs +4 indent http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@125 PS9, Line 125: boolean result = writer_.write(iterator_.next()); : if (result) { : rows++; : } The for loop here checks of 'i < batchSize_'. But it is also checked by writer_.write(). Any reason to do this at multiple places? http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@151 PS9, Line 151: if (predicates != null && !predicates.isEmpty()) { : readBuilder.withFilter(predicates); : } Currently predicates are always null/empty, right? All conjuncts are evaluated by the C++ scanner. http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@176 PS9, Line 176: % nit: add space -- To view, visit http://gerrit.cloudera.org:8080/23613 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384 Gerrit-Change-Number: 23613 Gerrit-PatchSet: 9 Gerrit-Owner: ji chen <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: ji chen <[email protected]> Gerrit-Comment-Date: Thu, 27 Nov 2025 15:27:49 +0000 Gerrit-HasComments: Yes
