Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23613 )

Change subject: IMPALA-14092 Part2: Support querying of paimon data table via 
JNI
......................................................................


Patch Set 9:

(53 comments)

http://gerrit.cloudera.org:8080/#/c/23613/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23613/9//COMMIT_MSG@39
PS9, Line 39: 2.x
Did the use of static_cast instead of dynamic_cast improve it further?


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.h
File be/src/exec/paimon/paimon-jni-row-reader.h:

http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.h@49
PS9, Line 49: a arrow batchs
nit: "a row denoted by 'row_idx' from an arrow batch"


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.h@54
PS9, Line 54:   /// Templates to cast the value of a arrow batch simple 
primitive to the
            :   /// matching Impala type and writes it into the target slot.
nit: Template method that writes a value from 'arraw_array' to 'slot'. 'T' is 
the type of the slot, 'AT' is the proper subtype of the arrow::Array.


http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc
File be/src/exec/paimon/paimon-jni-row-reader.cc:

http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc@270
PS8, Line 270:
> TIMESTAMP WITH LOCAL TIME ZONE have tested locally, the result is compared
I see. It's possible to set the local timezone via query option 'TIMEZONE'. We 
do it in a few tests, you can just "git grep 'set TIMEZONE'"


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc
File be/src/exec/paimon/paimon-jni-row-reader.cc:

http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc@50
PS9, Line 50:
nit: redundant space


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc@188
PS9, Line 188:  (
nit: no need for this parenthesis


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc@220
PS9, Line 220:
nit: redundant space


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-row-reader.cc@233
PS9, Line 233: std::string &tz_name = type.timezone();
             :   const Timezone *
nit: we put */& next to the type (without a space), e.g.:

  const std::string& tz_name = ...
  const Timezone* tz = ...


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc
File be/src/exec/paimon/paimon-jni-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@54
PS9, Line 54: ScanPrepareTime
It measures the time spent in Open(). Shouldn't it be called ScanOpenTime?


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@86
PS9, Line 86:
nit: needs +4 indent


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@143
PS9, Line 143:     /// update allocated offheap memory reported from Jni.
             :     if (offheap_consumed_bytes > 
paimon_last_arrow_record_batch_consumed_bytes_) {
             :       OffheapTrackFree();
             :       OffheapTrackAllocation(offheap_consumed_bytes);
             :     }
Any reason why we only update the tracked offheap memory usage when the new 
arrow record batch consumes more memory?


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@151
PS9, Line 151: .ValueOrDie();
>From the arrow documentation:

* This method should only be called if this Result object’s status is OK (i.e. 
a call to ok() returns true), otherwise this call will abort.


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@153
PS9, Line 153: DCHECK
nit: DCHECK_EQ should be used instead which outputs the actual values.


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scan-node.cc@163
PS9, Line 163: (
nit: no need for the parenthesis


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.h
File be/src/exec/paimon/paimon-jni-scanner.h:

http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.h@56
PS9, Line 56: inline
nit: inline keywords are redundant


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.cc
File be/src/exec/paimon/paimon-jni-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.cc@76
PS9, Line 76:
nit: needs +4 indent


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.cc@76
PS9, Line 76: long * rows, long *offheap_used
nit: long* rows, long* offheap_used


http://gerrit.cloudera.org:8080/#/c/23613/9/be/src/exec/paimon/paimon-jni-scanner.cc@77
PS9, Line 77: .
nit: ", and the offheap usage in bytes"


http://gerrit.cloudera.org:8080/#/c/23613/9/bin/bootstrap_toolchain.py
File bin/bootstrap_toolchain.py:

http://gerrit.cloudera.org:8080/#/c/23613/9/bin/bootstrap_toolchain.py@480
PS9, Line 480: , "arrow"
nit: the packages are in lexicographic order


http://gerrit.cloudera.org:8080/#/c/23613/9/bin/impala-config.sh
File bin/impala-config.sh:

http://gerrit.cloudera.org:8080/#/c/23613/9/bin/impala-config.sh@210
PS9, Line 210: 13
Seems like the newest stable version is 22. Any reason we use version 13? Even 
Paimon uses Arrow 15 and 18 based on their pom.xml.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/Table.java
File fe/src/main/java/org/apache/impala/catalog/Table.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/Table.java@673
PS9, Line 673:
nit: indentation is off


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/Table.java@676
PS9, Line 676: e
nit: missing space


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalPaimonTable.java
File fe/src/main/java/org/apache/impala/catalog/local/LocalPaimonTable.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalPaimonTable.java@49
PS9, Line 49:
nit: too much indentation here and below


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalPaimonTable.java@68
PS9, Line 68: addVirtualColumns(ref.getVirtualColumns());
Why do we add virtual columns if they cannot be queried?


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java
File fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/local/LocalTable.java@60
PS9, Line 60: import org.apache.impala.thrift.TColumnType;
Unused import


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/FePaimonTable.java
File fe/src/main/java/org/apache/impala/catalog/paimon/FePaimonTable.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/FePaimonTable.java@45
PS9, Line 45:
The imports are unused.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonImpalaTypeUtils.java
File 
fe/src/main/java/org/apache/impala/catalog/paimon/PaimonImpalaTypeUtils.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonImpalaTypeUtils.java@212
PS9, Line 212:         for (int i = 0; i < fields.size(); i++) {
             :           builder.field(fields.get(i).getName(), 
visit(fields.get(i).getType(), visitor));
             :         }
Any reason to change this? The original is more concise and readable.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonStructField.java
File fe/src/main/java/org/apache/impala/catalog/paimon/PaimonStructField.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonStructField.java@78
PS9, Line 78: (
nit: no need for the parentheses, same goes for next line


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonTable.java
File fe/src/main/java/org/apache/impala/catalog/paimon/PaimonTable.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonTable.java@31
PS9, Line 31: import org.apache.impala.catalog.IcebergStructField;
Unused import


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonUtil.java
File fe/src/main/java/org/apache/impala/catalog/paimon/PaimonUtil.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/catalog/paimon/PaimonUtil.java@886
PS9, Line 886:
nit: needs +4 indent


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java
File fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@50
PS9, Line 50: *
nit: we prefer non-wildcard imports. Same for L25 & L28.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@130
PS9, Line 130:     if (s instanceof DataSplit) {
             :       DataSplit dataSplit = (DataSplit) s;
             :       if (dataSplit.mergedRowCountAvailable()) {
             :         return dataSplit.mergedRowCount();
             :       } else {
             :         return dataSplit.rowCount();
             :       }
             :     } else {
             :       return s.rowCount();
             :     }
Could be just:

    if (s instanceof DataSplit) {
      DataSplit dataSplit = (DataSplit) s;
      if (dataSplit.mergedRowCountAvailable()) {
        return dataSplit.mergedRowCount();
      }
    return s.rowCount();


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@143
PS9, Line 143:     Optional<Long> result = 
splits_.parallelStream().map(this::estimateSplitRowCount)
             :             .reduce(new BinaryOperator<Long>()
             :     {
             :       @Override
             :       public Long apply(Long a, Long b) {
             :         return a+b;
             :       }
             :     });
             :     if (result.isPresent()) {
             :       return result.get();
             :     } else {
             :       return -1;
             :     }
Could by written as

  return splits_.parallelStream()
      .mapToLong(this::estimateSplitRowCount)
      .reduce(Long::sum)
      .orElse(-1);

Also, I think in this case plain stream() can be good enough, or even more 
efficient if threading overhead dominates the computation.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@161
PS9, Line 161: 4
nit: magic constant. Please introduce a variable with a descriptive name for 
this.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@209
PS9, Line 209: Iterables.concat(conjuncts_));
nit: fits earlier line


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanNode.java@216
PS9, Line 216:    * Assume that scan ranges that can be scheduled locally will 
be, and that scan
             :    * ranges that cannot will be round-robined across the cluster.
As of now, scan ranges are scheduled round-robin, since they have no location 
information. This also means that the method can be simplified significantly, 
right?

I also noticed that a single Paimon split can contain multiple small data 
files, which is quite different from regular Impala splits. I.e. in some cases 
it won't be possible to do local reads for the whole Paimon split, because some 
data files will be local and some data files won't.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java@33
PS9, Line 33: Iceberg
Paimon.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java@35
PS9, Line 35: Iceberg
Paimon


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/PaimonScanPlanner.java@34
PS9, Line 34: handling delete files and time travel. We also want to push
            :  * down Impala conjuncts to Iceberg to filter out partitions and 
data files. This
            :  * class deals with such complexities.
This class currently doesn't deal with any of this. They are taken care by the 
Paimon scanner in the backend.

Time travel and predicate pushdown during planning are also coming later.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1915
PS9, Line 1915:    
nit: +4 spaces is enough, see code above and below


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/paimon/PaimonSplit.java
File fe/src/main/java/org/apache/impala/planner/paimon/PaimonSplit.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/planner/paimon/PaimonSplit.java@39
PS9, Line 39: _
nit: no need for he '_' for public methods. Same in L43.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatNativeWriter.java
File 
fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatNativeWriter.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatNativeWriter.java@34
PS9, Line 34: relevant PR
Can you please link the PR?


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java
File 
fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java@38
PS9, Line 38: relevant PR
please link PR


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java@92
PS9, Line 92: }
nit: please add empty line after this


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowFormatWriter.java@132
PS9, Line 132: _
nit: no need for the '_' for public methods. Same at L136.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java
File fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java@37
PS9, Line 37: *
nit: we prefer non-wildcard imports


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java@52
PS9, Line 52: relevant PR
please link relevant PR


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonArrowUtils.java@179
PS9, Line 179: serializeToIpc
This method is unused


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java
File fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java:

http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@71
PS9, Line 71: wrtier
nit: writer


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@97
PS9, Line 97:
nit: needs +4 indent


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@125
PS9, Line 125:         boolean result = writer_.write(iterator_.next());
             :         if (result) {
             :           rows++;
             :         }
The for loop here checks of 'i < batchSize_'. But it is also checked by 
writer_.write(). Any reason to do this at multiple places?


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@151
PS9, Line 151:       if (predicates != null && !predicates.isEmpty()) {
             :         readBuilder.withFilter(predicates);
             :       }
Currently predicates are always null/empty, right? All conjuncts are evaluated 
by the C++ scanner.


http://gerrit.cloudera.org:8080/#/c/23613/9/fe/src/main/java/org/apache/impala/util/paimon/PaimonJniScanner.java@176
PS9, Line 176: %
nit: add space



--
To view, visit http://gerrit.cloudera.org:8080/23613
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384
Gerrit-Change-Number: 23613
Gerrit-PatchSet: 9
Gerrit-Owner: ji chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: ji chen <[email protected]>
Gerrit-Comment-Date: Thu, 27 Nov 2025 15:27:49 +0000
Gerrit-HasComments: Yes

Reply via email to