Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24071 )

Change subject: IMPALA-14755:(part 1) Implement Puffin Blob reader and File 
writer
......................................................................


Patch Set 5:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/output-partition.h
File be/src/exec/output-partition.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/output-partition.h@132
PS5, Line 132:   std::unique_ptr<PuffinWriteResult> puffin_result =
             :       std::make_unique<PuffinWriteResult>();
nit: Since you always create an object, can this be a normal field? I.e.:

  PuffinWriteResult puffin_result;


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/blob.h
File be/src/exec/puffin/blob.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/blob.h@25
PS5, Line 25: namespace impala {
nit: add empty line before namespace impala.


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h
File be/src/exec/puffin/puffin-writer.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h@107
PS5, Line 107: using RapidJSON
nit: we probably don't want to mention this implementation detail in the 
function comment


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h@119
PS5, Line 119: = puffin::File();
nit: this seems unnecessary


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc
File be/src/exec/puffin/puffin-writer.cc:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@107
PS5, Line 107:   // We expect the row batch to have two columns: filepath 
(string) and position (bigint)
             :   // Similar to how IcebergBufferedDeleteSink processes rows
The comment fits L111-113 better.


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@122
PS5, Line 122:     std::string 
filepath(reinterpret_cast<char*>(filepath_sv.ptr), filepath_sv.len);
             :
             :     // Extract position (row number)
             :     impala_udf::BigIntVal position_bi = 
position_eval->GetBigIntVal(row);
             :     DCHECK(!position_bi.is_null);
             :     int64_t position = position_bi.val;
             :
             :     // Add the position to the bitmap for this file
             :     // If the file doesn't exist in the map yet, it will be 
created
             :     
deletion_vectors_[filepath].AddElements({static_cast<uint64_t>(position)});
Do we know how efficient it is for large-scale deletes? I'd assume we can see 
the same filepath in a sequence, i.e. we could cache the previous values and 
iterators.


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@149
PS5, Line 149:     THash128 file_path_hash;
             :     file_path_hash.__set_high(hash_result.low64);
             :     file_path_hash.__set_low(hash_result.high64);
nit: should we swap high/low here and in IcebergUtil? It's just odd to see 
'__set_high(hash_result.low64);'


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@148
PS5, Line 148:     XXH128_hash_t hash_result = 
XXH3_128bits(data_file_path.c_str(), data_file_path.size());
             :     THash128 file_path_hash;
             :     file_path_hash.__set_high(hash_result.low64);
             :     file_path_hash.__set_low(hash_result.high64);
nit: this could go to a utility method.


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h
File be/src/util/thash128-util.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h@20
PS5, Line 20: #include <functional>
Is it needed?


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h@41
PS5, Line 41: inline
nit: 'inline' keyword is redundant here, functions defined in struct/class body 
are implicitly inline.


http://gerrit.cloudera.org:8080/#/c/24071/5/common/fbs/IcebergObjects.fbs
File common/fbs/IcebergObjects.fbs:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/fbs/IcebergObjects.fbs@87
PS5, Line 87:   referenced_data_file_hash_high: long;
            :   referenced_data_file_hash_low: long;
When are these being used? If only in a later patch, could these be added 
later? In that case it will be easier to figure out if they are really needed.


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift@124
PS5, Line 124:
nit: redundant space


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/Types.thrift
File common/thrift/Types.thrift:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/Types.thrift@307
PS5, Line 307: 4: optional i64 record_count;
Could you please add a comment why and when it is needed?


http://gerrit.cloudera.org:8080/#/c/24071/5/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/24071/5/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@92
PS5, Line 92: "na", "na", "na"
Could be "Puffin", "Puffin", "Puffin" instead, in case if it appears somewhere 
(e.g. an error message).



--
To view, visit http://gerrit.cloudera.org:8080/24071
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I068a071f9db907064ccec8568db5234863eb4587
Gerrit-Change-Number: 24071
Gerrit-PatchSet: 5
Gerrit-Owner: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 18 Mar 2026 17:10:49 +0000
Gerrit-HasComments: Yes

Reply via email to