Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/24071 )
Change subject: IMPALA-14755:(part 1) Implement Puffin Blob reader and File writer ...................................................................... Patch Set 5: (14 comments) http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/output-partition.h File be/src/exec/output-partition.h: http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/output-partition.h@132 PS5, Line 132: std::unique_ptr<PuffinWriteResult> puffin_result = : std::make_unique<PuffinWriteResult>(); nit: Since you always create an object, can this be a normal field? I.e.: PuffinWriteResult puffin_result; http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/blob.h File be/src/exec/puffin/blob.h: http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/blob.h@25 PS5, Line 25: namespace impala { nit: add empty line before namespace impala. http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h File be/src/exec/puffin/puffin-writer.h: http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h@107 PS5, Line 107: using RapidJSON nit: we probably don't want to mention this implementation detail in the function comment http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h@119 PS5, Line 119: = puffin::File(); nit: this seems unnecessary http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc File be/src/exec/puffin/puffin-writer.cc: http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@107 PS5, Line 107: // We expect the row batch to have two columns: filepath (string) and position (bigint) : // Similar to how IcebergBufferedDeleteSink processes rows The comment fits L111-113 better. http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@122 PS5, Line 122: std::string filepath(reinterpret_cast<char*>(filepath_sv.ptr), filepath_sv.len); : : // Extract position (row number) : impala_udf::BigIntVal position_bi = position_eval->GetBigIntVal(row); : DCHECK(!position_bi.is_null); : int64_t position = position_bi.val; : : // Add the position to the bitmap for this file : // If the file doesn't exist in the map yet, it will be created : deletion_vectors_[filepath].AddElements({static_cast<uint64_t>(position)}); Do we know how efficient it is for large-scale deletes? I'd assume we can see the same filepath in a sequence, i.e. we could cache the previous values and iterators. http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@149 PS5, Line 149: THash128 file_path_hash; : file_path_hash.__set_high(hash_result.low64); : file_path_hash.__set_low(hash_result.high64); nit: should we swap high/low here and in IcebergUtil? It's just odd to see '__set_high(hash_result.low64);' http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@148 PS5, Line 148: XXH128_hash_t hash_result = XXH3_128bits(data_file_path.c_str(), data_file_path.size()); : THash128 file_path_hash; : file_path_hash.__set_high(hash_result.low64); : file_path_hash.__set_low(hash_result.high64); nit: this could go to a utility method. http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h File be/src/util/thash128-util.h: http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h@20 PS5, Line 20: #include <functional> Is it needed? http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h@41 PS5, Line 41: inline nit: 'inline' keyword is redundant here, functions defined in struct/class body are implicitly inline. http://gerrit.cloudera.org:8080/#/c/24071/5/common/fbs/IcebergObjects.fbs File common/fbs/IcebergObjects.fbs: http://gerrit.cloudera.org:8080/#/c/24071/5/common/fbs/IcebergObjects.fbs@87 PS5, Line 87: referenced_data_file_hash_high: long; : referenced_data_file_hash_low: long; When are these being used? If only in a later patch, could these be added later? In that case it will be easier to figure out if they are really needed. http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift File common/thrift/DataSinks.thrift: http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift@124 PS5, Line 124: nit: redundant space http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/Types.thrift File common/thrift/Types.thrift: http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/Types.thrift@307 PS5, Line 307: 4: optional i64 record_count; Could you please add a comment why and when it is needed? http://gerrit.cloudera.org:8080/#/c/24071/5/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java: http://gerrit.cloudera.org:8080/#/c/24071/5/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@92 PS5, Line 92: "na", "na", "na" Could be "Puffin", "Puffin", "Puffin" instead, in case if it appears somewhere (e.g. an error message). -- To view, visit http://gerrit.cloudera.org:8080/24071 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I068a071f9db907064ccec8568db5234863eb4587 Gerrit-Change-Number: 24071 Gerrit-PatchSet: 5 Gerrit-Owner: Peter Rozsa <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 18 Mar 2026 17:10:49 +0000 Gerrit-HasComments: Yes
