Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/20760 )
Change subject: IMPALA-12313: (part 3) Add UPDATE support for Iceberg tables. ...................................................................... Patch Set 3: (19 comments) Thanks Zoltán, great patch! So far I've only been able to review it up until iceberg-buffered-delete-sink.cc. http://gerrit.cloudera.org:8080/#/c/20760/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20760/2//COMMIT_MSG@26 PS2, Line 26: FlushFinal Can it spill to disk if the data it has received can't be stored in memory? http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h File be/src/exec/iceberg-buffered-delete-sink.h: http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@48 PS3, Line 48: to the temporary Hdfs files It can be misunderstood, one may thing that we write _into_ Hdfs files, like spilling to disk. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@69 PS3, Line 69: class FilePositionsIterator { Can this be put into the .cc file? http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@74 PS3, Line 74: pos_level_it_ = file_level_it_->second.begin(); Shouldn't we check whether file_pos is empty? http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@78 PS3, Line 78: StringValue filepath = file_level_it_->first; We should check for emptiness. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@83 PS3, Line 83: bool HasNext() { return file_level_it_ != file_level_end_; } Does HasNext() mean we can call Get() or that there is a valid item after the current one? In the first case is misunderstandable. If it is we have a valid item after the last one I think there's an off-by-one error: when we're currently at the last file position of the last file, both 'file_level_it_' and 'pos_level_it_' point to valid fields (HasNext() returns true) but there is no valid next element. See also L98. The semantics of Get(), Next() and HasNext() could be clarified. We could also eliminate Get() and have Next() return the next value. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@96 PS3, Line 96: DCHECK(file_level_it_ != file_level_end_); A comment on Next() could indicate that Next() shouldn't be called when iteration is over. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@98 PS3, Line 98: file_level_it_ After incrementing 'file_level_it_' on the previous line, it may now point past-the-end, so we can't dereference it. See also L83. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@121 PS3, Line 121: /// Verifies that there are no duplicates in the buffered delete records. We could include in the comment that it should be called after SortBufferedRecords(). http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@151 PS3, Line 151: server Nit: serve? http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc File be/src/exec/iceberg-buffered-delete-sink.cc: http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc@124 PS3, Line 124: filepath_sv We could DCHECK that filepath_sv is not NULL. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc@149 PS3, Line 149: auto Optional: to ease understanding, we could use actual types instead of auto in these loops. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc@161 PS3, Line 161: auto See L149. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc@184 PS3, Line 184: auto See L149. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc@196 PS3, Line 196: file_and_pos.first We could use 'file' for 'file_and_pos.first' here and with the Len() too. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc@218 PS3, Line 218: s Isn't it only one partition? Maybe 'partition_encoded' would be more understandable? http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc@273 PS3, Line 273: // we might need to delete rows from partition "col_trunc=1000" with both spec ids. How is this comment connected to the next line? Could you explain? http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc@288 PS3, Line 288: buffer Shouldn't it be '*buffer'? http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.cc@320 PS3, Line 320: int64_t* pos_slot = row->GetTuple(position_ref->GetTupleIdx())-> Just curious, can filepath_ref->GetTupleIdx() be different from position_ref->GetTupleIdx()? If not we could add a DCHECK. -- To view, visit http://gerrit.cloudera.org:8080/20760 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bb97b4454165a292975d88dc9c23adb22ff7315 Gerrit-Change-Number: 20760 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Comment-Date: Fri, 08 Dec 2023 17:07:52 +0000 Gerrit-HasComments: Yes
