924060929 commented on PR #65135:
URL: https://github.com/apache/doris/pull/65135#issuecomment-4875292359

   ### Design notes (for discussion, not blockers)
   
   Separating these from the bugs above since they're about the integration 
seams, not clear defects. The core direction is right: `position_deletes` is 
fundamentally a file scan (unlike the other sys tables, which are JVM-side 
`DataTask`s), so reading it natively in BE and reusing the Parquet/ORC readers 
+ the deletion-vector helper is the correct call.
   
   - **Partition column is built at the wrong layer.** Rendering the delete 
file's own-spec partition to plan-time JSON and then matching it positionally 
against the sys table's *union* partition struct (`Partitioning.partitionType`) 
is exactly what produces the silent-`NULL` in #3. Coercing `PartitionData` to 
the table's unified partition type (as Iceberg's own metadata scan does) would 
be more robust than reconstructing per-file-spec JSON on the FE.
   
   - **Sys-table split planning isn't polymorphic.** `position_deletes` is 
special-cased by string compare in `doGetSystemTableSplits` plus a parallel 
`setIcebergPositionDeleteSysTableParams`. The next natively-planned sys table 
would need the same copy-paste (string check + `doGetXxxSplits` + 
`createXxxSysSplit` + `setXxxParams` + a new `TableFormatType` + a BE bypass). 
A hook on `SysTable`/`IcebergSysTable` would localize this.
   
   - **`IcebergSplit` has become a 3-way union.** Seven `positionDelete*` 
fields + a boolean discriminator now sit on the shared `IcebergSplit` alongside 
the data-split and JNI-serialized-split shapes. `positionDeleteContent` is a 
primitive `int` defaulting to `0`, which is the `DATA` content id — a split 
that sets the DV flag but forgets to set the content would be misread as a 
plain position delete with no error. A subclass (or a self-describing 
descriptor) would make the three shapes explicit.
   
   - **Thrift `TIcebergDeleteFileDesc.file_format`** is stamped for this path 
but deliberately left unset by the MOR delete path (and PUFFIN is relabeled 
`PARQUET` so the range lands on the duplicated `file_scanner` branch). Worth 
aligning so the same field means one thing regardless of which producer filled 
it.
   
   These are design opinions, not blockers — happy to defer to your and the 
committers' judgment.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to