Re: [I] Comet native Iceberg scan duplicates rows when splitting a single-row-group Parquet file into multiple byte-range tasks [datafusion-comet]

via GitHub Wed, 10 Jun 2026 23:37:21 -0700


advancedxy commented on issue #4590:
URL: 
https://github.com/apache/datafusion-comet/issues/4590#issuecomment-4677860731


   ```
   val dataFile = org.apache.iceberg.DataFiles
     .builder(table.spec())
     .withPath(sourceParquetFile.getAbsolutePath)
     .withFormat(org.apache.iceberg.FileFormat.PARQUET)
     .withFileSizeInBytes(sourceParquetFile.length())
     .withRecordCount(1)
     .build()
   ```
   
   I think this might be the problematic part. For Iceberg Parquet files 
produced by query engines through an Iceberg connector (such as Spark), the 
split offsets are inferred and recorded when committing, see [ref 
1](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/io/DataWriter.java#L93).
  The data file metadata in the example is created manually, without split 
offsets, so the file will be split by the requested split size rather than by 
the actual row group split offsets.
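
To make the difference concrete, here is a minimal, dependency-free sketch of the two planning modes and of one way a reader can avoid duplicates. Everything in it (`ByteRange`, `RowGroup`, `SplitSketch`, and the midpoint ownership rule) is a hypothetical model for illustration; it is not the Iceberg or iceberg-rust API, and the actual fix may look different.

```scala
// Hypothetical model types; these are NOT Iceberg or iceberg-rust classes.
final case class ByteRange(start: Long, length: Long) {
  def end: Long = start + length
}

final case class RowGroup(start: Long, length: Long) {
  // A single reference point used to decide which task owns the row group.
  def midpoint: Long = start + length / 2
}

object SplitSketch {
  // With split offsets (row group start positions): one range per row group.
  def planByOffsets(fileSize: Long, splitOffsets: Seq[Long]): Seq[ByteRange] = {
    val bounds = splitOffsets.sorted :+ fileSize
    bounds.sliding(2).collect {
      case Seq(s, e) if e > s => ByteRange(s, e - s)
    }.toVector
  }

  // Without split offsets: fixed-size ranges that ignore row group boundaries.
  def planBySize(fileSize: Long, targetSize: Long): Seq[ByteRange] = {
    require(targetSize > 0, "targetSize must be positive")
    (0L until fileSize by targetSize).map { s =>
      ByteRange(s, math.min(targetSize, fileSize - s))
    }
  }

  // Assumed reader-side rule: a task reads a row group only if the row
  // group's midpoint falls inside the task's byte range.
  def owns(range: ByteRange, rg: RowGroup): Boolean =
    range.start <= rg.midpoint && rg.midpoint < range.end
}
```

For a 1000-byte file with a single row group assumed to span bytes 4 to 900, `planBySize(1000L, 400L)` yields three ranges, but under the midpoint rule only the second one owns the row group, so the row is read once. A reader that ignores the byte range and reads the whole file for every task would return the row three times.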
   
   Anyway, this is still a valid data file, and the problem should be fixed on 
the iceberg-rust side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

