Thanks Sud for the in-depth debugging, and thanks Ryan for the explanation.
+1 to having a table property to disable stats estimation.
IIUC, the difference between stats estimation and a scan with filters is
mainly in the partition filters:
Iceberg uses filter push-down to complete partition pruning. So
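A minimal sketch of that push-down path, assuming an already-loaded
org.apache.iceberg.Table and a made-up partition column name:

import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;

// Pushed filters are evaluated against partition metadata first, so
// manifests and data files in non-matching partitions are skipped.
Table table = loadTable();  // hypothetical helper returning the table
TableScan pruned = table.newScan()
    .filter(Expressions.equal("event_date", "2020-01-01"));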
Hey, great question. I just caught up on the other thread, but let me
provide some context here.
Spark uses the stats estimation here to determine whether or not to
broadcast. If we returned a default value, then Spark wouldn't be able to
use Iceberg tables in broadcast joins. Even though Spark wo
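For reference, the mechanism here is that Spark compares the estimated size
against spark.sql.autoBroadcastJoinThreshold when planning joins; the value
below is just Spark's documented default, not something from this thread:

// "spark" is an existing SparkSession; a table whose estimated size is
// below this threshold is eligible for a broadcast join.
spark.conf().set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024);  // 10 MB default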
As per the java doc, estimateStatistics does not take into account any
operators. Any reason why the Iceberg reader implements this? I wonder if it
would help to make it configurable and return a default value.
/**
 * A mix in interface for {@link DataSourceReader}. Data source readers can
 * implement this interface to report statistics to Spark.
 *
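A minimal sketch of that suggestion, assuming a Spark 2.x DSv2 reader and a
hypothetical table property gating estimation (neither the class name nor the
property exists in Iceberg today):

import java.util.OptionalLong;
import org.apache.spark.sql.sources.v2.reader.Statistics;

class DefaultableStats implements Statistics {
  private final boolean skipEstimation;  // from the proposed table property
  private final Statistics computed;     // the current manifest-based estimate

  DefaultableStats(boolean skipEstimation, Statistics computed) {
    this.skipEstimation = skipEstimation;
    this.computed = computed;
  }

  @Override
  public OptionalLong sizeInBytes() {
    // Empty lets Spark fall back to its own default size estimate.
    return skipEstimation ? OptionalLong.empty() : computed.sizeInBytes();
  }

  @Override
  public OptionalLong numRows() {
    return skipEstimation ? OptionalLong.empty() : computed.numRows();
  }
}

As Ryan points out above, skipping estimation would trade planning time for
broadcast-join opportunities.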
OK, after adding more instrumentation, I see that Reader::estimateStatistics
may be the culprit.
It looks like stats estimation may be performing a full-table estimate, and
that's why it is so slow. Does anyone know if it is possible to
avoid Reader::estimateStatistics?
Also, does estimateStatistics use appr
Thanks @Jingsong for the reply.
Yes, one additional data point about the table:
this is an Avro table generated from stream ingestion. We expect a
couple of thousand snapshots to be created daily.
We are using the appendsBetween API; I think any compaction operation
will break the API, but I will ta
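For reference, a sketch of the incremental-read pattern being described, with
placeholder snapshot IDs and table handle:

import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;

// Scan only the data files appended between two snapshots. Compaction
// produces non-append snapshots, which is why it can break this pattern.
Table table = loadTable();  // hypothetical helper returning the table
long fromSnapshotId = 1L;   // placeholder snapshot IDs
long toSnapshotId = 2L;
TableScan incremental = table.newScan()
    .appendsBetween(fromSnapshotId, toSnapshotId);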
Hi Sud,
The batch read of the Iceberg table should just read the latest snapshot.
I think the issue in this case is that your large tables have a large number
of manifest files.
1. The simple way is to reduce the number of manifest files:
- To reduce the manifest file count, you can try
`Actions.rewriteManifests` (Than
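A hedged sketch of that action using the Spark Actions API; the size
threshold below is illustrative, not from this thread:

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;

// Compact many small manifests into fewer, larger ones so that planning
// and stats estimation read far less metadata.
Table table = loadTable();  // hypothetical helper returning the table
Actions.forTable(table)
    .rewriteManifests()
    .rewriteIf(manifest -> manifest.length() < 10L * 1024 * 1024)  // only rewrite small manifests
    .execute();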