Re: question about reader task planning using SupportsReportStatistics

2020-07-17 Thread Jingsong Li
Thanks Sud for the in-depth debugging, and thanks Ryan for the explanation. +1 to having a table property to disable stats estimation. IIUC, the difference between stats estimation and a scan with filters is mainly in the partition filters: Iceberg uses filter push-down to complete partition pruning. So…
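
A minimal sketch of the filter-push-down side of this point, using Iceberg's core TableScan API; the Table instance and the partition column name "event_date" are illustrative assumptions, not taken from the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.CloseableIterable;

final class FilteredScanExample {
  static void planWithPartitionFilter(Table table) {
    // The pushed-down predicate lets planning skip manifests and data files
    // whose partition ranges cannot match.
    TableScan scan = table.newScan()
        .filter(Expressions.equal("event_date", "2020-07-17"));

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      tasks.forEach(task -> System.out.println(task.file().path()));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```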

Re: question about reader task planning using SupportsReportStatistics

2020-07-17 Thread Ryan Blue
Hey, great question. I just caught up on the other thread, but let me provide some context here. Spark uses the stats estimation here to determine whether or not to broadcast. If we returned a default value, then Spark wouldn't be able to use Iceberg tables in broadcast joins. Even though Spark wo…
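
To make the broadcast point concrete, here is a minimal sketch, assuming the Spark 2.4 DataSourceV2 interfaces this thread is about (SupportsReportStatistics, Statistics); it is not Iceberg's actual Reader implementation.

```java
// Spark compares the reported sizeInBytes against
// spark.sql.autoBroadcastJoinThreshold when deciding whether to broadcast a
// join side, which is why reporting a made-up default would keep Iceberg
// tables out of broadcast joins.
import java.util.OptionalLong;

import org.apache.spark.sql.sources.v2.reader.Statistics;
import org.apache.spark.sql.sources.v2.reader.SupportsReportStatistics;

abstract class StatsReportingReader implements SupportsReportStatistics {

  // In a real reader these would be summed from the planned file scan tasks.
  protected abstract long totalSizeInBytes();

  protected abstract long totalRowCount();

  @Override
  public Statistics estimateStatistics() {
    long size = totalSizeInBytes();
    long rows = totalRowCount();
    return new Statistics() {
      @Override
      public OptionalLong sizeInBytes() {
        return OptionalLong.of(size);
      }

      @Override
      public OptionalLong numRows() {
        return OptionalLong.of(rows);
      }
    };
  }
}
```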

Re: question about reader task planning using SupportsReportStatistics

2020-07-17 Thread Sud
As per the Javadoc, estimateStatistics does not take into account any operators; any reason why the Iceberg reader implements this? I wonder if it would help to make it configurable and return a default value. /** * A mix in interface for {@link DataSourceReader}. Data source readers can implement this *…
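
A rough sketch of what "configurable with a default" could look like; the property name below is hypothetical (not an existing Iceberg option), and, as Ryan explains above, reporting unknown statistics also takes broadcast-join planning off the table.

```java
import java.util.Map;
import java.util.OptionalLong;

import org.apache.spark.sql.sources.v2.reader.Statistics;

final class ConfigurableStats {
  // Hypothetical property name, for illustration only.
  private static final String PROP = "read.stats-estimation.enabled";

  static Statistics estimate(Map<String, String> tableProperties, Statistics computed) {
    boolean enabled =
        Boolean.parseBoolean(tableProperties.getOrDefault(PROP, "true"));
    if (!enabled) {
      // Empty OptionalLongs report the statistics as unknown and let Spark
      // fall back to its own defaults.
      return new Statistics() {
        @Override
        public OptionalLong sizeInBytes() {
          return OptionalLong.empty();
        }

        @Override
        public OptionalLong numRows() {
          return OptionalLong.empty();
        }
      };
    }
    // Otherwise use the (potentially expensive) computed statistics.
    return computed;
  }
}
```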

Re: question about reader task planning & BinPacking

2020-07-17 Thread Sud
OK, after adding more instrumentation I see that Reader::estimateStatistics may be the culprit. It looks like the stats estimation may be performing a full-table estimate, and that's why it is so slow. Does anyone know if it is possible to avoid Reader::estimateStatistics? Also, does estimateStatistics use appr…
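
An illustrative debugging helper of the kind of instrumentation mentioned here (a sketch, not Iceberg code): time the estimateStatistics() call to confirm where planning time goes.

```java
import org.apache.spark.sql.sources.v2.reader.Statistics;
import org.apache.spark.sql.sources.v2.reader.SupportsReportStatistics;

final class StatsTimer {
  static Statistics timedEstimate(SupportsReportStatistics reader) {
    long start = System.nanoTime();
    Statistics stats = reader.estimateStatistics();
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.printf("estimateStatistics took %d ms (sizeInBytes=%s, numRows=%s)%n",
        elapsedMs, stats.sizeInBytes(), stats.numRows());
    return stats;
  }
}
```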

Re: question about reader task planning & BinPacking

2020-07-17 Thread Sud
Thanks @Jingsong for the reply. Yes, one additional data point about the table: it is an Avro table generated from stream ingestion. We expect a couple of thousand snapshots to be created daily. We are using the appendsBetween API, and I think any compaction operation will break that API, but I will ta…
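
For context, the incremental-read pattern described here looks roughly like this with the core TableScan API; the snapshot IDs are placeholders for whatever checkpoint the ingestion job keeps, and the Table instance is assumed to be loaded from a catalog elsewhere.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.io.CloseableIterable;

final class IncrementalReadExample {
  static void planAppends(Table table, long fromSnapshotId, long toSnapshotId) {
    // Plans only the files appended between the two snapshots, rather than
    // everything in the latest snapshot.
    TableScan scan = table.newScan().appendsBetween(fromSnapshotId, toSnapshotId);

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      tasks.forEach(task -> System.out.println(task.file().path()));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```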

Re: question about reader task planning & BinPacking

2020-07-17 Thread Jingsong Li
Hi Sud, A batch read of an Iceberg table should just read the latest snapshot. I think the issue in this case is that your large tables have a large number of manifest files. 1. The simple way is to reduce the number of manifest files: - To reduce the manifest file count, you can try `Actions.rewriteManifests` (Than…
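
A sketch of the manifest-rewrite suggestion using the Spark Actions API bundled with Iceberg at the time (`org.apache.iceberg.actions.Actions`); exact entry points may differ between versions.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;

final class RewriteManifestsExample {
  static void compactManifests(Table table) {
    // Compacts many small manifests into fewer, larger ones so that job
    // planning reads less metadata.
    Actions.forTable(table)
        .rewriteManifests()
        .execute();
  }
}
```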