Re: question about reader task planning using SupportsReportStatistics

2020-07-19 Thread Sud
Thanks Ryan and Jingsong. I will add one more TODO to see if we can use Spark to parallelize estimation even when predicate pushdown is done (Spark does this for file-system-based tables), and possibly for manifest readers. I will try to submit a PR upstream for adding options and will create iss

Re: question about reader task planning using SupportsReportStatistics

2020-07-17 Thread Jingsong Li
Thanks Sud for the in-depth debugging, and thanks Ryan for the explanation. +1 to having a table property to disable stats estimation. IIUC, the difference between stats estimation and a scan with filters is mainly in the partition filters: Iceberg uses filter pushdown to complete partition pruning. So

Re: question about reader task planning using SupportsReportStatistics

2020-07-17 Thread Ryan Blue
Hey, great question. I just caught up on the other thread, but let me provide some context here. Spark uses the stats estimation here to determine whether or not to broadcast. If we returned a default value, then Spark wouldn't be able to use Iceberg tables in broadcast joins. Even though Spark wo

Re: question about reader task planning using SupportsReportStatistics

2020-07-17 Thread Sud
As per the Javadoc, estimateStatistics does not take into account any operators. Is there any reason why the Iceberg reader implements this? I wonder if it would help to make it configurable and return a default value. /** * A mix in interface for {@link DataSourceReader}. Data source readers can implement this *