Re: question about reader task planning & BinPacking

2020-07-17 Thread Sud
OK, after adding more instrumentation I see that Reader::estimateStatistics may be the culprit. It looks like estimateStatistics may be performing a full-table estimate, and that's why it is so slow. Does anyone know if it is possible to avoid Reader::estimateStatistics? Also does estimateStatistics use appr

Re: question about reader task planning & BinPacking

2020-07-17 Thread Sud
Thanks @Jingsong for the reply. Yes, one additional data point about the table: it is an Avro table generated from stream ingestion. We expect a couple of thousand snapshots to be created daily. We are using the appendsBetween API, and I think any compaction operation will break the API, but I will ta

Re: question about reader task planning & BinPacking

2020-07-17 Thread Jingsong Li
Hi Sud, The batch read of the Iceberg table should just read the latest snapshot. I think in this case your large tables have a large number of manifest files. 1. The simple way is to reduce the number of manifest files: - To reduce the number of manifest files, you can try `Actions.rewriteManifests` (Than

question about reader task planning & BinPacking

2020-07-16 Thread Sud
Hi Iceberg-devs, We are trying to root-cause an issue where the driver gets stuck when trying to read comparatively large tables (> 2000 snapshots). When I looked at the thread dump of the driver's main thread, I saw that the thread is stuck in planning tasks. I also noticed that iceberg-worker-pool is