Hi Edgar,

You might want to take a look at this: https://github.com/apache/iceberg/pull/2329

The PR updates the Hive table statistics stored in the HiveCatalog whenever a change to the table is committed. This solves the issue with the upstream Hive code, and might solve it for other versions as well.
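At a high level, the change copies the totals Iceberg already tracks in the snapshot summary into the table parameters that Hive reads basic stats from. A simplified sketch, not the PR's exact code - the `syncBasicStats` helper and the `hmsTable` handle are illustrative:

    import java.util.Map;
    import org.apache.hadoop.hive.common.StatsSetupConst;
    import org.apache.iceberg.SnapshotSummary;
    import org.apache.iceberg.Table;

    // Copy the totals Iceberg keeps in the snapshot summary into the HMS
    // table parameters that Hive reads basic stats from. `hmsTable` is the
    // metastore Table object being altered as part of the commit.
    static void syncBasicStats(Table icebergTable,
                               org.apache.hadoop.hive.metastore.api.Table hmsTable) {
      if (icebergTable.currentSnapshot() == null) {
        return; // nothing committed yet
      }
      Map<String, String> summary = icebergTable.currentSnapshot().summary();
      Map<String, String> params = hmsTable.getParameters();
      params.put(StatsSetupConst.ROW_COUNT,                   // "numRows"
          summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));   // "total-records"
      params.put(StatsSetupConst.TOTAL_SIZE,                  // "totalSize"
          summary.get(SnapshotSummary.TOTAL_FILE_SIZE_PROP)); // "total-files-size"
    }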
Thanks,
Peter

> On Mar 4, 2021, at 09:43, Vivekanand Vellanki <vi...@dremio.com> wrote:
>
> Our concern is not specific to Iceberg. I am concerned about the memory
> requirement of caching a large number of splits.
>
> With Iceberg, estimating row counts when the query has predicates requires
> scanning the manifest list and manifest files to identify all the data files
> and compute the row-count estimates. While it is reasonable to cache these
> splits to avoid reading the manifest files twice, this increases the memory
> requirement. Also, query engines might want to handle the row-count
> estimation and split-generation phases separately: row-count estimation is
> required for the cost-based optimizer, while split generation can be done in
> parallel by reading manifest files in parallel.
>
> It would be good to decouple row-count estimation from split generation.
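The estimation described above amounts to planning a filtered scan and summing the per-file record counts that the manifests already store. A minimal sketch against the Iceberg scan API - the `event_ts` predicate is illustrative, and the sum is an upper bound since residual row-level filters are not applied:

    import java.io.IOException;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.io.CloseableIterable;

    // Row-count estimation under a predicate: plan the filtered scan and sum
    // the record counts the manifests keep per data file. This is the same
    // planFiles() pass that split generation needs, which is why caching its
    // output is tempting despite the memory cost.
    static long estimateRowCount(Table table) throws IOException {
      try (CloseableIterable<FileScanTask> tasks =
               table.newScan()
                    .filter(Expressions.greaterThanOrEqual("event_ts", "2021-03-01"))
                    .planFiles()) {
        long rows = 0;
        for (FileScanTask task : tasks) {
          rows += task.file().recordCount();
        }
        return rows;
      }
    }

Decoupling the two phases might mean exposing an estimate like this as its own lightweight call, while leaving planTasks() to generate the actual splits later, possibly in parallel.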
> On Wed, Mar 3, 2021 at 11:24 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> I agree with the concern about caching splits, but doesn't the API cause us
> to collect all of the splits into memory anyway? I thought there was no way
> to return splits as an `Iterator` that lazily loads them. If that's the
> case, then we primarily need to worry about cleanup and how long they are
> kept around.
>
> I think it is also fairly reasonable to do the planning twice to avoid the
> problem in Hive. Spark distributes the responsibility to each driver, so
> jobs are separate and don't affect one another. If this is happening on a
> shared Hive server endpoint, then we probably have more of a concern about
> memory consumption.
>
> Vivekanand, can you share more detail about how/where this is happening in
> your case?
>
> On Wed, Mar 3, 2021 at 7:53 AM Edgar Rodriguez
> <edgar.rodrig...@airbnb.com.invalid> wrote:
>
> On Wed, Mar 3, 2021 at 1:48 AM Peter Vary <pv...@cloudera.com.invalid> wrote:
>
> Quick question @Edgar: Am I right that the table is created by Spark? I
> think if it is created from Hive and the data is inserted from Hive, then we
> should already have the basic stats collected and should not need the
> estimation (we might still do it, but probably we should not).
>
> Yes, Spark creates the table. We don't write Iceberg tables with Hive.
>
> Also, we should check whether Hive expects the full size of the table or the
> size of the table after filters. If Hive collects this data by file
> scanning, I would expect that starting with the unfiltered raw size would be
> adequate.
>
> In this case Hive is performing the FS scan to find the raw size of the
> location to query. Since the table is unpartitioned (ICEBERG type), the
> location to query is the full table, because Hive is not aware of the
> Iceberg metadata. However, if the estimator is used, it is passed a
> TableScanOperator, which I assume could be used to gather some specific
> stats if they are present in the operator.
>
> Thanks,
> Peter
>
> On Wed, Mar 3, 2021 at 5:15 AM Vivekanand Vellanki <vi...@dremio.com> wrote:
>
> One of our concerns with caching the splits is the amount of memory required
> for this. If the filtering is not very selective and the table happens to be
> large, this increases the memory requirement to hold all the splits in
> memory.
>
> I agree with this - caching the splits would be a concern with memory
> consumption; even now, serializing/deserializing splits in Hive (probably a
> topic for another discussion) takes considerable time for a query producing
> ~3.5K splits.
>
> Cheers,
> --
> Edgar R
>
> --
> Ryan Blue
> Software Engineer
> Netflix
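For reference, on Ryan's question about returning splits lazily: the split-planning entry point Hive calls does hand back a fully materialized array, so there is indeed no lazy `Iterator` variant in that API. Excerpt of the Hadoop interface:

    // From org.apache.hadoop.mapred.InputFormat, the planning API Hive uses.
    // Every split is materialized into an array before planning returns; the
    // newer org.apache.hadoop.mapreduce API similarly returns a List, not a
    // lazily loaded Iterator.
    public interface InputFormat<K, V> {
      InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
      RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
          Reporter reporter) throws IOException;
    }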