Great! Thanks Peter for letting me know. I'll take a look. Best,
On Fri, Mar 12, 2021 at 10:14 AM Peter Vary <pv...@cloudera.com.invalid> wrote:

> Hi Edgar,
>
> You might want to take a look at this:
> https://github.com/apache/iceberg/pull/2329
>
> The PR aims to update Hive table statistics in the HiveCatalog when any
> change to the table is committed. This solves the issue with the upstream
> Hive code, and might solve the issue with other versions as well.
>
> Thanks,
> Peter
>
> On Mar 4, 2021, at 09:43, Vivekanand Vellanki <vi...@dremio.com> wrote:
>
> Our concern is not specific to Iceberg. I am concerned about the memory
> required to cache a large number of splits.
>
> With Iceberg, estimating row counts when the query has predicates
> requires scanning the manifest list and manifest files to identify all
> the data files and compute the row count estimates. While it is
> reasonable to cache these splits to avoid reading the manifest files
> twice, this increases the memory requirement. Also, query engines might
> want to handle row count estimation and split generation as separate
> phases: row count estimation is required for the cost-based optimizer,
> while split generation can be done in parallel by reading manifest
> files in parallel.
>
> It would be good to decouple row count estimation from split generation.
>
> On Wed, Mar 3, 2021 at 11:24 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> I agree with the concern about caching splits, but doesn't the API
>> cause us to collect all of the splits into memory anyway? I thought
>> there was no way to return splits as an `Iterator` that lazily loads
>> them. If that's the case, then we primarily need to worry about
>> cleanup and how long they are kept around.
>>
>> I think it is also fairly reasonable to do the planning twice to avoid
>> the problem in Hive. Spark distributes the responsibility to each
>> driver, so jobs are separate and don't affect one another. If this is
>> happening on a shared Hive server endpoint, then we probably have more
>> of a concern about memory consumption.
>>
>> Vivekanand, can you share more detail about how/where this is
>> happening in your case?
>>
>> On Wed, Mar 3, 2021 at 7:53 AM Edgar Rodriguez
>> <edgar.rodrig...@airbnb.com.invalid> wrote:
>>
>>> On Wed, Mar 3, 2021 at 1:48 AM Peter Vary <pv...@cloudera.com.invalid>
>>> wrote:
>>>
>>>> Quick question @Edgar: Am I right that the table is created by
>>>> Spark? I think that if it were created from Hive and the data were
>>>> inserted from Hive, then we would already have the basic stats
>>>> collected and would not need the estimation (we might still do it,
>>>> but probably we should not).
>>>
>>> Yes, Spark creates the table. We don't write Iceberg tables with Hive.
>>>
>>>> Also, we should check whether Hive expects the full size of the
>>>> table or the size of the table after filters. If Hive collects this
>>>> data by file scanning, I would expect that starting with the
>>>> unfiltered raw size would be adequate.
>>>
>>> In this case Hive performs the FS scan to find the raw size of the
>>> location to query. Since the table is unpartitioned (ICEBERG type),
>>> the location to query is the full table, because Hive is not aware of
>>> Iceberg metadata. However, if the estimator is used, it is passed a
>>> TableScanOperator, which I assume could be used to gather some
>>> specific stats if present in the operator.
>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Vivekanand Vellanki <vi...@dremio.com> wrote (on Wed, Mar 3, 2021
>>>> at 5:15):
>>>>
>>>>> One of our concerns with caching the splits is the amount of
>>>>> memory required for this. If the filtering is not very selective
>>>>> and the table happens to be large, this increases the memory
>>>>> required to hold all the splits in memory.
>>>
>>> I agree with this - caching the splits would be a concern for memory
>>> consumption; even now, serializing/deserializing splits (probably
>>> another topic for discussion) in Hive for a query producing ~3.5K
>>> splits takes considerable time.
>>>
>>> Cheers,
>>> --
>>> Edgar R
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Edgar R
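The decoupling Vivekanand proposes above (a cheap metadata-only row-count
estimate for the optimizer, separate from lazily generated splits so they are
never all held in memory) can be sketched roughly as follows. This is an
illustrative toy model only, not Iceberg's actual API: the `Manifest` and
`DataFile` structures and all function names here are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple

# Hypothetical, simplified stand-ins for Iceberg metadata -- not the real classes.
@dataclass
class DataFile:
    path: str
    record_count: int
    size_bytes: int

@dataclass
class Manifest:
    files: List[DataFile]

def matching_files(manifests: List[Manifest],
                   predicate: Callable[[DataFile], bool]) -> Iterator[DataFile]:
    """Walk manifest metadata, yielding data files that match the predicate."""
    for manifest in manifests:
        for f in manifest.files:
            if predicate(f):
                yield f

def estimate_row_count(manifests: List[Manifest],
                       predicate: Callable[[DataFile], bool]) -> int:
    """Phase 1: row-count estimate for the cost-based optimizer.
    Only per-file record counts from metadata are summed; no splits are built."""
    return sum(f.record_count for f in matching_files(manifests, predicate))

def plan_splits(manifests: List[Manifest],
                predicate: Callable[[DataFile], bool],
                split_size: int = 128 * 1024 * 1024) -> Iterator[Tuple[str, int, int]]:
    """Phase 2: lazily yield (path, offset, length) splits.
    Because this is a generator, the caller never holds every split at once,
    at the cost of re-reading the manifests if both phases are run."""
    for f in matching_files(manifests, predicate):
        offset = 0
        while offset < f.size_bytes:
            length = min(split_size, f.size_bytes - offset)
            yield (f.path, offset, length)
            offset += length

manifests = [
    Manifest([DataFile("a.parquet", 1000, 300 * 1024 * 1024),
              DataFile("b.parquet", 50, 10 * 1024 * 1024)]),
]
keep_all = lambda f: True
print(estimate_row_count(manifests, keep_all))           # 1050
print(sum(1 for _ in plan_splits(manifests, keep_all)))  # 4
```

The trade-off discussed in the thread shows up directly: running both phases
scans the manifests twice, while caching phase-2 output to avoid the rescan is
exactly the memory concern raised above.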