I agree with the concern about caching splits, but doesn't the API cause us
to collect all of the splits into memory anyway? I thought there was no way
to return splits as an `Iterator` that lazily loads them. If that's the
case, then we primarily need to worry about cleanup and how long they are
kept around.
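
For reference, this is the contract I'm thinking of (from
org.apache.hadoop.mapred, trimmed to the relevant method):

    public interface InputFormat<K, V> {
      // Returns a materialized array, so every split is already in
      // memory by the time planning returns; there is no lazy
      // Iterator variant in this API.
      InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
    }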

I think it is also fairly reasonable to do the planning twice to avoid the
caching problem in Hive. Spark distributes planning responsibility to each
job's driver, so jobs are separate and don't affect one another. If this is
happening on a shared Hive server endpoint, then memory consumption is more
of a concern.

Vivekanand, can you share more detail about how/where this is happening in
your case?

On Wed, Mar 3, 2021 at 7:53 AM Edgar Rodriguez
<edgar.rodrig...@airbnb.com.invalid> wrote:

> On Wed, Mar 3, 2021 at 1:48 AM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Quick question @Edgar: Am I right that the table is created by Spark? I
>> think if it were created from Hive and the data were inserted from Hive,
>> then the basic stats would already be collected and we should not need the
>> estimation (we might still do it, but probably we should not).
>>
>
> Yes, Spark creates the table. We don't write Iceberg tables with Hive.
>
>
>>
>> We should also check whether Hive expects the full size of the table or
>> the size of the table after filters. If Hive collects this data by file
>> scanning, I would expect it to be adequate to start with the unfiltered
>> raw size.
>>
>
> In this case Hive performs the FS scan to find the raw size of the
> location to query. Since the table is unpartitioned (ICEBERG type) and
> Hive is not aware of Iceberg metadata, the location to query is the full
> table. However, if the estimator is used, it is passed a
> TableScanOperator, which I assume could be used to gather some specific
> stats if present in the operator.
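>
> As a rough sketch (hypothetical; it assumes the estimator can get a
> handle to the Iceberg Table), the size could come from the snapshot
> summary that Iceberg already tracks, instead of an FS scan:
>
>     import org.apache.iceberg.Snapshot;
>     import org.apache.iceberg.Table;
>
>     // Hypothetical helper: read the total data size from Iceberg
>     // snapshot metadata rather than walking the table location.
>     static long estimateRawSize(Table table) {
>       Snapshot current = table.currentSnapshot();
>       if (current == null) {
>         return 0L; // empty table, no snapshot yet
>       }
>       // "total-files-size" is written into the snapshot summary by Iceberg
>       String totalSize = current.summary().get("total-files-size");
>       return totalSize != null ? Long.parseLong(totalSize) : 0L;
>     }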
>
>
>>
>> Thanks,
>> Peter
>>
>>
>> Vivekanand Vellanki <vi...@dremio.com> ezt írta (időpont: 2021. márc.
>> 3., Sze 5:15):
>>
>>> One of our concerns with caching the splits is the amount of memory
>>> required. If the filtering is not very selective and the table happens
>>> to be large, holding all of the splits in memory becomes expensive.
>>>
>>
> I agree with this - caching the splits would be a memory consumption
> concern; even now, serializing/deserializing splits in Hive (probably
> another topic for discussion) takes considerable time for a query that
> produces ~3.5K splits.
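>
> For a sense of where that time goes, this is roughly the round trip
> (a hypothetical micro-check; `splits` stands in for the planned splits):
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.DataOutputBuffer;
>     import org.apache.hadoop.mapred.InputSplit;
>
>     // Rough sketch: each split is written with Writable serialization,
>     // so ~3.5K splits means ~3.5K write() calls plus the matching reads.
>     static long serializedSize(InputSplit[] splits) throws IOException {
>       DataOutputBuffer out = new DataOutputBuffer();
>       for (InputSplit split : splits) {
>         split.write(out);
>       }
>       return out.getLength();
>     }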
>
> Cheers,
> --
> Edgar R
>


-- 
Ryan Blue
Software Engineer
Netflix
