Great! Thanks Peter for letting me know. I'll take a look. Best,
On Fri, Mar 12, 2021 at 10:14 AM Peter Vary <pv...@cloudera.com.invalid> wrote:

> Hi Edgar,
>
> You might want to take a look at this:
> https://github.com/apache/iceberg/pull/2329
>
> The PR aims to update Hive table statistics in the HiveCatalog when any
> change to the table is committed. This solves the issue with the upstream
> Hive code, and might solve the issue with other versions as well.
>
> Thanks,
> Peter
>
> On Mar 4, 2021, at 09:43, Vivekanand Vellanki <vi...@dremio.com> wrote:
>
> Our concern is not specific to Iceberg. I am concerned about the memory
> required to cache a large number of splits.
>
> With Iceberg, estimating row counts when the query has predicates
> requires scanning the manifest list and manifest files to identify all
> the data files and compute the row count estimates. While it is
> reasonable to cache these splits to avoid reading the manifest files
> twice, this increases the memory requirement. Also, query engines might
> want to handle row count estimation and split generation as separate
> phases: row count estimation is required for the cost-based optimizer,
> while split generation can be done in parallel by reading manifest
> files in parallel.
>
> It would be good to decouple row count estimation from split generation.
>
> On Wed, Mar 3, 2021 at 11:24 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> I agree with the concern about caching splits, but doesn't the API
>> cause us to collect all of the splits into memory anyway? I thought
>> there was no way to return splits as an `Iterator` that lazily loads
>> them. If that's the case, then we primarily need to worry about
>> cleanup and how long they are kept around.
>>
>> I think it is also fairly reasonable to do the planning twice to avoid
>> the problem in Hive. Spark distributes the responsibility to each
>> driver, so jobs are separate and don't affect one another. If this is
>> happening on a shared Hive server endpoint, then we probably have more
>> of a concern about memory consumption.
>>
>> Vivekanand, can you share more detail about how/where this is
>> happening in your case?
>>
>> On Wed, Mar 3, 2021 at 7:53 AM Edgar Rodriguez
>> <edgar.rodrig...@airbnb.com.invalid> wrote:
>>
>>> On Wed, Mar 3, 2021 at 1:48 AM Peter Vary <pv...@cloudera.com.invalid>
>>> wrote:
>>>
>>>> Quick question @Edgar: Am I right that the table is created by
>>>> Spark? I think that if it were created from Hive and the data were
>>>> inserted from Hive, then we would already have the basic stats
>>>> collected and would not need the estimation (we might still do it,
>>>> but probably we should not).
>>>
>>> Yes, Spark creates the table. We don't write Iceberg tables with Hive.
>>>
>>>> Also, we should check whether Hive expects the full size of the
>>>> table or the size of the table after filters. If Hive collects this
>>>> data by file scanning, I would expect that starting with the
>>>> unfiltered raw size would be adequate.
>>>
>>> In this case Hive performs the FS scan to find the raw size of the
>>> location to query. Since the table is unpartitioned (ICEBERG type),
>>> the location to query is the full table, because Hive is not aware of
>>> Iceberg metadata. However, if the estimator is used, it is passed a
>>> TableScanOperator, which I assume could be used to gather some
>>> specific stats if present in the operator.
>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Vivekanand Vellanki <vi...@dremio.com> wrote (on Wed, Mar 3, 2021
>>>> at 5:15):
>>>>
>>>>> One of our concerns with caching the splits is the amount of
>>>>> memory required for this. If the filtering is not very selective
>>>>> and the table happens to be large, this increases the memory
>>>>> required to hold all the splits in memory.
>>>
>>> I agree with this - caching the splits would be a concern for memory
>>> consumption; even now, serializing/deserializing splits (probably
>>> another topic for discussion) in Hive for a query producing ~3.5K
>>> splits takes considerable time.
>>>
>>> Cheers,
>>> --
>>> Edgar R
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Edgar R
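The decoupling Vivekanand proposes above (a cheap metadata-only row-count
estimate for the optimizer, separate from lazily generated splits so they are
never all held in memory) can be sketched roughly as follows. This is an
illustrative toy model only, not Iceberg's actual API: the `Manifest` and
`DataFile` structures and all function names here are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple

# Hypothetical, simplified stand-ins for Iceberg metadata -- not the real classes.
@dataclass
class DataFile:
    path: str
    record_count: int
    size_bytes: int

@dataclass
class Manifest:
    files: List[DataFile]

def matching_files(manifests: List[Manifest],
                   predicate: Callable[[DataFile], bool]) -> Iterator[DataFile]:
    """Walk manifest metadata, yielding data files that match the predicate."""
    for manifest in manifests:
        for f in manifest.files:
            if predicate(f):
                yield f

def estimate_row_count(manifests: List[Manifest],
                       predicate: Callable[[DataFile], bool]) -> int:
    """Phase 1: row-count estimate for the cost-based optimizer.
    Only per-file record counts from metadata are summed; no splits are built."""
    return sum(f.record_count for f in matching_files(manifests, predicate))

def plan_splits(manifests: List[Manifest],
                predicate: Callable[[DataFile], bool],
                split_size: int = 128 * 1024 * 1024) -> Iterator[Tuple[str, int, int]]:
    """Phase 2: lazily yield (path, offset, length) splits.
    Because this is a generator, the caller never holds every split at once,
    at the cost of re-reading the manifests if both phases are run."""
    for f in matching_files(manifests, predicate):
        offset = 0
        while offset < f.size_bytes:
            length = min(split_size, f.size_bytes - offset)
            yield (f.path, offset, length)
            offset += length

manifests = [
    Manifest([DataFile("a.parquet", 1000, 300 * 1024 * 1024),
              DataFile("b.parquet", 50, 10 * 1024 * 1024)]),
]
keep_all = lambda f: True
print(estimate_row_count(manifests, keep_all))           # 1050
print(sum(1 for _ in plan_splits(manifests, keep_all)))  # 4
```

The trade-off discussed in the thread shows up directly: running both phases
scans the manifests twice, while caching phase-2 output to avoid the rescan is
exactly the memory concern raised above.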