On Wed, Mar 3, 2021 at 1:48 AM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Quick question @Edgar: Am I right that the table is created by Spark? I
> think if it were created from Hive and the data inserted from Hive, then
> we should already have the basic stats collected and should not need the
> estimation (we might still do it, but probably should not).
>

Yes, Spark creates the table. We don't write Iceberg tables with Hive.
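
For reference, the table is created and populated from Spark roughly like
this (a minimal sketch; the session setup and the db.tbl identifier are
illustrative, and it assumes the Iceberg Spark runtime and catalog are
already configured):

    import org.apache.spark.sql.SparkSession;

    public class CreateIcebergTable {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("create-iceberg-table")
            .getOrCreate();

        // Unpartitioned Iceberg table, matching the case discussed below.
        spark.sql("CREATE TABLE db.tbl (id BIGINT, data STRING) USING iceberg");
        spark.sql("INSERT INTO db.tbl VALUES (1, 'a'), (2, 'b')");
      }
    }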


>
> Also, we should check whether Hive expects the full size of the table or
> the size of the table after filters. If Hive collects this data by file
> scanning, I would expect that starting with the unfiltered raw size would
> be adequate.
>

In this case Hive performs the FS scan to find the raw size of the
location to query. Since the table is unpartitioned (ICEBERG type) and
Hive is not aware of Iceberg metadata, the location to query is the full
table. However, if the estimator is used, it passes a TableScanOperator,
which I assume could be used to gather some specific stats if they are
present in the operator.
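
For example, the raw size could come straight from the Iceberg snapshot
summary instead of an FS scan, once the table handle is available (a
minimal sketch; the class and method names are mine, and it assumes the
Table is already loaded, e.g. via the Hive catalog):

    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.Table;

    public final class IcebergSizeEstimator {

      private IcebergSizeEstimator() {
      }

      // Returns the raw (unfiltered) total data size in bytes from the
      // current snapshot's summary, or -1 if unavailable. No FS scan needed.
      public static long rawTotalSize(Table table) {
        Snapshot snapshot = table.currentSnapshot();
        if (snapshot == null) {
          return -1L; // empty table, no snapshot yet
        }
        // "total-files-size" is the standard Iceberg snapshot summary key
        // (SnapshotSummary.TOTAL_FILE_SIZE_PROP).
        String size = snapshot.summary().get("total-files-size");
        return size == null ? -1L : Long.parseLong(size);
      }
    }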


>
> Thanks,
> Peter
>
>
> Vivekanand Vellanki <vi...@dremio.com> ezt írta (időpont: 2021. márc. 3.,
> Sze 5:15):
>
>> One of our concerns with caching the splits is the amount of memory
>> required. If the filtering is not very selective and the table happens
>> to be large, holding all the splits in memory becomes expensive.
>>
>
I agree with this: caching the splits would be a concern for memory
consumption. Even now, serializing/deserializing splits in Hive (probably
another topic for discussion) takes considerable time for a query
producing ~3.5K splits.
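
To give a feel for that cost, here is a minimal sketch that pushes ~3.5K
plain Hadoop FileSplits through the Writable path Hive uses per split (all
paths and sizes are made up, and real Iceberg splits carry serialized scan
tasks, so they are heavier; treat the numbers as a lower bound):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileSplit;

    public class SplitSerializationBench {
      public static void main(String[] args) throws IOException {
        int numSplits = 3500; // roughly the split count mentioned above
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        long start = System.nanoTime();
        for (int i = 0; i < numSplits; i++) {
          FileSplit split = new FileSplit(
              new Path("/warehouse/tbl/data/file-" + i + ".parquet"),
              0L, 128L * 1024 * 1024, new String[0]);
          split.write(out); // Writable serialization, one split at a time
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(numSplits + " splits -> " + bytes.size()
            + " bytes in " + elapsedMs + " ms");
      }
    }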

Cheers,
-- 
Edgar R
