One of our concerns with caching the splits is the amount of memory required to hold them. If the filtering is not very selective and the table happens to be large, keeping all of the splits in memory could get expensive.
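To illustrate the trade-off, here's a rough sketch of the alternative: computing the estimate in a single streaming pass over the planned files without retaining them, at the cost of planning the scan again later to produce splits. This is only illustrative - the class and method names are made up, and it assumes the Iceberg Java scan API (newScan().planFiles()):

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.io.CloseableIterable;

// Illustrative only: sum file-level metadata in one streaming pass so the full
// split list is never held in memory.
class ScanStatsEstimator {
  static long[] estimate(Table table, Expression filter) throws java.io.IOException {
    long totalRecords = 0L;
    long totalBytes = 0L;
    try (CloseableIterable<FileScanTask> tasks = table.newScan().filter(filter).planFiles()) {
      for (FileScanTask task : tasks) {
        totalRecords += task.file().recordCount();     // row count from file metadata
        totalBytes += task.file().fileSizeInBytes();   // data file size from file metadata
      }
    }
    return new long[] {totalRecords, totalBytes};
  }
}

The downside, as noted above, is that the scan gets planned twice unless we cache the result of planning.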
On Wed, Mar 3, 2021 at 6:56 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Yes, it sounds like we do want to at least support fake stats for Hive. It would be great to also base the stats on the actual table scan, since we can get fairly accurate stats after filters have been pushed down. In Spark, that's allowed us to convert more joins to broadcast joins, which are cheaper. The only concern is when and where the stats estimation is done, and whether we'd be able to cache the result of planning the scan so we don't plan a scan for stats estimation and then do it a second time to produce splits.
>
> On Tue, Mar 2, 2021 at 4:12 PM Edgar Rodriguez <edgar.rodrig...@airbnb.com.invalid> wrote:
>
>> After a bit of further digging, I found that the issue is related to Hive trying to find the input size (the Iceberg table) for the join at query planning time. Since HiveIcebergStorageHandler does not implement InputEstimator <https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2195>, Hive tries to estimate the input size the same way it would for a native Hive table: by scanning the FS, listing the paths recursively <https://github.com/apache/hadoop/blob/branch-2.8.5/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1489> and adding the file lengths - in the case of Iceberg tables it starts scanning from the table location, since the table is EXTERNAL and unpartitioned - as mentioned in the Hive Wiki <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903061#ConfigurationProperties-hive.fetch.task.conversion.threshold>:
>>
>>> If target table is native, input length is calculated by summation of file lengths. If it's not native, the storage handler for the table can optionally implement the org.apache.hadoop.hive.ql.metadata.InputEstimator interface.
>>
>> After adding the interface to the storage_handler and providing an implementation returning an Estimation(-1, -1) <https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/metadata/InputEstimator.Estimation.html#Estimation-int-long->, the query works successfully in the expected amount of time - maybe a better implementation could compute the actual estimation. I assume this is only an issue you hit when the underlying FS tree of the Iceberg table is large and traversing it takes a long time; otherwise Hive would most likely complete the FS traversal and the query would make progress.
>>
>> Should we make this change in the HiveIcebergStorageHandler?
>>
>> Cheers,
>>
>> On Tue, Mar 2, 2021 at 1:11 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>
>>> I have seen this kind of problem when the catalog was not configured for the table/session and we ended up using the default catalog instead of HiveCatalog.
>>>
>>> On Mar 2, 2021, at 18:49, Edgar Rodriguez <edgar.rodrig...@airbnb.com.INVALID> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to run a simple query in Hive 2.3.4 with a join of a Hive table and an Iceberg table, each configured accordingly - the Iceberg table has the `storage_handler` defined, and we're running with the MR engine.
>>>
>>> I'm using the `iceberg.mr.catalog.loader.class` class to load our internal catalog. In the logs I can see Hive loading the Iceberg table, but then I can see the Driver doing some traversal through the FS path under the table location, getting statuses for all data within the directory - this is not the behavior I see when querying an Iceberg table in Hive by itself, where I can see the splits being computed correctly. Due to this behavior, the query basically scans the full FS structure under the path - if that is large, the query looks like it's stuck, although I do see wire activity fetching the FS listings.
>>>
>>> The question is, has anyone experienced this behavior when querying Hive tables joined with Iceberg tables? If so, what's the best way to approach this?
>>>
>>> Best,
>>> --
>>> Edgar R
>>>
>>
>> --
>> Edgar R
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
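For completeness, here's roughly what the change Edgar describes could look like - a minimal sketch, not an actual patch, assuming the Hive 2.x InputEstimator API (estimate(JobConf, TableScanOperator, long) returning the nested Estimation(int, long) from the Javadoc linked above). Shown as a standalone class for brevity; the idea is for HiveIcebergStorageHandler to also implement this interface:

import org.apache.hadoop.hive.ql.exec.TableScanOperator;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.metadata.InputEstimator;
import org.apache.hadoop.mapred.JobConf;

// Sketch only: implementing InputEstimator tells Hive's planner to use this
// estimate instead of recursively listing the table location on the FS.
public class IcebergInputEstimatorSketch implements InputEstimator {

  @Override
  public InputEstimator.Estimation estimate(JobConf job, TableScanOperator ts, long remaining)
      throws HiveException {
    // (-1, -1) reports both row count and total length as unknown, which avoids
    // the recursive FS listing. A fuller implementation could instead sum record
    // counts and file sizes from the Iceberg table's metadata.
    return new InputEstimator.Estimation(-1, -1);
  }
}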