Hello, I'd like to propose a change to the way Hive table partition metadata are loaded and cached. Currently, when a user reads from a partitioned Hive table whose metadata are not cached (and for which Hive table conversion is enabled and supported), all partition metadata are fetched from the metastore:
https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260

This is highly inefficient in some scenarios. In the most extreme case, a user starts a new Spark app, runs a query that reads from a single partition of a table with a large number of partitions, and then terminates the app. All partition metadata are loaded and their files' schemas are merged, yet only a single partition is read. Instead, I propose we load and cache partition metadata on demand, as needed to build query plans.

We've long encountered this performance problem at VideoAmp and have taken different approaches to address it. Beyond the load time, we've found that loading all of a table's partition metadata can require a significant amount of JVM heap space. Our largest tables OOM our Spark drivers unless we allocate several GB of heap space. Certainly one could argue that our situation is pathological and rare, and that the problem in our scenario lies with the design of our tables rather than with Spark. However, even for tables with more modest numbers of partitions, loading only the necessary partition metadata and file schemas can significantly reduce query planning time, and it is definitely more memory efficient.

I've written POCs for a couple of different implementation approaches. Though incomplete, both have been successful in their basic goal.

The first extends `org.apache.spark.sql.catalyst.catalog.ExternalCatalog` and as such is more general. It requires some new abstractions and refactoring of `HadoopFsRelation` and `FileCatalog`, among others, and it places a greater burden on other implementations of `ExternalCatalog`. Currently the only other implementation of `ExternalCatalog` is `InMemoryCatalog`, and my code throws an `UnsupportedOperationException` on that implementation.

The other approach is simpler and only touches code in the codebase's `hive` project. Basically, conversion of `MetastoreRelation` to `HadoopFsRelation` is deferred to physical planning when the metastore relation is partitioned. During physical planning, the partition pruning filters in the logical query plan are used to select the required partition metadata, and a `HadoopFsRelation` is built from just those partitions. The new logical plan is then re-injected into the planner. Rough, untested sketches of both ideas are appended below my signature.

I'd like to get the community's thoughts on my proposal and implementation approaches.

Thanks!

Michael
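
P.S. To make the first approach more concrete, here is roughly the shape of interface addition I have in mind for `ExternalCatalog`. The trait and method name below are illustrative only, not what my POC actually adds:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative sketch: a partition-pruning-aware listing method that an
// ExternalCatalog implementation could provide. HiveExternalCatalog would push
// the predicates down to the metastore, while InMemoryCatalog could simply
// throw UnsupportedOperationException for now.
trait SupportsPartitionPruning {
  def listPartitionsByFilter(
      db: String,
      table: String,
      predicates: Seq[Expression]): Seq[CatalogTablePartition]
}
```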
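
And here is a rough, untested sketch of the second approach against the 2.0-era planner internals. The helpers `selectedPartitions` and `prunedLogicalRelation` are hypothetical placeholders for the "fetch only the matching partition metadata" and "build a `HadoopFsRelation` from those partitions" steps; the predicate splitting mirrors what the existing `HiveTableScans` strategy does:

```scala
// Would live in the hive project, since MetastoreRelation is private[hive].
package org.apache.spark.sql.hive

import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
import org.apache.spark.sql.catalyst.expressions.{And, AttributeSet, Expression}
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

object DeferredMetastoreConversion extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projects, filters, relation: MetastoreRelation)
        if relation.partitionKeys.nonEmpty =>
      // Split the filters into partition-pruning predicates (those referencing
      // only partition columns) and everything else.
      val partitionKeyIds = AttributeSet(relation.partitionKeys)
      val (pruningPredicates, otherPredicates) = filters.partition { predicate =>
        predicate.references.nonEmpty && predicate.references.subsetOf(partitionKeyIds)
      }

      // Hypothetical helper: fetch only the matching partition metadata.
      val partitions = selectedPartitions(relation, pruningPredicates)

      // Hypothetical helper: build a HadoopFsRelation (wrapped in a
      // LogicalRelation) over just those partitions' files.
      val pruned = prunedLogicalRelation(relation, partitions)

      // Re-inject the rewritten logical plan into the planner.
      val withFilters =
        otherPredicates.reduceOption(And).map(Filter(_, pruned)).getOrElse(pruned)
      planLater(Project(projects, withFilters)) :: Nil

    case _ => Nil
  }

  // Placeholder: push the pruning predicates to the metastore rather than
  // listing every partition of the table.
  private def selectedPartitions(
      relation: MetastoreRelation,
      pruningPredicates: Seq[Expression]): Seq[CatalogTablePartition] = ???

  // Placeholder: merge the file schemas of just these partitions and build a
  // HadoopFsRelation from their locations.
  private def prunedLogicalRelation(
      relation: MetastoreRelation,
      partitions: Seq[CatalogTablePartition]): LogicalRelation = ???
}
```

One detail the sketch glosses over: the rebuilt relation has to expose the same output attributes as the original `MetastoreRelation` so that the surrounding plan still resolves.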