[
https://issues.apache.org/jira/browse/SPARK-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reynold Xin closed SPARK-17179.
-------------------------------
Resolution: Duplicate
Fix Version/s: 2.1.0
> Consider improving partition pruning in HiveMetastoreCatalog
> ------------------------------------------------------------
>
> Key: SPARK-17179
> URL: https://issues.apache.org/jira/browse/SPARK-17179
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Rajesh Balamohan
> Fix For: 2.1.0
>
>
> Issue:
> - Create an external table with 1000s of partition
> - Running simple query with partition details ends up listing all files for
> caching in ListingFileCatalog. This would turn out to be very slow in cloud
> based FS access (e.g S3). Even though, ListingFileCatalog supports
> multi-threading, it would end up unncessarily listing 1000+ files when user
> is just interested in 1 partition.
> - This adds up additional overhead in HiveMetastoreCatalog as it queries all
> partitions in convertToLogicalRelation
> (metastoreRelation.getHiveQlPartitions()). Partition related details
> are not passed in here, so ends up overloading hive metastore.
> - Also even if any partition changes, cache would be dirtied and have to be
> re-populated. It would be nice to prune the partitions in metastore layer
> itself, so that few partitions are looked up via FileSystem and only few
> items are cached.
> {noformat}
> "CREATE EXTERNAL TABLE `ca_par_ext`(
> `customer_id` bigint,
> `account_id` bigint)
> PARTITIONED BY (
> `effective_date` date)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 's3a://bucket_details/ca_par'"
> explain select count(*) from ca_par_ext where effective_date between
> '2015-12-17' and '2015-12-18';
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]