[ 
https://issues.apache.org/jira/browse/SPARK-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-17179.
-------------------------------
       Resolution: Duplicate
    Fix Version/s: 2.1.0

> Consider improving partition pruning in HiveMetastoreCatalog
> ------------------------------------------------------------
>
>                 Key: SPARK-17179
>                 URL: https://issues.apache.org/jira/browse/SPARK-17179
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Rajesh Balamohan
>             Fix For: 2.1.0
>
>
> Issue:
> - Create an external table with 1000s of partition
> - Running simple query with partition details ends up listing all files for 
> caching in ListingFileCatalog.  This would turn out to be very slow in cloud 
> based FS access (e.g S3). Even though, ListingFileCatalog supports
> multi-threading, it would end up unncessarily listing 1000+ files when user 
> is just interested in 1 partition.
> - This adds up additional overhead in HiveMetastoreCatalog as it queries all 
> partitions in convertToLogicalRelation 
> (metastoreRelation.getHiveQlPartitions()).  Partition related details
> are not passed in here, so ends up overloading hive metastore.
> - Also even if any partition changes, cache would be dirtied and have to be 
> re-populated.  It would be nice to prune the partitions in metastore layer 
> itself, so that few partitions are looked up via FileSystem and only few 
> items are cached.
> {noformat}
> "CREATE EXTERNAL TABLE `ca_par_ext`(
>   `customer_id` bigint,
>   `account_id` bigint)
> PARTITIONED BY (
>   `effective_date` date)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   's3a://bucket_details/ca_par'"
> explain select count(*) from ca_par_ext where effective_date between 
> '2015-12-17' and '2015-12-18';
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to