Hi,

Just out of curiosity, did you start the Spark session using
enableHiveSupport()? Or are you creating the table using Spark?
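If not, here is a minimal sketch of what I mean (the app name is just a
placeholder); spark.sql only resolves tables through the Hive metastore when
Hive support is enabled on the session:

```
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable Hive support so spark.sql can resolve
// Hive metastore tables ("fact-request-query" is a placeholder).
val spark = SparkSession.builder()
  .appName("fact-request-query")
  .enableHiveSupport()
  .getOrCreate()
```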
Regards,
Gourav

On Wed, Aug 7, 2019 at 3:28 PM Hao Ren <inv...@gmail.com> wrote:

> Hi,
>
> I am using Spark SQL 2.3.3 to read a Hive table which is partitioned by
> day, hour, platform, request_status and is_sampled. The underlying data is
> in Parquet format on HDFS.
>
> Here is the SQL query to read just *one partition*:
>
> ```
> spark.sql("""
>   SELECT rtb_platform_id, SUM(e_cpm)
>   FROM raw_logs.fact_request
>   WHERE day = '2019-08-01'
>     AND hour = '00'
>     AND platform = 'US'
>     AND request_status = '3'
>     AND is_sampled = 1
>   GROUP BY rtb_platform_id
> """).show
> ```
>
> However, from the Spark web UI, the stage description shows:
>
> ```
> Listing leaf files and directories for 201616 paths:
> viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0,
> ...
> ```
>
> It seems the job is listing all of the partitions of the table, which
> takes far too long for just one partition. One workaround is to use the
> `spark.read.parquet` API to read the Parquet files directly, since Spark
> has partition-awareness for partitioned directories.
>
> But still, I would like to know if there is a way to leverage this
> partition-awareness via Hive by using the `spark.sql` API?
>
> Any help is highly appreciated!
>
> Thank you.
>
> --
> Hao Ren
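For reference, a minimal sketch of the `spark.read.parquet` workaround Hao
mentions above, assuming the directory layout shown in the listing output;
the exact path and the use of the basePath option are assumptions, not taken
from the original thread:

```
import org.apache.spark.sql.functions.sum

// Read only the single partition directory; setting basePath keeps the
// partition columns (day, hour, ...) available in the schema. The path
// is assembled from the listing output and the query's filter values.
val onePartition = spark.read
  .option("basePath", "viewfs://root/user/bilogs/logs/fact_request")
  .parquet("viewfs://root/user/bilogs/logs/fact_request/" +
    "day=2019-08-01/hour=00/platform=US/request_status=3/is_sampled=1")

onePartition
  .groupBy("rtb_platform_id")
  .agg(sum("e_cpm"))
  .show()
```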