Hi,

Just out of curiosity, did you start the Spark session using
enableHiveSupport()?

Or are you creating the table using Spark?

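For reference, this is roughly what I mean (a minimal sketch, assuming a
Scala application on Spark 2.3.x; the app name is illustrative only):

```
// Sketch: a session built with Hive support, so that spark.sql()
// resolves tables through the Hive metastore. App name is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fact-request-query")
  .enableHiveSupport()
  .getOrCreate()
```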

Regards,
Gourav

On Wed, Aug 7, 2019 at 3:28 PM Hao Ren <inv...@gmail.com> wrote:

> Hi,
> I am using Spark SQL 2.3.3 to read a hive table which is partitioned by
> day, hour, platform, request_status and is_sampled. The underlying data is
> in parquet format on HDFS.
> Here is the SQL query to read just *one partition*.
>
> ```
> spark.sql("""
> SELECT rtb_platform_id, SUM(e_cpm)
> FROM raw_logs.fact_request
> WHERE day = '2019-08-01'
> AND hour = '00'
> AND platform = 'US'
> AND request_status = '3'
> AND is_sampled = 1
> GROUP BY rtb_platform_id
> """).show
> ```
>
> However, from the Spark web UI, the stage description shows:
>
> ```
> Listing leaf files and directories for 201616 paths:
> viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0,
> ...
> ```
>
> It seems the job is listing all of the table's partitions, so it takes far
> too long for a query that only touches one partition. One workaround is to
> use the `spark.read.parquet` API to read the parquet files directly, since
> Spark is partition-aware for partitioned directories (sketched below).
>
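> For completeness, this is roughly what that workaround looks like (a
> sketch; the table root path is inferred from the listing above, so treat
> it as illustrative, and the filters mirror the SQL query):
>
> ```
> // Sketch of the spark.read.parquet workaround; filters on partition
> // columns let Spark prune the partition directories it scans.
> import org.apache.spark.sql.functions.sum
> import spark.implicits._
>
> spark.read
>   .parquet("viewfs://root/user/bilogs/logs/fact_request")
>   .filter($"day" === "2019-08-01" && $"hour" === "00" &&
>           $"platform" === "US" && $"request_status" === "3" &&
>           $"is_sampled" === 1)
>   .groupBy("rtb_platform_id")
>   .agg(sum("e_cpm"))
>   .show()
> ```
>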
> But still, I would like to know whether there is a way to leverage
> partition-awareness via Hive when using the `spark.sql` API.
>
> Any help is highly appreciated!
>
> Thank you.
>
> --
> Hao Ren
>
