Re: optimal way to load parquet files with partition

Michael Armbrust Tue, 02 Feb 2016 10:14:21 -0800

It depends how many partitions you have and if you are only doing a single
operation.  Loading all the data and filtering will require us to scan the
directories to discover all the months.  This information will be cached.
Then we should prune and avoid reading unneeded data.


Option 1 does not require this scan, but is more work for the developer.

On Tue, Feb 2, 2016 at 10:07 AM, Wei Chen <wei.chen.ri...@gmail.com> wrote:

> Hi All,
>
> I have data partitioned by year=yyyy/month=mm/day=dd, what is the best way
> to get two months of data from a given year (let's say June and July)?
>
> Two ways I can think of:
> 1. use unionAll
> df1 = sqc.read.parquet('xxx/year=2015/month=6')
> df2 = sqc.read.parquet('xxx/year=2015/month=7')
> df = df1.unionAll(df2)
>
> 2. use filter after load the whole year
> df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')
>
> Which of the above is better? Or are there better ways to handle this?
>
>
> Thank you,
> Wei
>

Re: optimal way to load parquet files with partition

Reply via email to