It depends on how many partitions you have and whether this is the only operation you are running. Loading all the data and filtering will require us to scan the directories to discover all the months. This information will then be cached. After that, the filter should prune partitions so that we avoid reading unneeded data.
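For reference, the load-then-filter approach might look like the sketch below. The helper names `month_filter` and `load_months_by_filter` are made up for illustration, and `sqc` is assumed to be an existing SQLContext as in the original question.

```python
def month_filter(months):
    """Build a SQL predicate on the partition column, e.g. 'month IN (6, 7)'."""
    months = sorted(set(int(m) for m in months))
    if not months:
        raise ValueError("at least one month is required")
    return "month IN ({})".format(", ".join(str(m) for m in months))


def load_months_by_filter(sqc, year_path, months):
    """Read one year directory and keep only the requested months.

    Assumption: `sqc` is an already-created SQLContext and `year_path`
    points at a directory containing month=.../day=... subdirectories.
    The partition directories are discovered first; the filter on the
    partition column is then what allows pruning.
    """
    return sqc.read.parquet(year_path).filter(month_filter(months))
```

Usage would be something like `df = load_months_by_filter(sqc, 'xxx/year=2015/', [6, 7])`. Calling `df.explain()` afterwards can help check what the plan actually reads.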
Option 1 does not require this scan, but is more work for the developer.

On Tue, Feb 2, 2016 at 10:07 AM, Wei Chen <wei.chen.ri...@gmail.com> wrote:

> Hi All,
>
> I have data partitioned by year=yyyy/month=mm/day=dd, what is the best way
> to get two months of data from a given year (let's say June and July)?
>
> Two ways I can think of:
> 1. use unionAll
> df1 = sqc.read.parquet('xxx/year=2015/month=6')
> df2 = sqc.read.parquet('xxx/year=2015/month=7')
> df = df1.unionAll(df2)
>
> 2. use filter after load the whole year
> df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')
>
> Which of the above is better? Or are there better ways to handle this?
>
>
> Thank you,
> Wei
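If the extra developer work of option 1 is the concern, one possible way to reduce it is to pass all the month directories to a single read instead of calling unionAll by hand. This is only a sketch: `month_paths` and `load_months_by_path` are hypothetical helper names, the directory names are assumed to be unpadded (month=6, as in the question), and the `basePath` reader option is assumed to be available (Spark 1.6 or later, as far as I know).

```python
def month_paths(base, year, months):
    """Return the partition directories for the given months of one year."""
    base = base.rstrip("/")
    return ["{}/year={}/month={}".format(base, year, int(m)) for m in months]


def load_months_by_path(sqc, base, year, months):
    """Read only the listed month directories in one call.

    Assumption: `sqc` is an existing SQLContext. Setting basePath to the
    table root is meant to keep `year` and `month` as columns in the
    result; without it, partition discovery starts at each given path
    and those two columns would be missing.
    """
    paths = month_paths(base, year, months)
    return sqc.read.option("basePath", base).parquet(*paths)
```

For example, `df = load_months_by_path(sqc, 'xxx', 2015, [6, 7])` would read only the June and July directories.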