Hi,

I've got a bunch of data stored in S3 under directories like this:

s3n://blah/y=2015/m=01/d=25/lots-of-files.csv

In Hive, if I issue a query with WHERE y=2015 AND m=01, I get the benefit that
it only scans the matching partition directories for files to read.

As far as I can tell from searching and reading the docs, the right way to
load this data into Spark is to use sc.textFile("s3n://blah/*/*/*/"), along
the lines of the sketch below.
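For concreteness, the naive version looks roughly like this (a minimal sketch
from the spark-shell; the column split is made up, since the details of the
CSVs don't matter here):

    // Glob every partition directory under the bucket and split the CSV lines.
    val lines = sc.textFile("s3n://blah/*/*/*/")
    val rows  = lines.map(_.split(","))

    // Note that y, m and d appear nowhere in these records - they only
    // exist in the directory names the files came from.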

1) Is there any way in Spark to access y, m and d as fields? In Hive, you
declare them in the schema, but you don't put them in the CSV files - their
values are extracted from the path.
2) Is there any way to get Spark to use the y, m and d fields to minimise
the number of files it transfers from S3?
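
To make the two questions concrete, the best I've managed by hand is
something like the sketch below (partition directories hard-coded purely for
illustration, and wholeTextFiles is probably the wrong tool since it reads
each file whole) - which is roughly what I'm hoping Spark can do for me
automatically:

    // Only list the directories for the partitions I actually want,
    // rather than globbing the whole bucket.
    val paths = Seq(
      "s3n://blah/y=2015/m=01/d=25/",
      "s3n://blah/y=2015/m=01/d=26/")

    // wholeTextFiles gives (path, fileContents) pairs, so the partition
    // values can be scraped back out of the path with a regex.
    val partPattern = """y=(\d+)/m=(\d+)/d=(\d+)""".r

    val rows = sc.union(paths.map(p => sc.wholeTextFiles(p)))
      .flatMap { case (path, contents) =>
        partPattern.findFirstMatchIn(path).toList.flatMap { g =>
          val (y, m, d) = (g.group(1), g.group(2), g.group(3))
          // Tag every CSV line from this file with its partition values.
          contents.split("\n").map(line => (y, m, d, line.split(",")))
        }
      }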

Thanks,

Danny.
