You can create a partitioned Hive table using Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
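
For example, something along these lines (an untested sketch against a Spark build with Hive support, assuming the spark-shell's `sc`; the table name "logs" and the two data columns are placeholders for your actual CSV schema):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Declare an external table over the existing S3 layout. The partition
    // columns y, m and d are not stored in the files themselves; Hive derives
    // them from the directory names (y=2015/m=01/d=25).
    hiveContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS logs (col1 STRING, col2 STRING)
      PARTITIONED BY (y INT, m INT, d INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 's3n://blah/'
    """)

    // Register each existing directory as a partition.
    hiveContext.sql(
      "ALTER TABLE logs ADD IF NOT EXISTS PARTITION (y=2015, m=1, d=25) " +
      "LOCATION 's3n://blah/y=2015/m=01/d=25/'")

    // y, m and d now behave like ordinary columns, and predicates on them
    // prune partitions, so only the matching directories are read from S3.
    val january = hiveContext.sql("SELECT * FROM logs WHERE y = 2015 AND m = 1")

That should cover both of your questions: the partition columns become queryable fields, and filtering on them limits which S3 directories Spark reads.
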
On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates <da...@codeaholics.org> wrote:
> Hi,
>
> I've got a bunch of data stored in S3 under directories like this:
>
> s3n://blah/y=2015/m=01/d=25/lots-of-files.csv
>
> In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that
> it only scans the necessary directories for files to read.
>
> As far as I can tell from searching and reading the docs, the right way of
> loading this data into Spark is to use sc.textFile("s3n://blah/*/*/*/")
>
> 1) Is there any way in Spark to access y, m and d as fields? In Hive, you
> declare them in the schema, but you don't put them in the CSV files - their
> values are extracted from the path.
> 2) Is there any way to get Spark to use the y, m and d fields to minimise
> the files it transfers from S3?
>
> Thanks,
>
> Danny.
>