If the wildcard path you have doesn't work, you should probably open a bug report -- I had a similar problem with Parquet, and it turned out to be a bug that was recently fixed. I'm not sure whether sqlContext.avroFile shares a code path with .parquetFile; you could try running with a build that includes the fix for .parquetFile, or look at the source. Here was my question, for reference: http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E
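In the meantime, one workaround might be to load each file individually and union the results into a single SchemaRDD before registering the temp table. Rough sketch, assuming spark-avro's sqlContext.avroFile and placeholder paths (you'd presumably enumerate the real ones from S3 rather than hard-coding them):

import com.databricks.spark.avro._

// Placeholder paths -- hypothetical; in practice you'd list the
// files out of S3 instead of writing them inline.
val paths = Seq(
  "s3://my-bucket/avros/2014/DATE/part-00000.avro",
  "s3://my-bucket/avros/2015/DATE/part-00001.avro"
)

// Load each file as a SchemaRDD, then union them into one.
// unionAll needs compatible schemas, which should hold since all
// the files share the fields you care about.
val records = paths
  .map(p => sqlContext.avroFile(p))
  .reduce(_ unionAll _)

records.registerTempTable("data")

No idea how well that scales to tens of thousands of files, though -- each avroFile call does its own setup, so getting the wildcard path working is probably still the better long-term answer.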
On Wed, Jan 14, 2015 at 4:34 AM, David Jones <letsnumsperi...@gmail.com> wrote:
> Hi,
>
> I have a program that loads a single Avro file using Spark SQL, queries
> it, transforms it, and then outputs the data. The file is loaded with:
>
> val records = sqlContext.avroFile(filePath)
> val data = records.registerTempTable("data")
> ...
>
> Now I want to run it over tens of thousands of Avro files (all with
> schemas that contain the fields I'm interested in).
>
> Is it possible to load multiple Avro files recursively from a top-level
> directory using wildcards? All my Avro files are stored under
> s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
> these on EMR.
>
> If that's not possible, is there some way to load multiple Avro files into
> the same table/RDD so the whole dataset can be processed? (In that case
> I'd supply paths to each file concretely, but I *really* don't want to have
> to do that.)
>
> Thanks
> David