I'd open an issue on the GitHub repository asking us to support Hadoop's glob file format for the path.
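
Until that is supported, one possible workaround is to load each directory separately and union the results. This is only a rough sketch: `loadAll` is a hypothetical helper (not part of Spark or spark-avro), the bucket paths are placeholders, and the commented usage assumes the spark-avro implicits are in scope and that every directory has a compatible schema.

```scala
// Hypothetical helper, not part of Spark or spark-avro: load each path
// with `load`, then combine the per-path results pairwise with `union`.
def loadAll[T](paths: Seq[String])(load: String => T)(union: (T, T) => T): T = {
  require(paths.nonEmpty, "at least one path is required")
  paths.map(load).reduce(union)
}

// Assumed usage (placeholder directory paths, spark-avro implicits imported):
//   import com.databricks.spark.avro._
//   val dirs = Seq("s3://my-bucket/avros/a", "s3://my-bucket/avros/b")
//   val records = loadAll(dirs)(path => sqlContext.avroFile(path))(_ unionAll _)
//   records.registerTempTable("data")
```

Passing the loader and the union as functions keeps the helper independent of any Spark version; with tens of thousands of inputs it is probably better to union per directory rather than per file.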

On Thu, Jan 15, 2015 at 4:57 AM, David Jones <letsnumsperi...@gmail.com> wrote:

> I've tried this now. Spark can load multiple avro files from the same
> directory by passing a path to a directory. However, passing multiple paths
> separated with commas didn't work.
>
> Is there any way to load all avro files in multiple directories using
> sqlContext.avroFile?
>
> On Wed, Jan 14, 2015 at 3:53 PM, David Jones <letsnumsperi...@gmail.com>
> wrote:
>
>> Should I be able to pass multiple paths separated by commas? I haven't
>> tried but didn't think it'd work. I'd expected a function that accepted a
>> list of strings.
>>
>> On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska <yana.kadiy...@gmail.com>
>> wrote:
>>
>>> If the wildcard path you have doesn't work you should probably open a
>>> bug -- I had a similar problem with Parquet and it was a bug which recently
>>> got closed. Not sure if sqlContext.avroFile shares a codepath with
>>> .parquetFile...you can try running with bits that have the fix for
>>> .parquetFile or look at the source...
>>> Here was my question for reference:
>>>
>>> http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E
>>>
>>> On Wed, Jan 14, 2015 at 4:34 AM, David Jones <letsnumsperi...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a program that loads a single avro file using spark SQL, queries
>>>> it, transforms it and then outputs the data. The file is loaded with:
>>>>
>>>> val records = sqlContext.avroFile(filePath)
>>>> val data = records.registerTempTable("data")
>>>> ...
>>>>
>>>> Now I want to run it over tens of thousands of Avro files (all with
>>>> schemas that contain the fields I'm interested in).
>>>>
>>>> Is it possible to load multiple avro files recursively from a top-level
>>>> directory using wildcards? All my avro files are stored under
>>>> s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
>>>> these on EMR.
>>>>
>>>> If that's not possible, is there some way to load multiple avro files
>>>> into the same table/RDD so the whole dataset can be processed (and in that
>>>> case I'd supply paths to each file concretely, but I *really* don't want to
>>>> have to do that).
>>>>
>>>> Thanks
>>>> David