sc.wholeTextFiles() <http://spark.apache.org/docs/latest/api/python/pyspark.context.SparkContext-class.html#wholeTextFiles> will get you close: it gives you an RDD of (path, content) pairs, one per file. Alternatively, you could write a loop with plain sc.textFile() that loads all the files under each batch into a separate RDD.
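
Something like this (untested) PySpark sketch should get you the grouping you describe; the root path and batch names are assumptions based on your layout, and sc is the SparkContext you get in the PySpark shell:

import os

# Hypothetical common root; point this at your actual location (HDFS/S3/local).
root = "hdfs:///data"

# wholeTextFiles() returns (filePath, fileContent) pairs, one per file.
files = sc.wholeTextFiles(root + "/batch-*")

# Key each file by its batch directory name, then gather each batch's
# contents into one list: ("batch-1", [content1, content2, ...])
by_batch = (files
            .map(lambda kv: (os.path.basename(os.path.dirname(kv[0])), kv[1]))
            .groupByKey()
            .mapValues(list))

And the loop variant, if you'd rather keep each batch as its own RDD (batch names assumed known up front):

batch_rdds = {}
for name in ["batch-1", "batch-2", "batch-3"]:  # hypothetical list
    batch_rdds[name] = sc.textFile("{0}/{1}/*.txt".format(root, name))

One caveat: groupByKey() pulls all of a batch's file contents onto a single executor, so the first approach works best when each batch fits comfortably in memory.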
On Sun, Jun 1, 2014 at 4:40 PM, Oleg Proudnikov <oleg.proudni...@gmail.com> wrote:
> I have a large number of directories under a common root:
>
> batch-1/file1.txt
> batch-1/file2.txt
> batch-1/file3.txt
> ...
> batch-2/file1.txt
> batch-2/file2.txt
> batch-2/file3.txt
> ...
> batch-N/file1.txt
> batch-N/file2.txt
> batch-N/file3.txt
> ...
>
> I would like to read them into an RDD like
>
> {
>   "batch-1" : [ content1, content2, content3, ... ]
>   "batch-2" : [ content1, content2, content3, ... ]
>   ...
>   "batch-N" : [ content1, content2, content3, ... ]
> }
>
> Thank you,
> Oleg
>
>
> On 1 June 2014 17:00, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Could you provide an example of what you mean?
>>
>> I know it's possible to create an RDD from a path with wildcards, like in
>> the subject.
>>
>> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
>> provide a comma-delimited list of paths.
>>
>> Nick
>>
>> On Sunday, June 1, 2014, Oleg Proudnikov <oleg.proudni...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Is it possible to create an RDD from a directory tree of the following
>>> form?
>>>
>>> RDD[(PATH, Seq[TEXT])]
>>>
>>> Thank you,
>>> Oleg
>>>
>
> --
> Kind regards,
>
> Oleg