sc.wholeTextFiles()
<http://spark.apache.org/docs/latest/api/python/pyspark.context.SparkContext-class.html#wholeTextFiles>
will get you close. Alternatively, you could write a loop with plain
sc.textFile() that loads all the files under each batch into a separate RDD.
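Untested sketch, assuming the files sit under a local "root/" directory
(the root path and batch names below are placeholders):

    import os
    from pyspark import SparkContext

    sc = SparkContext()

    # wholeTextFiles yields one (path, content) pair per file.
    pairs = sc.wholeTextFiles("root/batch-*/*.txt")

    # Key each file by its enclosing batch directory, then collect the
    # contents of each batch into a list.
    by_batch = (pairs
                .map(lambda kv: (os.path.basename(os.path.dirname(kv[0])),
                                 kv[1]))
                .groupByKey()
                .mapValues(list))

    # The textFile() alternative: one RDD per batch.
    batches = {}
    for d in ["batch-1", "batch-2"]:  # placeholder batch names
        batches[d] = sc.textFile("root/%s/*.txt" % d)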


On Sun, Jun 1, 2014 at 4:40 PM, Oleg Proudnikov <oleg.proudni...@gmail.com>
wrote:

> I have a large number of directories under a common root:
>
> batch-1/file1.txt
> batch-1/file2.txt
> batch-1/file3.txt
> ...
> batch-2/file1.txt
> batch-2/file2.txt
> batch-2/file3.txt
> ...
> batch-N/file1.txt
> batch-N/file2.txt
> batch-N/file3.txt
> ...
>
> I would like to read them into an RDD like
>
> {
> "batch-1" : [ content1, content2, content3,...]
> "batch-2" : [ content1, content2, content3,...]
> ...
> "batch-N" : [ content1, content2, content3,...]
> }
>
> Thank you,
> Oleg
>
>
>
> On 1 June 2014 17:00, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Could you provide an example of what you mean?
>>
>> I know it's possible to create an RDD from a path with wildcards, like in
>> the subject.
>>
>> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
>> provide a comma-delimited list of paths.
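>>
>> A quick, untested illustration (the bucket and prefixes are made up):
>>
>>     paths = ",".join("s3n://bucket/batch-%d/*.gz" % i for i in range(1, 4))
>>     rdd = sc.textFile(paths)  # one RDD spanning all three globs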
>>
>> Nick
>>
>> On Sunday, June 1, 2014, Oleg Proudnikov <oleg.proudni...@gmail.com> wrote:
>>
>> Hi All,
>>>
>>> Is it possible to create an RDD from a directory tree of the following
>>> form?
>>>
>>> RDD[(PATH, Seq[TEXT])]
>>>
>>> Thank you,
>>> Oleg
>>>
>>>
>
>
> --
> Kind regards,
>
> Oleg
>
>
