Re: multiple hdfs folder & files input to PySpark

Ai He Tue, 05 May 2015 23:09:04 -0700

Hi Oleg,

For 1, RDD#union will help. You can iterate over the folders, load each one 
with textFile, and union the resulting RDDs as you go.
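
A minimal sketch of that loop, assuming an existing SparkContext is passed in 
as sc. The helper names read_folders and read_folders_single_call, and the 
folder paths in the usage note, are made up for illustration. The second 
helper relies on textFile accepting a comma-separated list of paths, which is 
my understanding of how the underlying Hadoop input format treats the path 
string, so please verify it on your version.

```python
from functools import reduce


def read_folders(sc, folders):
    """Load each folder with sc.textFile and union the RDDs into one."""
    if not folders:
        raise ValueError("at least one folder is required")
    # One RDD per folder; textFile on a folder reads every file inside it.
    rdds = [sc.textFile(folder) for folder in folders]
    # RDD.union combines two RDDs, so fold the list pairwise into one.
    return reduce(lambda left, right: left.union(right), rdds)


def read_folders_single_call(sc, folders):
    """Alternative (assumption): pass all folders as one comma-separated string."""
    return sc.textFile(",".join(folders))
```

Usage would look something like 
lines = read_folders(sc, ["hdfs:///data/day1", "hdfs:///data/day2"]), 
where both paths are placeholders for your own day folders.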


For 2, the parameter is minPartitions, a suggested minimum number of 
partitions. It seems it won’t work in a deterministic way, according to this 
discussion 
(http://stackoverflow.com/questions/24871044/in-spark-what-does-the-parameter-minpartitions-works-in-sparkcontext-textfile).
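
If you want to see what you actually get, a rough sketch like the one below 
may help. It assumes an existing SparkContext passed in as sc; the helper name 
load_with_partition_hint is made up for illustration. It passes the value as 
the minPartitions argument of textFile and then asks the RDD how many 
partitions were really created, since the two numbers are not guaranteed to 
match.

```python
def load_with_partition_hint(sc, path, min_partitions):
    """Read a text file with a partition hint and report the real count."""
    # minPartitions is only a suggested minimum; the input splits decide
    # how many partitions are actually produced.
    lines = sc.textFile(path, minPartitions=min_partitions)
    # getNumPartitions returns the number of partitions the RDD ended up with.
    return lines, lines.getNumPartitions()
```

Comparing the returned count against the value you passed in, on your own 
input, is the easiest way to check how the hint behaves for your files.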

Thanks
> On May 5, 2015, at 5:59 AM, Oleg Ruchovets <[email protected]> wrote:
> 
> Hi 
>    We are using pyspark 1.3 and input is text files located on hdfs.
> 
> file structure 
>     <day1>
>                 file1.txt
>                 file2.txt
>     <day2>
>                 file1.txt
>                 file2.txt
>      ...
> 
> Question:
> 
>    1) What is the way to provide as an input for PySpark job  multiple files 
> which located in Multiple folders (on hdfs).
> Using textFile method works fine for single file or folder , but how can I do 
> it using multiple folders?
> Is there a way to pass array , list of files?
>    
>    2) What is the meaning of partition parameter in textFile method?
> 
>   sc = SparkContext(appName="TAD")
>   lines = sc.textFile(<my input>, 1)
> 
> Thanks
> Oleg.


