Hi Oleg,

For 1, RDD#union will help. You can iterate over the folders, read each one with textFile, and union the resulting RDDs as you go.
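
A minimal sketch of the union approach for 1, in case it helps. It assumes `sc` is an already created SparkContext; the helper name `read_folders` and the folder paths in the usage comment are hypothetical placeholders, not anything from your setup.

```python
def read_folders(sc, folders):
    """Read each folder with textFile and union the results into one RDD.

    `sc` is assumed to be an existing SparkContext; `folders` is a list of
    path strings (for example HDFS directories).
    """
    if not folders:
        raise ValueError("at least one folder is required")
    # One RDD per folder; textFile on a directory reads every file in it.
    rdds = [sc.textFile(folder) for folder in folders]
    # SparkContext.union combines a list of RDDs into a single RDD.
    return sc.union(rdds)


# Usage, assuming a live SparkContext named sc (paths are hypothetical):
# lines = read_folders(sc, ["hdfs:///data/day1", "hdfs:///data/day2"])
```

As far as I know, textFile also accepts a comma-separated list of paths, e.g. `sc.textFile(",".join(folders))`, which would avoid the explicit union, but please verify that on your version.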

For 2, the second argument of textFile is minPartitions. It is only a suggested minimum number of partitions, so it seems the resulting partition count is not deterministic, according to this discussion (http://stackoverflow.com/questions/24871044/in-spark-what-does-the-parameter-minpartitions-works-in-sparkcontext-textfile).

Thanks

> On May 5, 2015, at 5:59 AM, Oleg Ruchovets <[email protected]> wrote:
>
> Hi
> We are using pyspark 1.3 and the input is text files located on hdfs.
>
> file structure
> <day1>
>     file1.txt
>     file2.txt
> <day2>
>     file1.txt
>     file2.txt
> ...
>
> Question:
>
> 1) What is the way to provide multiple files located in multiple folders
> (on hdfs) as input for a PySpark job?
> Using the textFile method works fine for a single file or folder, but how can I do
> it using multiple folders?
> Is there a way to pass an array or list of files?
>
> 2) What is the meaning of the partition parameter in the textFile method?
>
> sc = SparkContext(appName="TAD")
> lines = sc.textFile(<my input>, 1)
>
> Thanks
> Oleg.
