Re: Best approach for processing all files parallelly

2016-10-10 Thread ayan guha
Hi, sorry for the confusion, but I meant for those functions to be written by you. They are your business logic or ETL logic. On 10 Oct 2016 21:06, "Arun Patel" wrote: > Ayan, which version of Python are you using? I am using 2.6.9 and I don't > find generateFileType and getSchemaFor functions. Thanks
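As ayan says, `generateFileType` and `getSchemaFor` are user-written helpers, not part of any Spark or Python API. A minimal pure-Python sketch of what they might look like follows; the extension-based classification rule and the schemas are hypothetical stand-ins for your own business logic.

```python
# Hypothetical user-written helpers, as described in the thread.
# The classification rule and schemas below are illustrative assumptions,
# not part of Spark or the Python standard library.

def generateFileType(filename):
    """Classify a file by name; here a simple extension check (assumed)."""
    if filename.endswith(".csv"):
        return "csv"
    if filename.endswith(".json"):
        return "json"
    return "unknown"

def getSchemaFor(file_type):
    """Return the column list for a file type; schemas are illustrative."""
    schemas = {
        "csv": ["id", "name", "amount"],
        "json": ["id", "payload"],
    }
    return schemas.get(file_type)
```

In practice `getSchemaFor` would likely return a `pyspark.sql.types.StructType`, but the shape of the lookup is the same.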

Re: Best approach for processing all files parallelly

2016-10-10 Thread Arun Patel
Ayan, which version of Python are you using? I am using 2.6.9 and I can't find the generateFileType and getSchemaFor functions. Thanks for your help. On Fri, Oct 7, 2016 at 1:17 AM, ayan guha wrote: > Hi > > generateFileType (filename) returns FileType > > getSchemaFor(FileType) returns schema for

Re: Best approach for processing all files parallelly

2016-10-06 Thread ayan guha
Hi, generateFileType(filename) returns a FileType, and getSchemaFor(FileType) returns the schema for that FileType. This for loop DOES NOT process files sequentially. It iterates over the file types sequentially, creating one DataFrame per type; the files of each type are read together. On Fri, Oct 7, 2016 at 12:08 AM, Arun Patel wrote: > Thanks Ayan. Cou
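The point above can be illustrated without Spark: the outer loop runs once per distinct file TYPE, not once per file, so with a handful of types and thousands of files it executes only a handful of iterations. The helper and the dict of lists below are stand-ins (assumed, not from the thread) for `generateFileType` and for creating one DataFrame per type.

```python
# Pure-Python simulation (no Spark): the loop is over file TYPES,
# so it is short even when the file list is huge.

files = ["a.csv", "b.csv", "c.json", "d.csv", "e.json"]

def generateFileType(filename):
    # Hypothetical classifier: use the extension as the type.
    return filename.rsplit(".", 1)[-1]

tagged = [(f, generateFileType(f)) for f in files]

# Analogous to rdd1.map(lambda t: t[1]).distinct().collect()
types = sorted({t for _, t in tagged})

frames = {}
for file_type in types:                       # sequential over TYPES only
    same_type = [f for f, t in tagged if t == file_type]
    frames[file_type] = same_type             # stand-in for one DataFrame per type

print(types)  # two iterations total, however many files there are
```

In real PySpark, the body of the loop would call something like `spark.read.schema(getSchemaFor(file_type)).csv(paths)`, and Spark would read the files of that type in parallel across the cluster.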

Re: Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
Thanks Ayan. A couple of questions: 1) What do the generateFileType and getSchemaFor functions look like? 2) The 'for loop' processes files sequentially, right? My requirement is to process all files at the same time. On Thu, Oct 6, 2016 at 8:52 AM, ayan guha wrote: > Hi > > In this case, if you see, t

Re: Best approach for processing all files parallelly

2016-10-06 Thread ayan guha
Hi, in this case, if you look, t[1] is NOT the file content, because I added a "FileType" field. So this collect only brings back the list of file types, which should be fine. On Thu, Oct 6, 2016 at 11:47 PM, Arun Patel wrote: > Thanks Ayan. I am really concerned about the collect. > > types = rdd1
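A small mock makes ayan's point concrete: if each record carries a FileType field at index 1, then mapping to `t[1]` and collecting the distinct values moves only a few short type strings to the driver, never the file contents. The `(filename, FileType, content)` record layout below is an assumption for illustration.

```python
# Mocked records with an added FileType field, as described in the thread.
# In Spark these would stay distributed; only the type strings are collected.

rdd1_local = [
    ("a.csv",  "csv",  "id,name\n1,x"),   # content is NOT what gets collected
    ("b.json", "json", '{"id": 1}'),
    ("c.csv",  "csv",  "id,name\n2,y"),
]

# Equivalent of: rdd1.map(lambda t: t[1]).distinct().collect()
types = sorted({t[1] for t in rdd1_local})
print(types)  # a few bytes on the driver, regardless of total file size
```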

Re: Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
Thanks Ayan. I am really concerned about the collect: types = rdd1.map(lambda t: t[1]).distinct().collect() This will ship all the files to the driver, right? That must be inefficient. On Thu, Oct 6, 2016 at 7:58 AM, ayan guha wrote: > Hi > > I think you are correct direction. What is missing

Re: Best approach for processing all files parallelly

2016-10-06 Thread ayan guha
Hi, I think you are headed in the correct direction. What is missing: you do not need to create a DF for each file. You can group files with similar structures together (by doing some filtering on the file name) and then create one DF per type of file. Also, creating a DF on wholeTextFile seems wasteful to me. I w