Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
Quoting the earlier reply (To: Nathan Lande; Cc: Keith Bourgoin, Irina Truong, u...@spark.incubator.apache.org): > +1 for using S3A. It would also depend on what format you're using. I agree with Steve that Parquet…

Re: Long-running job OOMs driver process

2016-11-18 Thread Yong Zhang
+1 for using S3A. It would also depend on what format you're using. I agree with Steve that Parquet, for instance, is a good option. If you're using plain text files, some people use GZ files but they cannot be partitioned, thus putting a lot of pressure on the driver…

Re: Long-running job OOMs driver process

2016-11-18 Thread Alexis Seigneurin
+1 for using S3A. It would also depend on what format you're using. I agree with Steve that Parquet, for instance, is a good option. If you're using plain text files, some people use GZ files but they cannot be partitioned, thus putting a lot of pressure on the driver. It doesn't look like this is…
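A minimal sketch of that suggestion, assuming a PySpark job with the S3A connector already on the classpath (the bucket and paths are hypothetical): read the raw text over s3a:// and write Parquet rather than gzipped text.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

    # Plain (or bzip2-compressed) text is splittable; .gz files are not, so each
    # .gz file becomes a single, possibly huge task.
    df = spark.read.text("s3a://my-bucket/incoming/2016-11-18/*.txt")

    # Parquet is columnar, compressed, and splittable, so downstream reads scale well.
    df.write.mode("overwrite").parquet("s3a://my-bucket/parquet/2016-11-18/")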

Re: Long-running job OOMs driver process

2016-11-18 Thread Nathan Lande
+1 to not threading. What does your load look like? If you are loading many files and caching them in N RDDs rather than 1 RDD, this could be an issue. If the above two things don't fix your OOM issue, without knowing anything else about your job, I would focus on your caching strategy as a potential…
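A sketch of the "N RDDs rather than 1" point in DataFrame terms, again assuming PySpark and hypothetical paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("single-cache").getOrCreate()
    paths = ["s3a://my-bucket/in/a.txt", "s3a://my-bucket/in/b.txt"]  # hypothetical

    # Anti-pattern: one cached DataFrame per file keeps N separate sets of cached
    # blocks (and N sets of driver-side bookkeeping) alive at once.
    # cached = [spark.read.text(p).cache() for p in paths]

    # Preferred: read all the files as one DataFrame and cache it once.
    df = spark.read.text(paths)   # a list of paths produces a single DataFrame
    df.cache()
    df.write.parquet("s3a://my-bucket/out/")
    df.unpersist()                # release the cached blocks when done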

Re: Long-running job OOMs driver process

2016-11-18 Thread Steve Loughran
On 18 Nov 2016, at 14:31, Keith Bourgoin <ke...@parsely.com> wrote: We thread the file processing to amortize the cost of things like getting files from S3. Define cost here: actual $ amount, or merely time to read the data? If it's read times, you should really be trying the new stuff…
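For the read-time side of that question, a hedged sketch of pointing the job at S3A; the connection setting and the packaging note are illustrative assumptions, not recommendations made in this thread:

    from pyspark.sql import SparkSession

    # The hadoop-aws module (and a matching AWS SDK jar) must be on the classpath,
    # e.g. via spark-submit --packages org.apache.hadoop:hadoop-aws:<your Hadoop version>.
    spark = (SparkSession.builder
             .appName("s3a-reads")
             # Allow more concurrent HTTP connections when listing and reading many files.
             .config("spark.hadoop.fs.s3a.connection.maximum", "100")
             .getOrCreate())

    df = spark.read.text("s3a://my-bucket/incoming/*.txt")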

Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
Hi Alexis, Thanks for the response. I've been working with Irina on trying to sort this issue out. We thread the file processing to amortize the cost of things like getting files from S3. It's a pattern we've seen recommended in many places, but I don't have any of those links handy. The problem…
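The threaded pattern being described is roughly the sketch below, assuming PySpark, a single shared SparkSession, and hypothetical paths; overlapping S3 latency is the goal, though every submitted job still competes for the same executor resources:

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("threaded-loads").getOrCreate()

    def convert(path):
        # Each driver thread submits its own job; SparkContext is thread-safe for
        # job submission, but all jobs share the same executor cores and memory.
        (spark.read.text(path)
              .write.mode("overwrite")
              .parquet(path.replace("/incoming/", "/parquet/")))

    paths = ["s3a://my-bucket/incoming/f1.txt",
             "s3a://my-bucket/incoming/f2.txt"]   # hypothetical
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(convert, paths))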

Re: Long-running job OOMs driver process

2016-11-17 Thread Alexis Seigneurin
Hi Irina, I would question the use of multiple threads in your application. Since Spark is going to run the processing of each DataFrame on all the cores of your cluster, the processes will be competing for resources. In fact, they would not only compete for CPU cores but also for memory. Spark is…
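If concurrent submission is kept despite that caveat, Spark's fair scheduler is the usual way to keep one job from monopolizing the cores; a sketch, with the pool name being a hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("fair-pools")
             # FAIR mode shares executor slots across concurrently submitted jobs
             # instead of running them strictly FIFO.
             .config("spark.scheduler.mode", "FAIR")
             .getOrCreate())

    # Jobs submitted from this thread go into the named pool (hypothetical name).
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "file_loads")
    spark.read.text("s3a://my-bucket/incoming/f1.txt") \
         .write.parquet("s3a://my-bucket/parquet/f1/")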

Long-running job OOMs driver process

2016-11-17 Thread Irina Truong
We have an application that reads text files, converts them to dataframes, and saves them in Parquet format. The application runs fine when processing a few files, but we have several thousand produced every day. When running the job for all files, we have spark-submit killed on OOM: # java.lang.OutOfMemoryError…
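For context, the shape of such a job and the driver-side knobs usually tried first for this kind of OOM (sizes, paths, and app name are illustrative only):

    # Typically submitted with the driver heap raised at launch, e.g.:
    #   spark-submit --driver-memory 8g --conf spark.driver.maxResultSize=2g job.py
    # (spark.driver.memory cannot be raised from inside an already-running driver.)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-text-to-parquet").getOrCreate()

    # One read over the whole day's files keeps this to a single job whose metadata
    # the driver tracks, rather than thousands of separately submitted jobs.
    df = spark.read.text("s3a://my-bucket/incoming/2016-11-17/*.txt")
    df.write.mode("append").parquet("s3a://my-bucket/parquet/")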