Re: BlockManager issues

2014-09-22 Thread Andrew Ash
Another data point on the 1.1.0 FetchFailures: running this SQL command works on 1.0.2 but fails on 1.1.0 due to the exceptions mentioned earlier in this thread: "SELECT stringCol, SUM(doubleCol) FROM parquetTable GROUP BY stringCol". The FetchFailure exception has the remote block manager that fa

Re: BlockManager issues

2014-09-22 Thread David Rowe
I've run into this with large shuffles - I assumed that there was contention between the shuffle output files and the JVM for memory. Whenever we start getting these fetch failures, it corresponds with high load on the machines the blocks are being fetched from, and in some cases complete unrespons

Re: BlockManager issues

2014-09-22 Thread Christoph Sawade
Hey all. We also had the same problem described by Nishkam, in almost the same big data setting. We fixed the fetch failure by increasing the timeout for acks in the driver: set("spark.core.connection.ack.wait.timeout", "600") // 10-minute timeout for acks between nodes. Cheers, Christoph 2014-09
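
For anyone wanting to try the timeout change described above, here is a minimal sketch of where that set(...) call would typically go. It assumes a Spark 1.x application that builds its own SparkConf in the driver; the object name and the app name are placeholders, not taken from the thread. Only the property key and the value "600" come from the message.

```scala
import org.apache.spark.SparkConf

object AckTimeoutConf {
  // Property key and value as quoted in the message above.
  val AckTimeoutKey = "spark.core.connection.ack.wait.timeout"

  // Build a driver-side configuration with the longer ack timeout.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("ack-timeout-example") // hypothetical app name
      .set(AckTimeoutKey, "600")         // value in seconds, i.e. 10 minutes

  def main(args: Array[String]): Unit = {
    val conf = build()
    // Print the effective value to confirm the setting was applied.
    println(s"$AckTimeoutKey = ${conf.get(AckTimeoutKey)}")
  }
}
```

The resulting conf would then be passed to new SparkContext(conf). Whether a longer ack timeout helps in a given cluster is not established by this thread; it is one reported workaround.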

Re: BlockManager issues

2014-09-22 Thread Hortonworks
Actually, I hit a similar issue when doing groupByKey followed by count when the shuffle size is big, e.g. 1 TB. Thanks. Zhan Zhang > On Sep 21, 2014, at 10:56 PM, Nishkam Ravi wrote: > > Thanks for the quick follow up Reynold and Patrick. Tried a run with > significantly higher ul

Re: BlockManager issues

2014-09-21 Thread Nishkam Ravi
Thanks for the quick follow up Reynold and Patrick. Tried a run with significantly higher ulimit, doesn't seem to help. The executors have 35GB each. Btw, with a recent version of the branch, the error message is "fetch failures" as opposed to "too many open files". Not sure if they are related. P

Re: BlockManager issues

2014-09-21 Thread Patrick Wendell
Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible that you are just having more spilling as a result of the patch and so the filesystem is opening more files. I would try increasing the ulimit. How much memory do your executors have? - Patrick On Sun, Sep 21, 2014 at 10:29 PM,

Re: BlockManager issues

2014-09-21 Thread Patrick Wendell
Hey the numbers you mentioned don't quite line up - did you mean PR 2711? On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin wrote: > It seems like you just need to raise the ulimit? > > > On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi wrote: > >> Recently upgraded to 1.1.0. Saw a bunch of fetch failur

Re: BlockManager issues

2014-09-21 Thread Reynold Xin
It seems like you just need to raise the ulimit? On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi wrote: > Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the > workloads. Tried tracing the problem through change set analysis. Looks > like the offending commit is 4fde28c from