Don't call .collect() if your data size is huge; you can simply do a count() to trigger the execution.
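Something like this, as a minimal sketch (the RDD name "output" is just a placeholder for whatever your last transformation returns):

import org.apache.spark.api.java.JavaRDD;

public class TriggerWithoutCollect {
    static void run(JavaRDD<String> output) {
        // count() is an action, so the whole lineage still executes on
        // the cluster, but only a single long comes back to the driver.
        long n = output.count();
        System.out.println("records: " + n);

        // collect(), by contrast, ships every record to the driver and
        // can hang or OOM it when the result set is large:
        // List<String> all = output.collect();
    }
}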
Can you paste your exception stack trace so that we'll know what's happening?

Thanks
Best Regards

On Fri, Mar 27, 2015 at 9:18 PM, Zsolt Tóth <toth.zsolt....@gmail.com> wrote:

> Hi,
>
> I have a simple Spark application: it creates an input RDD with
> sc.textFile, and it calls flatMapToPair, reduceByKey and map on it. The
> output RDD is small, a few MBs. Then I call collect() on the output.
>
> If the text file is ~50GB, it finishes in a few minutes. However, if it's
> larger (~100GB), the execution hangs at the end of the collect() stage.
> The UI shows one active job (collect), one completed job (flatMapToPair),
> and one active stage (collect). The collect stage has 880/892 tasks
> succeeded, so I think the issue occurs when the whole job is about to
> finish (every task on the UI is either in SUCCESS or in RUNNING state).
> The driver and the containers don't log anything for 15 minutes, then I
> get a connection timeout.
>
> I run the job in yarn-cluster mode on Amazon EMR with Spark 1.2.1 and
> Hadoop 2.4.0.
>
> This happens every time I run the process with larger input data, so I
> think this isn't just a connection issue or something like that. Is this
> a Spark bug, or is something wrong with my setup?
>
> Zsolt
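For reference, here is roughly how I read your pipeline (the actual flatMapToPair/reduceByKey/map functions aren't in your mail, so the word-count-style logic below is purely hypothetical):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class PipelineSketch {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("sketch"));

        JavaRDD<String> input = sc.textFile(args[0]);

        // flatMapToPair: each input line fans out to many (key, 1) pairs.
        // The tokenizing logic here is a made-up stand-in.
        JavaPairRDD<String, Long> pairs = input.flatMapToPair(line -> {
            List<Tuple2<String, Long>> out = new ArrayList<>();
            for (String token : line.split("\\s+")) {
                out.add(new Tuple2<>(token, 1L));
            }
            return out; // Spark 1.x expects an Iterable here
        });

        // reduceByKey shrinks the data to one record per key ...
        JavaPairRDD<String, Long> counts = pairs.reduceByKey((a, b) -> a + b);

        // ... and the final map formats it before collect() pulls the
        // (small) result back to the driver.
        List<String> result =
            counts.map(t -> t._1() + "\t" + t._2()).collect();

        System.out.println(result.size() + " keys");
        sc.stop();
    }
}

If replacing collect() with count() (or writing the output with saveAsTextFile) still hangs at the same point, that would point away from driver-side memory pressure.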