Don't call .collect() if your data size is huge; you can simply call count() to
trigger the execution.
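
For example, something along these lines (a minimal sketch against the Spark
1.x Java API; the input path and the tokenizing logic inside flatMapToPair are
placeholders, not your actual job):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CountInsteadOfCollect {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("count-demo"));

    // Hypothetical input path, substitute your own.
    JavaRDD<String> input = sc.textFile("hdfs:///path/to/input");

    // Same shape of pipeline as described below: flatMapToPair -> reduceByKey.
    JavaPairRDD<String, Integer> counts = input
        .flatMapToPair(line -> {
          // Placeholder tokenizer, stands in for your real flatMapToPair logic.
          List<Tuple2<String, Integer>> pairs = new ArrayList<>();
          for (String word : line.split("\\s+")) {
            pairs.add(new Tuple2<>(word, 1));
          }
          return pairs;  // Spark 1.x expects an Iterable here.
        })
        .reduceByKey((a, b) -> a + b);

    // count() runs the full computation but only ships a single long back
    // to the driver, instead of the entire result set like collect() does.
    long n = counts.count();
    System.out.println("records in output RDD: " + n);

    sc.stop();
  }
}

If count() succeeds on the 100GB input, that would point at the collect()
step (moving the result to the driver) rather than the computation itself.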

Can you paste your exception stack trace so that we'll know what's happening?

Thanks
Best Regards

On Fri, Mar 27, 2015 at 9:18 PM, Zsolt Tóth <toth.zsolt....@gmail.com>
wrote:

> Hi,
>
> I have a simple Spark application: it creates an input RDD with
> sc.textFile, then calls flatMapToPair, reduceByKey and map on it. The
> output RDD is small, a few MBs. Then I call collect() on the output.
>
> If the text file is ~50GB, it finishes in a few minutes. However, if it's
> larger (~100GB), the execution hangs at the end of the collect() stage. The
> UI shows one active job (collect), one completed stage (flatMapToPair) and
> one active stage (collect). The collect stage has 880/892 tasks succeeded,
> so I think the issue occurs as the whole job finishes (every task on the
> UI is either in SUCCESS or RUNNING state).
> The driver and the containers don't log anything for 15 minutes, then I
> get a connection timeout.
>
> I run the job in yarn-cluster mode on Amazon EMR with Spark 1.2.1 and
> Hadoop 2.4.0.
>
> This happens every time I run the process with the larger input data, so I
> think this isn't just a connection issue or something like that. Is this a
> Spark bug, or is something wrong with my setup?
>
> Zsolt
>
