I have a temporary fix for my case. My sample file was 2G / 50M lines, and
my initial configuration used 1000 splits.

Based on my understanding of distributed algorithms, the number of splits can
affect memory pressure in operations such as distinct and reduceByKey.
So I reduced the number of splits from 1000 to 100, and now I can run
distinct and reduceByKey on files of that size (2G / 50M lines). A sketch of
what I changed is below.
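
Roughly, this is a minimal sketch of the workaround, assuming a plain-text
input read via sc.textFile; the file path, app name, and tab-separated key
extraction are hypothetical placeholders, not from my actual job:

import org.apache.spark.{SparkConf, SparkContext}

object FewerSplits {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FewerSplits")
    val sc = new SparkContext(conf)

    // Read the ~2G / 50M-line file with 100 splits instead of 1000.
    val lines = sc.textFile("hdfs:///path/to/sample.txt", 100)

    // distinct and reduceByKey both accept an explicit numPartitions,
    // so the shuffle output also uses 100 partitions.
    val uniqueLines = lines.distinct(100)
    val counts = lines
      .map(line => (line.split("\t")(0), 1L))   // hypothetical key extraction
      .reduceByKey(_ + _, 100)

    println(s"distinct: ${uniqueLines.count()}, keys: ${counts.count()}")
    sc.stop()
  }
}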

Unfortunately it still doesn't scale well.

Thanks. 



