I have a temporary fix for my case. My sample file was 2G / 50M lines in size, and my initial configuration used 1000 splits.
Based on my understanding of distributed algorithms, the number of splits can affect the memory pressure in operations such as distinct and reduceByKey, so I tried reducing the number of splits from 1000 to 100. With that change I can now run distinct and reduceByKey on files that are 2G / 50M lines. Unfortunately, it still doesn't scale well. For reference, the calls I'm tuning look roughly like the sketch below. Thanks.
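A minimal sketch of how the partition count can be set explicitly, assuming the standard Spark RDD API in Scala. The input path, app name, and the line-counting logic are placeholders for illustration, not the actual job; the point is only that textFile, distinct, and reduceByKey all accept an explicit partition count.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SplitTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("split-tuning-sketch")
    val sc   = new SparkContext(conf)

    // Hypothetical input path; stands in for the 2G / 50M-line sample file.
    // The second argument sets the minimum number of input splits.
    val lines = sc.textFile("hdfs:///data/sample.txt", 100)

    // distinct takes an explicit numPartitions argument, so the shuffle
    // width can be lowered from the default (here: 100 instead of 1000).
    val uniques = lines.distinct(100)

    // reduceByKey likewise accepts a numPartitions argument after the
    // reduce function.
    val counts = lines
      .map(line => (line, 1))
      .reduceByKey(_ + _, 100)

    println(s"unique lines: ${uniques.count()}, distinct keys: ${counts.count()}")
    sc.stop()
  }
}
```

With fewer partitions each shuffle moves data over fewer worker-to-worker connections, at the cost of larger per-task state, which is the trade-off being probed here.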