Hi Folks,

We are testing our home-made KMeans algorithm using Spark on Yarn. Recently, we've found that the application fails frequently when clustering over 300,000,000 users (each user is represented by a feature vector and the whole data set is around 600,000,000). After digging into the job log, we found many CancelledKeyExceptions thrown by ConnectionManager, but we have not observed any other exceptions. We suspect that the frequent CancelledKeyExceptions bring the whole application down, since the application often fails on the third or fourth iteration for large datasets. Any suggestions on where to look would be welcome.
*Errors in job log*:

java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,43199)
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@2570cd62
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2570cd62
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@37c8b85a
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@37c8b85a
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fcea3a4
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@fcea3a4

Best,
Shengzhe