I thought I would share something valuable that Jacob Perkins (who recently started with us) shared. We were seeing blacklisted task trackers and occasionally failed jobs, almost always caused by TimedOutExceptions from Cassandra. We've been fixing the underlying reasons for those exceptions. However, one thing Jacob found when he hit timeout errors with Elasticsearch + Hadoop was that if he gave Elasticsearch a few more tries before failing the job, things finished. So he cranked those settings up.

Granted, if you crank them too high, jobs that should fail never get the chance to. But for us, it turned out we just needed to give Cassandra a few more tries in general. We're still getting the gremlins out here and there, but you can set this at the job level or on the task trackers themselves. It gives Cassandra a few more tries for each task in that job, so Hadoop doesn't blacklist the task tracker for the job as quickly and doesn't fail the job as easily. An example configuration (for the job configuration or for the task trackers' mapred-site.xml) is:
<property>
  <name>mapred.max.tracker.failures</name>
  <value>20</value>
</property>
<property>
  <name>mapred.map.max.attempts</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>20</value>
</property>

Just thought I would share this because I've seen others experience this problem. It's not a complete solution, but it can come in handy if you want to make Hadoop more fault-tolerant with Cassandra.
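If you'd rather set these per job in code instead of in mapred-site.xml, something like the following sketch should work. It assumes the old-style mapred.* property names (Hadoop 0.20.x era); the class name and job name are just placeholders, not anything from our setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryTunedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Give each map/reduce task more attempts before the task counts as failed.
        conf.setInt("mapred.map.max.attempts", 20);
        conf.setInt("mapred.reduce.max.attempts", 20);
        // Allow more task failures on a tracker before it is blacklisted for this job.
        conf.setInt("mapred.max.tracker.failures", 20);

        Job job = new Job(conf, "cassandra-retry-tuned-job");
        // ... set mapper/reducer, ColumnFamilyInputFormat, output format, etc. ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}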