I thought I would share something valuable that Jacob Perkins (who recently 
started with us) shared.  We were seeing blacklisted task trackers and 
occasionally failed jobs, almost always caused by TimedOutExceptions from 
Cassandra.  We've been fixing the underlying reasons for those exceptions.  
However, Jacob had found that when he hit timeout errors with Elasticsearch 
+ Hadoop, giving Elasticsearch a few more tries before failing the job let 
things finish, so he cranked the retry settings up.  Granted, if you crank 
them too high, jobs that really should fail take much longer to do so.  But 
in our case we just needed to give Cassandra a few more tries in general.  
We're still getting the gremlins out here and there, but you can set this at 
the job level or on the task trackers themselves.  It gives each task in the 
job a few more attempts against Cassandra, so Hadoop doesn't blacklist a node 
for the job as quickly and doesn't fail the job as easily.  An example 
configuration (for the job configuration or for the task trackers' 
mapred-site.xml) is:

<property>
  <name>mapred.max.tracker.failures</name>
  <value>20</value>
</property>
<property>
  <name>mapred.map.max.attempts</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>20</value>
</property>
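
If you'd rather set these at the job level from code instead of XML, the 
old-style JobConf API has setters for the same properties.  A minimal sketch 
(the class name here is just a placeholder, not something from our setup):

import org.apache.hadoop.mapred.JobConf;

public class RetryTuningExample {
    public static JobConf buildConf() {
        // Placeholder job class; substitute your own job's main class.
        JobConf conf = new JobConf(RetryTuningExample.class);

        // Up to 20 task failures on one task tracker before that tracker
        // is blacklisted for this job (mapred.max.tracker.failures).
        conf.setMaxTaskFailuresPerTracker(20);

        // Up to 20 attempts per map/reduce task before the job fails
        // (mapred.map.max.attempts / mapred.reduce.max.attempts).
        conf.setMaxMapAttempts(20);
        conf.setMaxReduceAttempts(20);

        return conf;
    }
}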

Just thought I would share this because I've seen others run into this 
problem.  It's not a complete solution, but it can come in handy if you want 
to make Hadoop more fault-tolerant with Cassandra.
