Hi, I've been struggling to insert 2-3 million records into Mongo using plain Spark RDDs, and I ran into a lot of hidden problems along the way.
I would like to avoid this in the future, so I'm looking for a way to kill individual Spark tasks at a specific stage and verify the expected behaviour of my Spark job.

Ideal setup:

1. Write the Spark job.
2. Run the job on YARN.
3. Run a tool that kills a certain percentage (or number) of tasks at a specific stage.
4. Verify the results.

Real-world scenario: the Mongo Spark driver makes the very optimistic assumption that inserts never fail. I enabled ordered=false on the writer so that duplicate-record insertions are ignored. That more or less worked until speculative execution got involved:

- A task failed once because of duplicates. That was expected: another task had already uploaded the same data.
- Spark then killed the same task twice more during speculative execution.
- The whole job failed, since that task had 3 failures recorded and spark.task.maxFailures=4.

I never hit three failures on the dev cluster in 100+ runs, but I did hit it in production :) The production cluster is a bit noisy. Such a chaos monkey would help me tune my job configuration for production using the dev cluster. A rough sketch of what I have in mind is below.
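For the "kill N% of tasks in stage X" part, one approach I'm considering is a driver-side listener built on `SparkContext.killTaskAttempt` (available since Spark 2.2). This is only a sketch under my assumptions; `StageChaosListener`, `targetStageId` and `killFraction` are names I made up, and note that attempts killed this way are reported as KILLED rather than FAILED, so they may not count toward spark.task.maxFailures the way a real insert failure does:

```scala
import scala.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

/**
 * Hypothetical chaos listener: kills a fraction of task attempts in one target stage.
 * Register it on the driver before triggering the action you want to test.
 */
class StageChaosListener(sc: SparkContext, targetStageId: Int, killFraction: Double)
  extends SparkListener {

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    if (taskStart.stageId == targetStageId && Random.nextDouble() < killFraction) {
      // Kill this attempt; the scheduler will retry it, which exercises the
      // retry path of the job. Killed attempts show up as KILLED in the UI.
      sc.killTaskAttempt(
        taskStart.taskInfo.taskId,
        interruptThread = true,
        reason = s"chaos test: killing ~${(killFraction * 100).toInt}% of tasks in stage $targetStageId")
    }
  }
}

// Usage: kill roughly 20% of the tasks of stage 3 (stage ids are visible in the Spark UI).
// sc.addSparkListener(new StageChaosListener(sc, targetStageId = 3, killFraction = 0.2))
```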
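For reproducing the failure-counting behaviour itself, a simpler option (assuming I can touch the job code) is to inject random failures inside the stage I care about: exceptions thrown in a task count as regular failures, so they interact with spark.task.maxFailures and speculation the same way a flaky Mongo insert does. `failureRate` and the parallelized range here are made up for illustration; the real data and the Mongo save call would go where the comments indicate:

```scala
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object ChaosInjectionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("chaos-injection-sketch")
      // The knobs involved in my production incident: speculation and the per-task failure budget.
      .set("spark.speculation", "true")
      .set("spark.task.maxFailures", "4")
    val sc = new SparkContext(conf)

    val failureRate = 0.1 // hypothetical: fail roughly 10% of partitions
    val records = sc.parallelize(1 to 1000000, numSlices = 200)

    val chaoticRecords = records.mapPartitions { iter =>
      // Throwing here marks the task attempt as FAILED (not KILLED), so it
      // counts toward spark.task.maxFailures, just like a failed insert would.
      if (Random.nextDouble() < failureRate)
        throw new RuntimeException("chaos test: simulated insert failure")
      iter
    }

    // In the real job, chaoticRecords would be saved with the Mongo connector
    // as usual; count() stands in for that action here.
    chaoticRecords.count()
  }
}
```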