Hi there,

I've been using Spark to process 33,000 gzipped files containing billions of JSON records (the metadata [WAT] dataset from Common Crawl). I've hit a few issues and haven't yet found answers in the documentation or via search. This may well just be me not finding the right pages, though I promise I've tried to RTFM thoroughly!
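For context, the job looks roughly like the following (Scala; the app name and S3 path are just placeholders, and the JSON parsing is elided):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("WAT metadata extraction")
val sc = new SparkContext(conf)

// ~33,000 gzipped WAT files; textFile handles .gz transparently, but gzip isn't splittable,
// so each file ends up as a single partition. The path below is a placeholder.
val lines = sc.textFile("s3n://<bucket>/<crawl prefix>/*.wat.gz")

// Each WAT record's metadata payload is a single JSON line; filtering and parsing those
// lines out of `lines` is elided here.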
Is there any way to (a) ensure retry attempts are made on different nodes and/or (b) introduce a delay between retries of a failing task (similar to spark.shuffle.io.retryWait)? Ideally, when a task fails it would be handed to a different executor. That hasn't been what I've seen: with maxFailures set to 16, the task is handed back to the same executor 16 times, even though there are 89 other nodes, and the retry attempts are incredibly fast. The underlying issue is transient (DNS resolution to an Amazon bucket fails), but the 16 retry attempts take less than a second and all run on the same flaky node. For now I've set maxFailures to an absurdly high number, which works around the issue -- the DNS error clears on the affected machine after ~22 seconds (~360 task attempts) -- but that's obviously suboptimal.

Additionally, are there other options for handling node failures? Other than maxFailures, I've only found settings relating to shuffle failures. In the one instance where a node lost communication, it killed the job; I'd assumed the lost RDD partitions would be reconstructed. For now I've tried to work around it by persisting to multiple machines (MEMORY_AND_DISK_SER_2). I've pasted a rough sketch of both workarounds at the bottom of this mail.

Thanks! ^_^

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl
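P.S. For reference, the two workarounds look roughly like this against the sketch earlier in the mail, i.e. the conf there becomes the one below (400 is just a value that comfortably exceeds the ~360 attempts, and `lines` is the RDD from that sketch):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Retry workaround: an absurdly high task retry limit, set on the conf before the
// SparkContext is created. I couldn't find a task-level analogue of
// spark.shuffle.io.retryWait, so the ~360 rapid same-node attempts simply outlast the
// ~22 second DNS failure. (The default for spark.task.maxFailures is 4.)
val conf = new SparkConf()
  .setAppName("WAT metadata extraction")
  .set("spark.task.maxFailures", "400")

// Node-loss workaround: cached partitions are stored serialized, spill to disk when they
// don't fit in memory, and are replicated onto a second machine.
lines.persist(StorageLevel.MEMORY_AND_DISK_SER_2)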