Hi there,

I've been using Spark to process 33,000 gzipped files containing billions
of JSON records (the metadata [WAT] dataset from Common Crawl). I've hit a
few issues and haven't yet found answers in the documentation or via
search. This may well just be me not finding the right pages, though I
promise I've attempted to RTFM thoroughly!

Is there any way to (a) ensure retry attempts are made on different nodes
and/or (b) ensure there's a delay before a failed task is retried (similar
to spark.shuffle.io.retryWait)?
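
To be concrete about (b), the behaviour I'm after is roughly what I'd
otherwise have to hand-roll inside each task, something like the sketch
below (withRetries is a hypothetical helper of mine, not a Spark API, and
op stands in for whatever call actually hits S3):

    import scala.util.{Failure, Success, Try}

    // Hypothetical helper: retry op with a sleep that doubles between
    // attempts -- the per-retry delay I can't find a setting for.
    @annotation.tailrec
    def withRetries[T](attemptsLeft: Int, waitMs: Long)(op: => T): T =
      Try(op) match {
        case Success(result) => result
        case Failure(_) if attemptsLeft > 1 =>
          Thread.sleep(waitMs)                          // pause before retrying
          withRetries(attemptsLeft - 1, waitMs * 2)(op) // exponential backoff
        case Failure(e) => throw e
      }

Obviously I'd much rather the scheduler handled this than have to wrap
every S3 call myself.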

Optimally, when a task fails it would be handed to a different executor.
That's not what I've seen: with spark.task.maxFailures set to 16, the task
is handed back to the same executor 16 times, even though there are 89
other nodes available.

The retry attempts are also incredibly fast. The transient issue (DNS
resolution to an Amazon S3 bucket failing) clears up quickly, but the 16
retry attempts take less than a second in total, all on the same flaky node.

For now I've set spark.task.maxFailures to an absurdly high number, which
works around the issue -- the DNS error clears on the affected machine
after ~22 seconds (~360 task attempts) -- but that's obviously suboptimal.
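
For reference, the current workaround is roughly the following (the 400 is
purely illustrative, as is the app name; the default is 4):

    import org.apache.spark.{SparkConf, SparkContext}

    // Blunt workaround: raise spark.task.maxFailures high enough that the
    // task survives being retried on the flaky node until DNS recovers.
    val conf = new SparkConf()
      .setAppName("commoncrawl-wat")
      .set("spark.task.maxFailures", "400")
    val sc = new SparkContext(conf)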

Additionally, are there other options for handling node failures? Other
than maxFailures, I've only found settings relating to shuffle failures. In
the one instance where a node lost communication, it killed the whole job;
I'd assumed the lost partitions would be reconstructed from the RDD
lineage. For now I've tried to work around it by persisting replicas to
multiple machines (MEMORY_AND_DISK_SER_2).
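
That mitigation looks roughly like this (the path is just a placeholder,
and sc is the same context as above):

    import org.apache.spark.storage.StorageLevel

    // Keep a second serialized replica of each cached partition so that
    // losing a single node hopefully doesn't take the whole job down.
    val wat = sc.textFile("s3n://my-bucket/path/to/wat/*.gz")  // placeholder path
    wat.persist(StorageLevel.MEMORY_AND_DISK_SER_2)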

Thanks! ^_^

-- 
Regards,
Stephen Merity
Data Scientist @ Common Crawl
