You stop the Spark job by having tasks fail repeatedly; that's already how it works. You can't kill the driver from an executor any other way, but you shouldn't need to. I'm not clear: you're saying you want to stop the job, but also continue processing?
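Concretely, a minimal sketch of that mechanism (Scala; callService here is a hypothetical stand-in for the HTTP call to your microservice): rethrow after the last retry instead of converting the failure to None. Once the same task has failed spark.task.maxFailures times, Spark aborts the stage and fails the job, and the driver sees the exception; no executor-to-driver signal is needed.

import scala.util.{Failure, Success, Try}

// Hypothetical helper: HTTP call to the microservice, throws on a 5XX response.
def callService(msg: String): String = ???

def transform(msg: String): Option[String] = {
  val maxRetries = 5
  @annotation.tailrec
  def attempt(n: Int): Option[String] =
    Try(callService(msg)) match {
      case Success(result)              => Some(result)
      case Failure(_) if n < maxRetries => attempt(n + 1)   // transient 5XX: retry
      case Failure(e) =>
        // Service still down after maxRetries: do NOT swallow this and emit None.
        // Throwing fails the task; after spark.task.maxFailures (default 4) failed
        // attempts, Spark aborts the stage and the job, and the driver gets this exception.
        throw new RuntimeException(s"microservice unavailable for $msg", e)
    }
  attempt(1)
}

// in the job:
// rdd.flatMap(transform)   // the whole job stops while the service is down

The DLQ path then stays reserved for per-message data or schema errors, which is the split you describe in the P.S.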
On Wed, Feb 16, 2022 at 7:58 AM S <sheelst...@gmail.com> wrote:

> Retries have already been implemented. The question is how to stop the Spark job by having an executor JVM send a signal to the driver JVM. E.g. I have a microbatch of 30 messages, 10 in each of the 3 partitions. Let's say that while a partition of 10 messages was being processed, the first 3 went through but then the microservice went down. Now, when the 4th message in the partition is sent to the microservice, it keeps receiving 5XX on every retry, e.g. 5 retries. What I now want is to have that task in that executor JVM send a signal to the driver JVM to terminate the Spark job on the failure of the 5th retry. Currently, what we have in place is retrying 5 times and then, upon failure (i.e. 5XX), catching the exception and moving the message to a DLQ, thereby having the flatMap produce a *None* and proceed to the next message in the partition of that microbatch. This approach keeps the pipeline alive and keeps pushing messages to the DLQ microbatch after microbatch until the microservice is back up.
>
> On Wed, Feb 16, 2022 at 6:50 PM Sean Owen <sro...@gmail.com> wrote:
>
>> You could use the same pattern in your flatMap function. If you want Spark to keep retrying, though, you don't need any special logic; that is what it would do already. You could increase the number of task retries, though; see the spark.excludeOnFailure.task.* configurations.
>>
>> You can just implement the circuit breaker pattern directly too, nothing special there, though I don't think that's what you want? You actually want to retry the failed attempts, not just avoid calling the microservice.
>>
>> On Wed, Feb 16, 2022 at 3:18 AM S <sheelst...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have a Spark job that calls a microservice in the lambda function of the flatMap transformation: it passes the inbound element to the microservice and returns the transformed value, or "None", as the output of the flatMap transform. Of course, the lambda also takes care of exceptions from the microservice, etc. The question is: there are times when the microservice may be down, and there is no point recording an exception and putting the message in the DLQ for every element in our streaming pipeline for as long as the microservice stays down. Instead, what we want to be able to do is retry the microservice call for a given event a predefined number of times and, if the service is found to be down, terminate the Spark job so that the current microbatch is terminated, there is no next microbatch, and the rest of the messages therefore remain in the source Kafka topics, unpolled and unprocessed, until the microservice is back up and the Spark job is redeployed. In regular microservices, we can implement this using the circuit breaker pattern. In Spark jobs, however, this would mean being able to somehow send a signal from an executor JVM to the driver JVM to terminate the Spark job. Is there a way to do that in Spark?
>>>
>>> P.S.:
>>> - Having the circuit breaker functionality helps restrict the purpose of the DLQ to data or schema issues only, instead of infra/network related issues.
>>> - As far as the need for the Spark job to use microservices is concerned, think of it as complex logic being maintained in a microservice that does not warrant duplication.
>>> - Checkpointing is being taken care of manually, not using Spark's default checkpointing mechanism.
>>>
>>> Regards,
>>> Sheel
>>
>
> --
> Best Regards,
> Sheel Pancholi
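For completeness, a driver-side sketch of how this fits with the manual offset handling mentioned above (assumptions: a DStream job on the Kafka 0.10 direct stream; the broker address, topic, group id and the count() action are placeholders, and transform is the flatMap function from the earlier sketch). Because offsets are committed only after a batch succeeds, a failed batch stops the job and leaves the remaining messages unpolled in Kafka until the microservice is back up and the job is redeployed.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("enrichment-job")
val ssc  = new StreamingContext(conf, Seconds(30))

// Placeholder Kafka settings; auto-commit is off because offsets are committed manually below.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "enrichment-job",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // transform is the flatMap function from the sketch above; count() stands in
  // for the real sink. If the microservice is down, this action throws.
  rdd.map(_.value).flatMap(transform).count()
  // Reached only if the whole batch succeeded.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)
}

ssc.start()
ssc.awaitTermination()   // rethrows the batch failure; the process exits with offsets uncommitted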