I have gone through the debug logs of the jobs. There are no failures or exceptions in the logs. The issue does not seem to be specific to any particular job, as several of our jobs have been impacted by it, and the same jobs also pass on retry.
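For context, DEBUG logging can be turned on from the job code itself; a minimal sketch (the app name here is a placeholder, and this is not necessarily how our jobs configure it):

    from pyspark.sql import SparkSession

    # Minimal sketch: raise the driver-side log level to DEBUG at runtime.
    # Executor log levels typically still come from the log4j configuration
    # shipped on the executor image.
    spark = SparkSession.builder.appName("debug-logging-sketch").getOrCreate()
    spark.sparkContext.setLogLevel("DEBUG")

    # ... job logic, e.g. the parquet read/write that intermittently fails ...

    spark.stop()

The log4j configuration bundled with the Spark image is the other usual place to control this.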
I am trying to figure out why the driver pod is getting deleted when this issue occurs. Even if there were some error, the driver pod should remain there in the Error state. What could be the potential reasons for the driver pod being deleted, so that we can investigate in that direction? (One check from our side, pulling the Kubernetes events for the driver pod, is sketched at the end of this mail.)

Regards,
Shrikant

On Sat, 29 Oct 2022 at 1:14 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Maybe enabling DEBUG level logging in your job and following the
> processing logic until the failure?
>
> BTW, you need to look at what happens during job processing.
>
> `Spark Context was shutdown` is not the root cause, but the result of job
> failure in most cases.
>
> Dongjoon.
>
> On Fri, Oct 28, 2022 at 12:10 AM Shrikant Prasad <shrikant....@gmail.com>
> wrote:
>
>> Thanks, Dongjoon, for replying. I have tried with Spark 3.2 and am still
>> facing the same issue.
>>
>> Looking for some pointers which can help in debugging to find the
>> root cause.
>>
>> Regards,
>> Shrikant
>>
>> On Thu, 27 Oct 2022 at 10:36 PM, Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, Shrikant.
>>>
>>> It seems that you are using non-GA features.
>>>
>>> FYI, since Apache Spark 3.1.1, Kubernetes Support became GA in the
>>> community.
>>>
>>> https://spark.apache.org/releases/spark-release-3-1-1.html
>>>
>>> In addition, Apache Spark 3.1 reached EOL last month.
>>>
>>> Could you try the latest distribution, like Apache Spark 3.3.1, to see
>>> whether you are still experiencing the same issue?
>>>
>>> It will reduce the scope of your issues by excluding many known and
>>> fixed bugs in 3.0/3.1/3.2/3.3.0.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Oct 26, 2022 at 11:16 PM Shrikant Prasad <shrikant....@gmail.com>
>>> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> We are using Spark 3.0.1 with the Kubernetes resource manager. We are
>>>> facing an intermittent issue in which the driver pod gets deleted and
>>>> the driver logs have the message that the Spark Context was shut down.
>>>>
>>>> The same job works fine with the given set of configurations most of
>>>> the time, but sometimes it fails. It mostly occurs while reading or
>>>> writing parquet files to HDFS (but we are not sure whether that is the
>>>> only use case affected).
>>>>
>>>> Any pointers to find the root cause?
>>>>
>>>> Most of the earlier reported issues mention executors getting OOM as
>>>> the cause, but we have not seen an OOM error in any of the executors.
>>>> Also, why would the context be shut down in this case instead of
>>>> retrying with new executors? Another doubt is why the driver pod gets
>>>> deleted. Shouldn't it just error out?
>>>>
>>>> Regards,
>>>> Shrikant
>>>>
>>>> --
>>>> Regards,
>>>> Shrikant Prasad
>>>>
>> --
>> Regards,
>> Shrikant Prasad
>>
-- 
Regards,
Shrikant Prasad
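PS: The check mentioned above is pulling the Kubernetes events recorded for the driver pod to look for a deletion reason (e.g. eviction or preemption). A rough sketch, assuming the official kubernetes Python client; the namespace and pod name below are placeholders:

    from kubernetes import client, config

    # Rough sketch: list the Kubernetes events recorded against the driver pod.
    # "spark-jobs" and "my-spark-app-driver" are placeholders for the real
    # namespace and driver pod name.
    config.load_kube_config()  # use config.load_incluster_config() when run in-cluster
    v1 = client.CoreV1Api()

    events = v1.list_namespaced_event(
        namespace="spark-jobs",
        field_selector="involvedObject.name=my-spark-app-driver",
    )
    for event in events.items:
        print(event.last_timestamp, event.type, event.reason, event.message)

Note that events are retained only for a limited time (one hour by default), so this has to be run soon after a failure.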