Hi All,

I am using PySpark Structured Streaming on Azure Databricks for a data load process.
In the pipeline I am using a Job cluster and running only this one pipeline, and I hit an OUT OF MEMORY issue after it runs for a long time. When I inspect the cluster metrics, I can see the memory usage increasing steadily over time, even though there is no huge volume of data.

[images: cluster memory-usage metrics screenshots]

After 4 hours of running the pipeline continuously I get the out-of-memory error: the used memory on the driver grows from 47 GB to 111 GB, more than double, and I cannot understand why so much memory is occupied on the driver. Am I missing anything here? Could you guide me to figure out the root cause?

Note:
1. I confirmed that the persist and unpersist calls I use in the code are handled properly for every batch execution (the sketch below shows the pattern I mean).
2. The data volume is not increasing over time; the stream receives almost the same amount of data for every batch.
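For reference, the sketch below is roughly the persist/unpersist pattern I follow inside foreachBatch. It is a simplified illustration only: the Auto Loader source, the paths, the target table name, and the trigger interval are placeholders, not my actual pipeline code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def process_batch(batch_df, batch_id):
    # Cache the micro-batch because it is used more than once downstream.
    batch_df.persist()
    try:
        # Placeholder sink; the real pipeline's transforms and writes go here.
        batch_df.write.format("delta").mode("append").saveAsTable("target_table")
    finally:
        # Released at the end of every micro-batch.
        batch_df.unpersist()

stream_df = (
    spark.readStream
    .format("cloudFiles")                 # assumption: Auto Loader source
    .option("cloudFiles.format", "json")
    .load("/mnt/source/path")             # placeholder path
)

query = (
    stream_df.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/mnt/checkpoints/pipeline")  # placeholder path
    .trigger(processingTime="1 minute")   # placeholder trigger interval
    .start()
)

Thanks,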