Hi All,

I am using PySpark Structured Streaming with Azure Databricks for a data
load process.

The pipeline runs on a job cluster and it is the only pipeline running,
yet I am hitting an out-of-memory issue after it has been running for a
long time. When I inspect the cluster metrics, I can see that memory
usage keeps increasing over time even though there is no huge volume of
data.

[Screenshots: cluster memory usage metrics]

After the pipeline has been running continuously for about 4 hours, I get
an out-of-memory error: used memory on the driver grows from 47 GB to
111 GB, which is more than double. I cannot understand why so much memory
is occupied on the driver. Am I missing something here? Could you guide
me in figuring out the root cause?

Note:
1. I confirmed that the persist and unpersist calls in my code are
handled properly for every batch execution (see the sketch after this
list).
2. The data volume is not increasing over time (the stream receives
roughly the same amount of data in every batch).
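
For illustration, here is a simplified sketch of the persist/unpersist
pattern I am describing. The source/target paths, the delta format, and
the function name process_batch are placeholders, not the exact code from
my job:

    from pyspark.sql import DataFrame

    def process_batch(batch_df: DataFrame, batch_id: int):
        # Cache the micro-batch so the writes/transformations below reuse it
        batch_df.persist()
        try:
            # ... batch transformations and writes go here (placeholder) ...
            batch_df.write.format("delta").mode("append").save("/mnt/target/table")
        finally:
            # Release the cached micro-batch once this batch is finished
            batch_df.unpersist()

    # 'spark' is the session provided by the Databricks runtime
    query = (
        spark.readStream
             .format("delta")
             .load("/mnt/source/table")
             .writeStream
             .foreachBatch(process_batch)
             .option("checkpointLocation", "/mnt/checkpoints/pipeline")
             .start()
    )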


Thanks,
