Hi folks,

I was debugging a job that was taking 20-30s to checkpoint data to Azure 
FS (compared to the typical < 5s), and while doing so I noticed some behavior 
that I'd like to understand better.
Our checkpoint path is as follows: 
my_user/featureflow/foo-datacenter/cluster_name/my_flink_job/checkpoint/chk-1234

What I noticed was that while taking checkpoints (incremental, using 
RocksDB) we make a number of List calls to Azure:
my_user/featureflow/foo-datacenter/cluster_name/my_flink_job/checkpoint
my_user/featureflow/foo-datacenter/cluster_name/my_flink_job
my_user/featureflow/foo-datacenter/cluster_name
my_user/featureflow/foo-datacenter
my_user/featureflow
my_user
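
To make the pattern concrete, here is a minimal Python sketch that enumerates 
the ancestors of our checkpoint path, nearest first. It only illustrates the 
sequence of prefixes we see being listed; it is not Flink or Azure client 
code, and the helper name parent_prefixes is made up for this example.

```python
from pathlib import PurePosixPath


def parent_prefixes(checkpoint_path: str) -> list[str]:
    """Return every ancestor 'directory' of checkpoint_path, nearest first.

    Hypothetical helper for illustration only; it does not call Azure or Flink.
    """
    path = PurePosixPath(checkpoint_path)
    # For a relative path, .parents ends with "." (the implicit root); drop it.
    return [str(p) for p in path.parents if str(p) != "."]


prefixes = parent_prefixes(
    "my_user/featureflow/foo-datacenter/cluster_name/my_flink_job/checkpoint/chk-1234"
)
for prefix in prefixes:
    print(prefix)
```

The number of prefixes grows with the depth of the path, so a shallower 
layout would produce fewer entries in this list.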

Each of these calls takes a few seconds, and together they add up to make 
our checkpoints slow. What I was hoping to understand on the Flink side 
is whether making these List calls for each parent ‘directory’ 
/ blob all the way to the top is normal / expected.

We are exploring a couple of other angles on our end (potentially flattening 
the directory / blob structure to reduce the number of these calls, and 
checking whether the latency on the Azure side is expected), but alongside 
this I was hoping to understand whether this behavior on the Flink side is 
expected and whether there’s something we could optimize there as well.

Thanks,

-- Piyush
