Re: Job Manager taking long time to upload job graph on remote storage

2020-09-04 Thread Till Rohrmann
I am not sure at this point that the delay is caused by Flink. I would rather suspect that it has something to do with an external system. Maybe you could try profiling the job submission so that we can see more clearly where the time is spent. Other than that, there might be some options for the GCS filesy

Re: Job Manager taking long time to upload job graph on remote storage

2020-09-04 Thread Prakhar Mathur
Yes, we can try the same in 1.11. Meanwhile, is there any network- or thread-related config that we can tweak for this? On Fri, Sep 4, 2020 at 12:48 PM Till Rohrmann wrote: > From the log snippet it is hard to tell. Flink is not only interacting > with GCS but also with ZooKeeper to store a point

Re: Job Manager taking long time to upload job graph on remote storage

2020-09-04 Thread Till Rohrmann
From the log snippet it is hard to tell. Flink is not only interacting with GCS but also with ZooKeeper to store a pointer to the serialized JobGraph. This can also take some time. Then of course, there could be an issue with the GS filesystem implementation you are using. The fs throughput could

Re: Job Manager taking long time to upload job graph on remote storage

2020-09-03 Thread Prakhar Mathur
Yes, I will check that, but any pointers on why Flink is taking more time than gsutil upload? On Thu, Sep 3, 2020 at 10:14 PM Till Rohrmann wrote: > Hmm then it probably rules GCS out. What about ZooKeeper? Have you > experienced slow response times from your ZooKeeper cluster? > > Cheers, > Til

Re: Job Manager taking long time to upload job graph on remote storage

2020-09-03 Thread Till Rohrmann
Hmm then it probably rules GCS out. What about ZooKeeper? Have you experienced slow response times from your ZooKeeper cluster? Cheers, Till On Thu, Sep 3, 2020 at 6:23 PM Prakhar Mathur wrote: > We tried uploading the same blob from Job Manager k8s pod directly to GCS > using gsutils and it to

Re: Job Manager taking long time to upload job graph on remote storage

2020-09-03 Thread Prakhar Mathur
We tried uploading the same blob from the Job Manager k8s pod directly to GCS using gsutil and it took 2 seconds. The upload speed was 166.8 MiB/s. Thanks. On Wed, Sep 2, 2020 at 6:14 PM Till Rohrmann wrote: > The logs don't look suspicious. Could you maybe check what the write > bandwidth to your

Re: Job Manager taking long time to upload job graph on remote storage

2020-09-02 Thread Till Rohrmann
The logs don't look suspicious. Could you maybe check what the write bandwidth to your GCS bucket is from the machine you are running Flink on? It should be enough to generate a 200 MB file and write it to GCS. Thanks a lot for your help in debugging this matter. Cheers, Till On Wed, Sep 2, 2020
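The bandwidth check suggested in this message could be scripted roughly as follows. This is only a sketch: `make_test_file` and `measure_upload` are hypothetical helper names, and the upload step is passed in as a callable because the thread does not say which client is used to write to the bucket.

```python
import os
import tempfile
import time
from typing import Callable

MIB = 1024 * 1024


def make_test_file(path: str, size_mib: int) -> None:
    """Write size_mib MiB of random bytes to path, one MiB at a time."""
    with open(path, "wb") as f:
        for _ in range(size_mib):
            f.write(os.urandom(MIB))


def measure_upload(path: str, upload: Callable[[str], None]) -> float:
    """Run upload(path), time it, and return the throughput in MiB/s."""
    size_mib = os.path.getsize(path) / MIB
    start = time.perf_counter()
    upload(path)
    # Guard against a zero-length interval for trivially fast uploads.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size_mib / elapsed
```

To reproduce the test described here, one would call `make_test_file(path, 200)` and pass an `upload` callable that copies the file to the bucket, for example by invoking `gsutil cp` through `subprocess.run`. The bucket name is deployment-specific and is not given in the thread.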

Re: Job Manager taking long time to upload job graph on remote storage

2020-09-02 Thread Prakhar Mathur
Hi, Thanks for the response. Yes, we are running Flink in HA mode. We checked and there are no such quota limits for GCS for us. Please find the logs below; here you can see that the copying of the blob started at 11:50:39,455 and it got the JobGraph submission at 11:50:46,400. 2020-09-01 11:50:37,061 DEBUG org.a
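As a quick check on the two timestamps quoted in this message, the gap between them can be computed as below. This is a minimal sketch: `elapsed_seconds` is a hypothetical helper, and the format string assumes the `HH:MM:SS,mmm` layout seen in the quoted log lines.

```python
from datetime import datetime

# Time-of-day layout used in the quoted log lines (comma before milliseconds).
LOG_TIME_FORMAT = "%H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two same-day log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


blob_copy_started = "11:50:39,455"
job_graph_submission = "11:50:46,400"
print(elapsed_seconds(blob_copy_started, job_graph_submission))
```

For these two values the gap is a little under 7 seconds.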

Re: Job Manager taking long time to upload job graph on remote storage

2020-09-02 Thread Till Rohrmann
Hi Prakhar, have you enabled HA for your cluster? If yes, then Flink will try to store the job graph to the configured high-availability.storageDir in order to be able to recover it. If this operation takes long, then it is either the filesystem which is slow or storing the pointer in ZooKeeper. I

Job Manager taking long time to upload job graph on remote storage

2020-09-02 Thread Prakhar Mathur
Hi, We are currently running Flink 1.9.0. We see a delay of around 20 seconds when starting a job on a session Flink cluster. We start the job using Flink's monitoring REST API, where our jar is already uploaded to the Job Manager. Our jar file size is around 200 MB. We are using memory state backe