Ryan van Huuksloot created FLINK-36884:
------------------------------------------

             Summary: GCS 503 Error Codes for `flink-checkpoints/<id>/shared/` 
file after upload complete
                 Key: FLINK-36884
                 URL: https://issues.apache.org/jira/browse/FLINK-36884
             Project: Flink
          Issue Type: Bug
          Components: FileSystems, Runtime / Checkpointing
    Affects Versions: 1.18.0
         Environment: We are using Flink 1.18.0 with the gs-plugin.

It is a rare bug, but one we have noticed multiple times.
            Reporter: Ryan van Huuksloot
         Attachments: Screenshot 2024-12-10 at 1.46.06 PM.png

We had a Flink pipeline that, all of a sudden, started to fail on a single 
subtask [Image 1]. This does not block checkpointing for the rest of the DAG, 
so the checkpoint barriers continue on.

We investigated the issue and found that the checkpoint was retrying the same 
write over and over: it retried writing the file thousands of times. The issue 
persisted across checkpoints and savepoints, but only ever failed for one 
specific file.

An example log:

 
{code:java}
Dec 10, 2024 6:06:05 PM com.google.cloud.hadoop.util.RetryHttpInitializer$LoggingResponseHandler handleResponse
INFO: Encountered status code 503 when sending PUT request to URL 'https://storage.googleapis.com/upload/storage/v1/b/<bucket>/o?ifGenerationMatch=0&name=flink-checkpoints/2394318276860454f7b6d1689f770796/shared/7d6bb60b-e0cf-4873-afc1-f2d785a4418e&uploadType=resumable&upload_id=<upload_id>'. Delegating to response handler for possible retry.
...{code}
 

{*}It is important to note that the file was in fact there. I am not sure 
whether it was complete, but it was not an .inprogress file, so I believe it 
was complete{*}.

 

I even tried deleting the file in GCS and waiting for a new checkpoint to 
occur, but the same issue persisted.
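
For reference, the check (and delete) above can be done with the 
google-cloud-storage Java client along the lines of the sketch below. This is 
only a sketch (the class name is made up, the bucket is the placeholder from 
the log, and the object key is taken from the failing PUT), not exactly what I 
ran. Note that the failing PUT carries {{ifGenerationMatch=0}} (create only if 
the object does not already exist), so an existing generation would normally 
be rejected with a 412 rather than a 503.

{code:java}
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class InspectSharedStateObject {
    public static void main(String[] args) {
        // Placeholder bucket (as in the log) plus the object key from the failing PUT.
        String bucket = "<bucket>";
        String object = "flink-checkpoints/2394318276860454f7b6d1689f770796/shared/"
                + "7d6bb60b-e0cf-4873-afc1-f2d785a4418e";

        Storage storage = StorageOptions.getDefaultInstance().getService();
        Blob blob = storage.get(BlobId.of(bucket, object));

        if (blob == null) {
            System.out.println("object does not exist");
        } else {
            // The failing PUT uses ifGenerationMatch=0 ("create only if absent"),
            // so an existing generation here would normally mean a 412, not a 503.
            System.out.println("generation=" + blob.getGeneration() + ", size=" + blob.getSize());

            // Optionally remove the object and wait for the next checkpoint,
            // as described above.
            storage.delete(blob.getBlobId());
        }
    }
}
{code}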

 

There was no issue when we restarted the job from a savepoint. The problem 
seems to be limited to this one very specific file.

 

I also tried it locally. I got a 503 from this endpoint with the same 
upload_id:
{noformat}
https://storage.googleapis.com/upload/storage/v1/<bucket>{noformat}
However, it worked fine with this API (with a new upload_id):
{noformat}
https://storage.googleapis.com/<path>{noformat}
I could not find the merged file on the TaskManager pod to try the upload from 
there while it was failing.
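
For anyone trying to reproduce the resumable-upload path outside of Flink, a 
rough sketch using the google-cloud-storage Java client is below; as far as I 
understand, {{Storage#writer}} opens a resumable session against the same 
{{/upload/storage/v1}} endpoint (with a new upload_id). The object key is a 
hypothetical probe key next to the affected shared/ directory, not a file 
Flink actually wrote:

{code:java}
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

public class ResumableUploadProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder bucket and a hypothetical probe key next to the affected shared/ dir.
        String bucket = "<bucket>";
        String object = "flink-checkpoints/2394318276860454f7b6d1689f770796/shared/probe";

        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobInfo info = BlobInfo.newBuilder(BlobId.of(bucket, object)).build();

        // Opens a resumable upload session (new upload_id) and writes a small payload.
        try (WritableByteChannel writer = storage.writer(info)) {
            writer.write(ByteBuffer.wrap("probe".getBytes(StandardCharsets.UTF_8)));
        }
        System.out.println("resumable upload completed: gs://" + bucket + "/" + object);
    }
}
{code}
This only exercises the resumable session, not the second (direct object path) 
URL form above, which is the one that worked for me locally.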



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
