Dear Flink community,
I recently ran into this issue at job startup. It happens from time to
time. Here is the exception from the job manager:
2021-08-17 01:21:01,944 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source:
Defence raw event prod05_analytics_outpu
1.13.1.
>
> Best,
> Yangze Guo
>
> On Tue, Jul 27, 2021 at 9:41 AM Ivan Yang wrote:
Dear Flink experts,
We recently ran into an issue during a job cancellation after upgrading to 1.13.
After we issue a cancel (from the Flink console or flink cancel {jobid}), a few
subtasks get stuck in the cancelling state. Once it gets into that situation,
the behavior is consistent. Those “cancelling tasks
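For context, Flink's task-cancellation behavior is tunable through two standard flink-conf.yaml keys; the values below are the documented defaults, shown as an illustration rather than a confirmed fix for this particular hang:

```yaml
# Interval (ms) between repeated interrupt attempts while a task is cancelling.
task.cancellation.interval: 30000
# Timeout (ms) after which a task that still cannot be cancelled causes a
# fatal TaskManager error; 0 disables this watchdog.
task.cancellation.timeout: 180000
```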
o many
> files exist in the S3 bucket?
>
> AFAIK, if the K8s HA services work normally, only one completedCheckpoint
> file will be retained. Once a
> new one is generated, the old one will be deleted.
>
>
> Best,
> Yang
>
> Ivan Yang mailto:ivanygy...@gmail.c
Hi Dear Flink users,
We recently enabled the ZooKeeper-less HA in our Kubernetes Flink
deployment. The setup has
high-availability.storageDir: s3://some-bucket/recovery
We have a relatively short (7 day) retention policy on the S3 bucket, so
the HA will fail if the submitte
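For reference, a minimal Kubernetes-HA configuration of the shape described looks roughly like the following (the cluster id and bucket are placeholders, not the poster's actual values); one way to avoid a bucket-wide retention policy deleting live HA metadata is to scope the S3 lifecycle rule so it excludes the recovery prefix:

```yaml
kubernetes.cluster-id: my-flink-cluster   # placeholder
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://some-bucket/recovery
```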
Hi Yun,
Thank you so much for your suggestions.
(1) The job couldn’t restore from the last checkpoint. The exception is in my
original email.
(2) No, I didn’t change any multipart upload settings.
(3) The file is gone. I have another batch process that reads the Flink output
S3 bucket and pushes obj
Hi all,
We got this exception after a job restart. Does anyone know what may lead to
this situation, and how to get past this checkpoint issue? Prior to this, the
job failed due to “Checkpoint expired before completing.” We are S3-heavy,
writing out 10K files to S3 every 10 minutes using Stream
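Since the preceding failure was “Checkpoint expired before completing,” one commonly adjusted knob is the checkpoint timeout; a sketch using the standard configuration keys (values illustrative, not a confirmed fix for this job):

```yaml
# Fail a checkpoint only if it takes longer than 30 minutes (default is 10 min).
execution.checkpointing.timeout: 30 min
# Leave some breathing room between heavy checkpoints.
execution.checkpointing.min-pause: 1 min
```

The same timeout can also be set in code via env.getCheckpointConfig().setCheckpointTimeout(...).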
s the root cause (or) if
> the root cause is something else which triggers this issue.
>
> On Sat, Aug 1, 2020 at 9:36 AM Ivan Yang <mailto:ivanygy...@gmail.com>> wrote:
Hi Rahul,
Try increasing taskmanager.network.memory.max to 1GB, basically double what
you have now. However, you only have 4GB of RAM for the entire TM, and a 1GB
network buffer seems out of proportion with 4GB total RAM. Reducing the amount
of shuffling will require fewer network buffers. But if you
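The key mentioned above belongs to the pre-1.10 network buffer options; a sketch of the suggested change (sizes illustrative):

```yaml
taskmanager.network.memory.min: 256mb
taskmanager.network.memory.max: 1gb
taskmanager.network.memory.fraction: 0.1
```

On Flink 1.10 and later, these options were superseded by the taskmanager.memory.network.* keys.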
Jul 24, 2020 at 4:03 AM Ivan Yang <mailto:ivanygy...@gmail.com>> wrote:
Hello everyone,
We recently upgraded Flink from 1.9.1 to 1.11.0 and found one strange behavior:
when we stop a job with a savepoint, we get the following timeout error.
I checked the Flink web console; the savepoint is created in S3 in 1 second. The
job is fairly simple, so 1 second for savepoint generation is e
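For reference, the stop-with-savepoint invocation being discussed looks like this in Flink 1.11 (job id and target path are placeholders):

```shell
# Stop the job gracefully, taking a savepoint first.
# -p/--savepointPath overrides the configured default savepoint directory.
./bin/flink stop -p s3://my-bucket/savepoints <jobId>
```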
Hello,
In the Flink web UI Overview tab, the “Completed Job List” displays recently
completed or cancelled jobs only for a short period of time. After a while,
they are gone. The Job Manager is up and has never restarted. Is there a config
key to keep job history in the Completed Job List for a longer time? I am
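As far as I know, retention of finished jobs in that list is governed by the JobManager's job store settings; a hedged sketch (values illustrative, defaults differ):

```yaml
# How long (in seconds) a completed job stays in the job store / UI list
# (the default is 3600, i.e. one hour).
jobstore.expiration-time: 86400
# Upper bound on the number of completed jobs kept.
jobstore.max-capacity: 1000
```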
Hi,
I have set up Flink 1.9.1 on Kubernetes on AWS EKS with one job manager pod and
10 task manager pods, one pod per EC2 instance. The job runs fine. After a
while, for some reason, one pod (task manager) crashed, then the pod restarted.
After that, the job got into a bad state. All the parallelisms a
Hi,
We have a Flink job that reads data from an input stream, then converts each
event from a JSON string to an Avro object, and finally writes to Parquet files
using StreamingFileSink with an OnCheckpointRollingPolicy of 5 mins. Basically
a stateless job. Initially, we used one map operator to convert JSON st
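A job of the shape described would look roughly like the sketch below, assuming generic Avro records and Flink's bundled Parquet writers; the source, mapper, schema, and output path are placeholders, not the poster's actual code:

```java
// Sketch only: requires flink-parquet and Avro on the classpath.
DataStream<String> json = env.addSource(mySource);                 // placeholder source
DataStream<GenericRecord> records =
    json.map(new JsonToAvroMapFn(schema));                         // hypothetical mapper

StreamingFileSink<GenericRecord> sink = StreamingFileSink
    .forBulkFormat(new Path("s3://my-bucket/output"),              // placeholder path
                   ParquetAvroWriters.forGenericRecord(schema))
    .build();  // bulk formats roll on every checkpoint (OnCheckpointRollingPolicy)
records.addSink(sink);
```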