Re: Flink job failure during yarn node termination

2021-08-04 Thread Rainie Li
Hi Nicolaus, I double checked our hdfs config again; it is set to 1 instead of 2. I will try the solution you provided. Thanks again. Best regards Rainie On Wed, Aug 4, 2021 at 10:40 AM Rainie Li wrote: > Thanks for the context Nicolaus. > We are using S3 instead of HDFS. > >
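For reference, the replication factor discussed in this thread would presumably be controlled by `dfs.replication` in `hdfs-site.xml`; a minimal sketch, assuming the value 2 mentioned above:

```xml
<!-- hdfs-site.xml: replicate each block on two datanodes so checkpoint
     data survives the loss of a single node (the value is illustrative) -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```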

Re: Flink job failure during yarn node termination

2021-08-04 Thread Rainie Li
stacktrace: > https://stackoverflow.com/questions/64400280/flink-unable-to-recover-after-yarn-node-termination > Do you replicate data on multiple hdfs nodes as suggested in the answer > there? > > Best, > Nico > > On Wed, Aug 4, 2021 at 9:24 AM Rainie Li wrote: > >> Thanks Ti

Re: Flink job failure during yarn node termination

2021-08-04 Thread Rainie Li
the actively > maintained Flink versions (1.12 or 1.13) and try whether it works with this > version. > > Cheers, > Till > > On Tue, Aug 3, 2021 at 9:56 AM Rainie Li wrote: > >> Hi Flink Community, >> >> My flink application is running version 1.9 and it faile

Flink job failure during yarn node termination

2021-08-03 Thread Rainie Li
Hi Flink Community, My flink application is running version 1.9 and it failed to recover (the application was running but checkpoints failed and the job stopped processing data) during hadoop yarn node termination. *Here is the job manager log error:* *2021-07-26 18:02:58,605 INFO org.apache.hadoop.io.retry.

Re: Savepoint failure with operation not found under key

2021-06-29 Thread Rainie Li
I see, so more than 5 minutes had passed. Thanks for the help. Best regards Rainie On Tue, Jun 29, 2021 at 12:29 AM Chesnay Schepler wrote: > How much time has passed between the requests? (You can only query the > status for about 5 minutes) > > On 6/29/2021 6:37 AM, Rai

Re: Savepoint failure with operation not found under key

2021-06-28 Thread Rainie Li
savepoint. > The meta information for such requests is only stored locally on each JM > and neither distributed to all JMs nor persisted anywhere. > > Did you send both requests (the one for creating the savepoint and the one for > querying the status) to the same JM? > > On 6/26/202

Savepoint failure with operation not found under key

2021-06-26 Thread Rainie Li
Hi Flink Community, I found this error when I tried to create a savepoint for my flink job. It's on version 1.9. { "errors": [ "Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@57b9711e" ] } Here is the error from the JM log: 2021-06-2

Re: Flink Version 1.11 job savepoint failures

2021-05-03 Thread Rainie Li
of a bigger issue?) addressed in Flink 1.13 (see > FLINK-21066). I'm pulling Yun Gao into this thread. Let's see whether Yun > can confirm that finding. > > I hope that helps. > Matthias > > [1] https://issues.apache.org/jira/browse/FLINK-21066 > > On Mon, May 3

Flink Version 1.11 job savepoint failures

2021-05-03 Thread Rainie Li
Hi Flink Community, Our flink jobs are on version 1.11 and we use this to trigger savepoints: $ bin/flink savepoint :jobId [:targetDirectory] We can successfully get a trigger ID with the savepoint path, but we saw these errors when querying the savepoint endpoint: https://ci.apache.org/projects/flink/flink-d
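For context, the CLI command above maps onto Flink's REST API: triggering a savepoint is an asynchronous operation that returns a trigger ID, which is then polled on a status endpoint (the same endpoint involved in the "Operation not found under key" thread, since that status is held locally per JobManager and only for a limited time). A minimal sketch of how those URLs are built; the host, port, and IDs are placeholders:

```java
public class SavepointRest {
    // POSTing to this endpoint starts an asynchronous savepoint and
    // returns a trigger id in the response body
    static String triggerUrl(String host, String jobId) {
        return String.format("http://%s/jobs/%s/savepoints", host, jobId);
    }

    // GETting this endpoint reports the async operation's status
    static String statusUrl(String host, String jobId, String triggerId) {
        return String.format("http://%s/jobs/%s/savepoints/%s", host, jobId, triggerId);
    }

    public static void main(String[] args) {
        System.out.println(triggerUrl("jobmanager:8081", "<jobId>"));
        System.out.println(statusUrl("jobmanager:8081", "<jobId>", "<triggerId>"));
    }
}
```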

Re: Flink application has slight data loss using Processing Time

2021-03-22 Thread Rainie Li
data. > > Regards, > David > > > > On Sat, Mar 20, 2021 at 12:02 AM Rainie Li wrote: > >> Hi Arvid, >> >> After increasing producer.kafka.request.timeout.ms from 9 to 12. >> The job has been running fine for almost 5 days, but one of the tasks

Re: Flink application has slight data loss using Processing Time

2021-03-19 Thread Rainie Li
1 at 7:12 AM Rainie Li wrote: > Thanks for the suggestion, Arvid. > Currently my job is using producer.kafka.request.timeout.ms=9 > I will try to increase to 12. > > Best regards > Rainie > > On Thu, Mar 11, 2021 at 3:58 AM Arvid Heise wrote: > >> Hi R

Re: Flink application has slight data loss using Processing Time

2021-03-11 Thread Rainie Li
loss. So you probably also want to increase the > transaction timeout. > > [1] > https://stackoverflow.com/questions/53223129/kafka-producer-timeoutexception > > On Mon, Mar 8, 2021 at 8:34 PM Rainie Li wrote: > >> Thanks for the info, David. >>
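The two knobs discussed in this thread can be sketched as Kafka producer properties. This is a hedged sketch: the broker address and both values are placeholders, not taken from the thread, and with exactly-once sinks the broker-side `transaction.max.timeout.ms` must be at least as large as the producer's `transaction.timeout.ms`:

```java
import java.util.Properties;

public class ProducerTimeouts {
    // Producer properties that would be handed to a Flink Kafka producer;
    // all values here are illustrative
    static Properties producerConfig() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        // How long the producer waits for a broker response before failing a request
        props.setProperty("request.timeout.ms", "120000");
        // For exactly-once sinks, open transactions must outlive the longest
        // expected recovery time, or unacknowledged data can be discarded
        props.setProperty("transaction.timeout.ms", "900000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerConfig());
    }
}
```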

Re: Flink application has slight data loss using Processing Time

2021-03-08 Thread Rainie Li
loss, depending on how you manage > the offsets, transactions, and other state during the restart. What > happened in this case? > > David > > On Mon, Mar 8, 2021 at 7:53 PM Rainie Li wrote: > >> Thanks Yun and David. >> There were some tasks that got restarted. We

Re: Flink application has slight data loss using Processing Time

2021-03-08 Thread Rainie Li
restarts, or is this discrepancy observed without > any disruption to the processing? > > Regards, > David > > On Mon, Mar 8, 2021 at 10:14 AM Rainie Li wrote: > >> Thanks for the quick response, Smile. >> I don't use window operators or flatmap. >> Here is the core log

Re: Flink application has slight data loss using Processing Time

2021-03-08 Thread Rainie Li
Thanks for the quick response, Smile. I don't use window operators or flatMap. Here is the core logic of my filter; it only iterates over the filters list. Will *rebalance()* cause it? Thanks again. Best regards Rainie SingleOutputStreamOperator> matchedRecordsStream = eventStream .rebalanc
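Since the snippet above is cut off, here is a hypothetical reconstruction of a filter that "only iterates over the filters list". It illustrates the point made later in the thread: matching logic like this, and rebalance() itself (which merely redistributes records round-robin across subtasks), does not drop data by itself; loss would come from restarts without consistent offsets or from sink timeouts:

```java
import java.util.List;
import java.util.function.Predicate;

public class FilterChain {
    // Keep a record only if it matches at least one configured filter.
    // This is an illustrative stand-in for the truncated Flink operator above.
    static boolean matchesAny(String record, List<Predicate<String>> filters) {
        for (Predicate<String> f : filters) {
            if (f.test(record)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Predicate<String>> filters =
                List.of(s -> s.contains("a"), s -> s.contains("b"));
        System.out.println(matchesAny("abc", filters)); // true
        System.out.println(matchesAny("xyz", filters)); // false
    }
}
```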

Flink application has slight data loss using Processing Time

2021-03-08 Thread Rainie Li
Hello Flink Community, Our flink application is on v1.9; the basic logic of this application is consuming one large kafka topic, filtering some fields, then producing data to a new kafka topic. After comparing the original kafka topic count with the generated kafka topic based on the same field by usin

Re: Flink application kept restarting

2021-03-04 Thread Rainie Li
If you think the issue is due to high latency in communicating with your > Kafka cluster, increase your configured timeouts. > > > > If you think the issue is due to (possibly temporary) dynamic config > changes, check your error handling. Retrying the fetch from the Flink side > wil

Re: Flink application kept restarting

2021-03-04 Thread Rainie Li
destroyed." in your > case. It looks like you had some timeout problem while fetching data from a > Kafka topic. > > Matthias > > On Tue, Mar 2, 2021 at 10:39 AM Rainie Li wrote: > >> Thanks for checking, Matthias. >> >> I have another flink job which

Re: Flink application kept restarting

2021-03-03 Thread Rainie Li
handling the operator-related network buffer is destroyed. That > causes the "java.lang.RuntimeException: Buffer pool is destroyed." in your > case. It looks like you had some timeout problem while fetching data from a > Kafka topic. > > Matthias > > On Tue, Mar 2, 2021 at

Re: Flink application kept restarting

2021-03-02 Thread Rainie Li
problem between the TaskManager and the JobManager that caused the shutdown >> of the operator resulting in the NetworkBufferPool to be destroyed. For >> this scenario I would expect other failures to occur besides the ones you >> shared. >> >> Best, >> Ma

Re: Flink application kept restarting

2021-02-26 Thread Rainie Li
Matthias > > > On Fri, Feb 26, 2021 at 8:39 AM Rainie Li wrote: > >> Hi All, >> >> Our flink application kept restarting and it did lots of RPC calls to a >> dependency service. >> >> *We saw this exception from failed task manager log: * >> o

Flink application kept restarting

2021-02-25 Thread Rainie Li
Hi All, Our flink application kept restarting and it made lots of RPC calls to a dependency service. *We saw this exception in the failed task manager log: * org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator at org.apache.flink.s

Re: Flink job finished unexpected

2021-02-24 Thread Rainie Li
right before that time. > > In both cases, Flink should restart the job with the correct restart > policies if configured. > > On Sat, Feb 20, 2021 at 10:07 PM Rainie Li wrote: > >> Hello, >> >> I launched a job with a larger load on hadoop yarn cluster. >&g

Flink job finished unexpected

2021-02-20 Thread Rainie Li
Hello, I launched a job with a larger load on the hadoop yarn cluster. The job finished after running for 5 hours; I didn't find any error in the JobManager log besides this connect exception. *2021-02-20 13:20:14,110 WARN akka.remote.transport.netty.NettyTransport - Remote connection

Re: Flink app cannot restart

2020-07-23 Thread Rainie Li
than 1. > > > BTW, the logs you provided are not Yarn NodeManager logs. And if you could > provide the full jobmanager > log, it will help a lot. > > > > Best, > Yang > > Rainie Li 于2020年7月22日周三 下午3:54写道: > >> Hi Flink help, >> >> I a

Flink app cannot restart

2020-07-22 Thread Rainie Li
Hi Flink help, I am new to Flink. I am investigating one flink app that cannot restart when we lose the yarn node manager (tc.yarn.rm.cluster.NumActiveNMs=0), while other flink apps restart automatically. *Here is the job's restart policy setting:* *env.setRestartStrategy(RestartStrategies.fixedDelay
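The fixed-delay policy set in code above can also be configured cluster-wide in flink-conf.yaml; a sketch, where the attempt count and delay are placeholders rather than the job's actual values:

```yaml
# Restart the job up to 3 times, waiting 10 s between attempts
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```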

Re: flink app crashed

2020-07-15 Thread Rainie Li
Rainie Li wrote: > Thank you, Jesse. > > Here are more log info: > > 2020-07-15 18:19:36,456 INFO org.apache.flink.client.cli.CliFrontend > - > > 2020

Re: flink app crashed

2020-07-15 Thread Rainie Li
send more > of the log or find an error line that might help others debug. > > > > Thanks, > > Jesse > > > > *From: *Rainie Li > *Date: *Wednesday, July 15, 2020 at 10:54 AM > *To: *"user@flink.apache.org" > *Subject: *flink app crashed > >

flink app crashed

2020-07-15 Thread Rainie Li
Hi All, I am new to Flink; any idea why my flink app's Job Manager is stuck? Here is the bottom part of the Job Manager log. Any suggestions will be appreciated. 2020-07-15 16:49:52,749 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.Sta