Re: Flink job hangs using rocksDb as backend

2018-07-23 Thread Stefan Richter
Hi, yes, timers cannot easily fire in parallel to event processing for correctness reasons because they both manipulate the state and there should be a distinct order of operations. If it is literally stuck, then it is obviously a problem. From the stack trace it looks pretty clear that the cul

Re: Flink job hangs using rocksDb as backend

2018-07-23 Thread shishal singh
Thanks Stefan, you are correct. I learned the hard way that when timers fire, processing of new events stops until all timer callbacks complete. This is the point at which I decided to isolate the problem by scheduling only 5-6K timers in total so that even if it's taking time in timers it s

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-23 Thread Gerard Garcia
- > From: Gerard Garcia > Sent: 2018-07-17 (Tuesday) 21:53 > To: piotr > Cc: fhueske; wangzhijiang999 <wangzhijiang...@aliyun.com>; user; nico <n...@data-artisans.com> > Subject: Re: Flink job hangs/deadlocks (possibly related to out of m

Re: Flink job hangs using rocksDb as backend

2018-07-23 Thread Stefan Richter
Hi, let me first clarify what you mean by "stuck": just because your job stops consuming events for some time does not necessarily mean that it is "stuck". That is very hard to evaluate from the information we have so far, because from the stack trace you cannot conclude that the thread is "stu

Re: Flink job hangs using rocksDb as backend

2018-07-20 Thread shishal singh
Hi Richter, for testing I have now reduced the number of timers to a few thousand (5-6K), but my job still gets stuck randomly, and it is not reproducible each time. When I restart the job, it works again for a few hours or days and then gets stuck again. I took thread dum

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-17 Thread Zhijiang(wangzhijiang999)
Sent: 2018-07-17 (Tuesday) 21:53 To: piotr Cc: fhueske; wangzhijiang999; user; nico Subject: Re: Flink job hangs/deadlocks (possibly related to out of memory) Yes, I'm using Flink 1.5.0 and what I'm serializing is a really big record (probably too big, we have already started working to

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-17 Thread Piotr Nowojski
ck the >> full gc delay. >> >> Best, >> Zhijiang >> -- >> From: Gerard Garcia mailto:ger...@talaia.io>> >> Sent: 2018-07-13 (Friday) 16:22 >> To: wangzhijiang999 > <mailto:wangzhijiang.

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-16 Thread Piotr Nowojski
work. Also you can monitor the gc status to check the full gc > delay. > > Best, > Zhijiang > -- > From: Gerard Garcia mailto:ger...@talaia.io>> > Sent: 2018-07-13 (Friday) 16:22 > To: wangzhijiang999 <mailto:wangzhijiang...@aliyun.com>> > Cc: user mailto:user@flink.a

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-16 Thread Fabian Hueske
d memory is not > managed by flink framework. Also you can monitor the gc status to check the > full gc delay. > > Best, > Zhijiang > > -- > From: Gerard Garcia > Sent: 2018-07-13 (Friday) 16:22 > To: wangzhij

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-13 Thread Zhijiang(wangzhijiang999)
framework. Also you can monitor the gc status to check the full gc delay. Best, Zhijiang -- From: Gerard Garcia Sent: 2018-07-13 (Friday) 16:22 To: wangzhijiang999 Cc: user Subject: Re: Flink job hangs/deadlocks (possibly related to out of m

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-13 Thread Gerard Garcia
ing, the task manager process will be exited > finally to trigger restarting the job. > > Zhijiang > > -- > From: Gerard Garcia > Sent: 2018-07-02 (Monday) 18:29 > To: wangzhijiang999 > Cc: user > Subject: Re: Flink job

Re: Flink job hangs using rocksDb as backend

2018-07-12 Thread Stefan Richter
Hi, did you check the metrics for the garbage collector? Being stuck with high CPU consumption and lots of timers sounds like there could be a problem, because timers are currently on-heap objects, but we are working on RocksDB-based timers right now. Best, Stefan > Am 12.07.2018 um 14:54 sc
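[Editor's note] One way to check garbage-collector activity from inside a JVM is the standard java.lang.management API. The following is only a minimal sketch using the JDK; the class and method names (GcStatsSketch, totalCollectionTimeMillis) are made up for illustration and are not part of Flink. A total that keeps growing quickly while the job makes no progress would point towards GC pressure.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not a Flink class: reads GC counters from the running JVM.
class GcStatsSketch {

    /** Sums the accumulated collection time of all collectors, in milliseconds. */
    static long totalCollectionTimeMillis() {
        long total = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Print one line per collector: name, number of collections, accumulated time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC time (ms): " + totalCollectionTimeMillis());
    }
}
```

Sampling these counters twice, a few seconds apart, shows how much of the interval was spent in GC.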

Re: Flink job hangs using rocksDb as backend

2018-07-12 Thread shishal singh
Thanks Stefan/Stephan/Nico, indeed there are 2 problems. For the 2nd problem, I am almost certain that the explanation given by Stephan is the true one in my case, as the number of timers is in the millions (each for a different key, so I guess coalescing is not an option for me). If I simplify my problem,

Re: Flink job hangs using rocksDb as backend

2018-07-12 Thread Stefan Richter
Hi, adding to what has already been said, I think that there can be two orthogonal problems here: i) why is your job slowing down/getting stuck? and ii) why is cancellation blocked? As for ii) I think Stephan already gave the right reason that shutdown could take longer and that is what gets the

Re: Flink job hangs using rocksDb as backend

2018-07-11 Thread Nico Kruber
If this is about too many timers and your application allows it, you may also try to reduce the timer resolution and thus frequency by coalescing them [1]. Nico [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/operators/process_function.html#timer-coalescing On 11/07/1
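[Editor's note] The coalescing idea can be illustrated without Flink: round each target firing time down to a coarser resolution so that nearby timers collapse onto the same timestamp. This is a sketch of the rounding arithmetic only; the class name TimerCoalescingSketch, the 60 s timeout and the 1 s resolution are assumptions chosen for the example. In a ProcessFunction the rounded value would be what gets passed to the timer service, and it only reduces the number of timers registered for the same key.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of timer coalescing; does not use any Flink classes.
class TimerCoalescingSketch {

    /** Rounds a (non-negative) firing time down to a multiple of the resolution. */
    static long coalesce(long targetTimeMillis, long resolutionMillis) {
        return (targetTimeMillis / resolutionMillis) * resolutionMillis;
    }

    public static void main(String[] args) {
        long timeoutMillis = 60_000L;      // assumed timeout per event
        long resolutionMillis = 1_000L;    // assumed timer resolution: one second
        long[] eventTimestamps = {1_000L, 1_250L, 1_999L, 2_001L};

        // Distinct firing times after coalescing; four events end up on two timestamps.
        Set<Long> distinctTimers = new TreeSet<>();
        for (long ts : eventTimestamps) {
            distinctTimers.add(coalesce(ts + timeoutMillis, resolutionMillis));
        }
        System.out.println(distinctTimers); // prints [61000, 62000]
    }
}
```

The trade-off is precision: a timer may fire up to one resolution step earlier than the exact target time.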

Re: Flink job hangs using rocksDb as backend

2018-07-11 Thread Stephan Ewen
Hi shishal! I think there is an issue with cancellation when many timers fire at the same time. These timers have to finish before shutdown happens, and this seems to take a while in your case. Did the TM process actually kill itself in the end (and get restarted)? On Wed, Jul 11, 2018 at 9:29 AM,

Flink job hangs using rocksDb as backend

2018-07-11 Thread shishal
Hi, I am using Flink 1.4.2 with RocksDB as backend. I am using a process function with timers on EventTime. For checkpointing I am using HDFS. I am doing load testing, so I am reading Kafka from the beginning (approx. 7 days of data with 50M events). My job gets stuck after approx. 20 min with no error. There

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-02 Thread Zhijiang(wangzhijiang999)
to trigger restarting the job. Zhijiang -- From: Gerard Garcia Sent: 2018-07-02 (Monday) 18:29 To: wangzhijiang999 Cc: user Subject: Re: Flink job hangs/deadlocks (possibly related to out of memory) Thanks Zhijiang, We haven't found any

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-02 Thread Zhijiang(wangzhijiang999)
whether and where caused the OOM. Maybe check the task failure logs. Zhijiang -- From: gerardg Sent: 2018-06-30 (Saturday) 00:12 To: user Subject: Re: Flink job hangs/deadlocks (possibly related to out of memory) (fixed formatting) Hello

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-06-29 Thread gerardg
(fixed formatting) Hello, We have experienced some problems where a task just hangs without showing any kind of log error while other tasks running in the same task manager continue without problems. When these tasks are restarted the task manager gets killed and shows several errors similar to

Flink job hangs/deadlocks (possibly related to out of memory)

2018-06-29 Thread gerardg
Hello, we have experienced some problems where a task just hangs without showing any kind of log error while other tasks running in the same task manager continue without problems. When these tasks are restarted the task manager gets killed and shows several errors similar to these ones: [Canceler/In

Re: Simple batch job hangs if run twice

2016-09-23 Thread Yassine MARZOUGUI
I found out how to dump the stacktrace (using jps & jstack). Please find attached the stacktrace I got when the job got stuck. Thanks, Yassine 2016-09-23 11:48 GMT+02:00 Fabian Hueske : > Yes, log files and stacktraces are different things. > A stacktrace shows the call hierarchy of all threads

Re: Simple batch job hangs if run twice

2016-09-23 Thread Fabian Hueske
Yes, log files and stacktraces are different things. A stacktrace shows the call hierarchy of all threads in a JVM at the time when it is taken. So you can see the method that is currently executed (and from where it was called) when the stacktrace is taken. In case of a deadlock, you see where the
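[Editor's note] What such a stacktrace contains can be reproduced from inside a JVM with the standard ThreadMXBean: every thread, its state, and its call hierarchy, plus a check for threads the JVM itself recognises as deadlocked. This is a minimal sketch using only the JDK; the class name ThreadDumpSketch is made up for illustration. For an already running TaskManager an external tool such as jstack is the usual route, since it needs no code changes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: builds a textual thread dump of the current JVM.
class ThreadDumpSketch {

    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();

        // One block per thread: name, state, then the call hierarchy (innermost frame first).
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }

        // Threads blocked on each other's object monitors; null means none were found.
        long[] deadlocked = threads.findMonitorDeadlockedThreads();
        sb.append("deadlocked threads: ")
          .append(deadlocked == null ? 0 : deadlocked.length).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

A thread that is merely waiting (for example in Object.wait) is not reported as deadlocked, so a hang caused by a missing notification still has to be found by reading the frames.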

Re: Simple batch job hangs if run twice

2016-09-23 Thread Yassine MARZOUGUI
Hi Fabian, Not sure if this answers your question, here is the stack I got when debugging the combine and datasource operators when the job got stuck: "DataSource (at main(BatchTest.java:28) (org.apache.flink.api.java.io.TupleCsvInputFormat)) (1/8)" at java.lang.Object.wait(Object.java) at org.ap

Re: Simple batch job hangs if run twice

2016-09-23 Thread Fabian Hueske
Hi Yassine, can you share a stacktrace of the job when it got stuck? Thanks, Fabian 2016-09-22 14:03 GMT+02:00 Yassine MARZOUGUI : > The input splits are correctly assigned. I noticed that whenever the job > is stuck, that is because the task *Combine (GroupReduce at > first(DataSet.java:573)) *

Re: Simple batch job hangs if run twice

2016-09-22 Thread Robert Metzger
Can you try running with DEBUG logging level? Then you should see if input splits are assigned. Also, you could try to use a debugger to see what's going on. On Mon, Sep 19, 2016 at 2:04 PM, Yassine MARZOUGUI < y.marzou...@mindlytix.com> wrote: > Hi Chesnay, > > I am running Flink 1.1.2, and usin

Re: Simple batch job hangs if run twice

2016-09-19 Thread Yassine MARZOUGUI
Hi Chesnay, I am running Flink 1.1.2, and using NetBeans 8.1. I made a screencast reproducing the problem here: http://recordit.co/P53OnFokN4 . Best, Yassine 2016-09-19 10:04 GMT+02:00 Chesnay Schepler : > No, I can't recall that I had this happen to me. > > I wo

Re: Simple batch job hangs if run twice

2016-09-19 Thread Chesnay Schepler
No, I can't recall that I had this happen to me. I would enable logging and try again, as well as checking whether the second job is actually running through the web interface. If you tell me your NetBeans version I can try to reproduce it. Also, which version of Flink are you using? On 19.09

Re: Simple batch job hangs if run twice

2016-09-18 Thread Aljoscha Krettek
Hmm, this sounds like it could be IDE/Windows specific; unfortunately I don't have access to a Windows machine. I'll loop in Chesnay, who is using Windows. Chesnay, do you maybe have an idea what could be the problem? Have you ever encountered this? On Sat, 17 Sep 2016 at 15:30 Yassine MARZOUGUI w

Re: Simple batch job hangs if run twice

2016-09-17 Thread Yassine MARZOUGUI
Hi Aljoscha, Thanks for your response. By the first time I mean I hit run from the IDE (I am using NetBeans on Windows) the first time after building the program. If I then stop it and run it again (without rebuilding), it is stuck in the state RUNNING. Sometimes I have to rebuild it, or close the

Re: Simple batch job hangs if run twice

2016-09-17 Thread Aljoscha Krettek
Hi, when is the "first time"? It seems you have tried this repeatedly, so what differentiates a "first time" from the other times? Are you closing your IDE in-between or do you mean running the job a second time within the same program? Cheers, Aljoscha On Fri, 9 Sep 2016 at 16:40 Yassine MARZOUGU

Simple batch job hangs if run twice

2016-09-09 Thread Yassine MARZOUGUI
Hi all, When I run the following batch job inside the IDE for the first time, it outputs results and switches to FINISHED, but when I run it again it is stuck in the state RUNNING. The csv file size is 160 MB. What could be the reason for this behaviour? public class BatchJob { public static

Re: Job hangs

2016-04-27 Thread Fabian Hueske
> I've previously seen large batch jobs hang because of join deadlocks. We > should have fixed those problems, but we might have missed some corner > case. Did you check whether there was any cpu activity when the job hangs? > Can you try running htop on the taskmanager machines an

Re: Job hangs

2016-04-27 Thread Vasiliki Kalavri
Hi Timur, I've previously seen large batch jobs hang because of join deadlocks. We should have fixed those problems, but we might have missed some corner case. Did you check whether there was any cpu activity when the job hangs? Can you try running htop on the taskmanager machines and s

Re: Job hangs

2016-04-26 Thread Timur Fayruzov
Robert, Ufuk, logs, execution plan and a screenshot of the console are in the archive: https://www.dropbox.com/s/68gyl6f3rdzn7o1/debug-stuck.tar.gz?dl=0 Note that when I looked in the backpressure view I saw back pressure 'high' on following paths: Input->code_line:123,124->map->join Input->code_

Re: Job hangs

2016-04-26 Thread Ufuk Celebi
Can you please also provide the execution plan via env.getExecutionPlan()? On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov wrote: > Hello Robert, > > I observed progress for 2 hours (meaning numbers change on dashboard), and > then I waited for 2 hours more. I'm sure it had to spill at some p

Re: Job hangs

2016-04-26 Thread Timur Fayruzov
Hello Robert, I observed progress for 2 hours (meaning numbers change on dashboard), and then I waited for 2 hours more. I'm sure it had to spill at some point, but I figured 2h is enough time. Thanks, Timur On Apr 26, 2016 1:35 AM, "Robert Metzger" wrote: > Hi Timur, > > thank you for sharing t

Re: Job hangs

2016-04-26 Thread Robert Metzger
Hi Timur, thank you for sharing the source code of your job. That is helpful! It's a large pipeline with 7 joins and 2 co-groups. Maybe your job is much more IO heavy with the larger input data because all the joins start spilling? Our monitoring, in particular for batch jobs, is really not very adv

Re: Job hangs

2016-04-26 Thread Ufuk Celebi
No. If you run on YARN, the YARN logs are the relevant ones for the JobManager and TaskManager. The client log submitting the job should be found in /log. – Ufuk On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov wrote: > I will do it my tomorrow. Logs don't show anything unusual. Are there any >

Re: Job hangs

2016-04-26 Thread Timur Fayruzov
I will do it my tomorrow. Logs don't show anything unusual. Are there any logs besides what's in flink/log and yarn container logs? On Apr 26, 2016 1:03 AM, "Ufuk Celebi" wrote: Hey Timur, is it possible to connect to the VMs and get stack traces of the Flink processes as well? We can first hav

Re: Job hangs

2016-04-26 Thread Ufuk Celebi
Hey Timur, is it possible to connect to the VMs and get stack traces of the Flink processes as well? We can first have a look at the logs, but the stack traces will be helpful if we can't figure out what the issue is. – Ufuk On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann wrote: > Could you sha

Re: Job hangs

2016-04-26 Thread Till Rohrmann
Could you share the logs with us, Timur? That would be very helpful. Cheers, Till On Apr 26, 2016 3:24 AM, "Timur Fayruzov" wrote: > Hello, > > Now I'm at the stage where my job seems to completely hang. Source code is > attached (it won't compile but I think gives a very good idea of what > happ

Job hangs

2016-04-25 Thread Timur Fayruzov
Hello, Now I'm at the stage where my job seems to completely hang. Source code is attached (it won't compile but I think gives a very good idea of what happens). Unfortunately I can't provide the datasets. Most of them are about 100-500MM records, I try to process on EMR cluster with 40 tasks 6GB m