Re: JobManager trying to re-submit jobs after failover

2016-07-27 Thread Hironori Ogibayashi
Thank you for telling me about the cause. It recovered by restarting jobmanager-5 and jobmanager-1. I restart jobmanager-1 because when I restarted jobmanager-5 , checkpointing started to fail with the following message. 2016-07-28 10:42:28,217 WARN org.apache.flink.runtime.checkpoint.Checkp

how to start tuning to prevent OutOfMemory

2016-07-27 Thread Istvan Soos
Hi, We can see an occasional OOM issue with our Flink jobs. Maybe the input got more diverse, and the grouping has much more keys, not really sure about that part. How do you usually tackle these issues? We are running with parallelism between 5-30. Would it help if we turn that down? We do set

Re: JobManager trying to re-submit jobs after failover

2016-07-27 Thread Ufuk Celebi
Thanks for the logs. Looking through them it's caused by this issue: https://issues.apache.org/jira/browse/FLINK-3800. The ExecutionGraph (Flink's internal scheduling structure) is not terminated properly and tries to restart the job over and over again. This is fixed for 1.1.0. Is it an option fo

Re: .so linkage error in Cluster

2016-07-27 Thread Debaditya Roy
Hello users, I rebuilt the project and now on doing mvn clean package i have got two jar files and I can run with the fat jar in the local jvm properly. However when executing in the cluster I get error as follows: Source: Custom Source -> Flat Map -> Sink: Unnamed(1/1) switched to FAILED java.la

OffHeap support in Flink

2016-07-27 Thread Aakash Agrawal
Hello, I am comparing Flink, Spark and some other streaming frameworks to find the best fit for my project. Currently we have a system, which works on single server and uses off-heap to save data. We now want to go distributed with streaming support. So we have designed a rough data flow for th

Re: No output when using event time with multiple Kafka partitions

2016-07-27 Thread Yassin Marzouki
I just tried playing with the source paralleism setting, and I got a very strange result: If specify the source parallism using env.addSource(kafka).setParallelism(N), results are printed correctly for any number N except for N=4. I guess that's related to the number of task slots since I have a 4

Re: No output when using event time with multiple Kafka partitions

2016-07-27 Thread Yassin Marzouki
Hi Kostas, When I remove the window and the apply() and put print() after assignTimestampsAndWatermarks, the messages are printed correctly: 2> Request{ts=2015-01-01, 06:15:34:000} 2> Request{ts=2015-01-02, 16:38:10:000} 2> Request{ts=2015-01-02, 18:58:41:000} 2> Request{ts=2015-01-02, 19:10:00:0

Re: JobManager trying to re-submit jobs after failover

2016-07-27 Thread Ufuk Celebi
Which version of Flink are you running on? I think this might have been fixed for the 1.1 release (http://people.apache.org/~uce/flink-1.1.0-rc1/). It looks like the ExecutionGraph is still trying to restart although the JobManager is not the leader anymore. If you could provide the complete logs

Re: Sporadic exceptions when checkpointing to S3

2016-07-27 Thread Ufuk Celebi
Hey Gary, your configuration looks good to me. I think that it could be an issue with S3 as you suggest. It might help to decrease the checkpointing interval (if you use case requirements allow for this) in order to have less interaction with S3. In general, your program should still continue as e

Re: If I chain two windows, what event-time would the second window have?

2016-07-27 Thread Yassin Marzouki
Hi Kostas, Thank you very much for the explanation. Best, Yassine On Wed, Jul 27, 2016 at 1:09 PM, Kostas Kloudas wrote: > Hi Yassine, > > When the WindowFunction is applied to the content of a window, the > timestamp of the resulting record > is the window.maxTimestamp, which is the endOfWind

JobManager trying to re-submit jobs after failover

2016-07-27 Thread Hironori Ogibayashi
Hello, I have standalone Flink cluster with JobManager HA. Last night, JobManager failovered because of the connection timeout to Zookeeper. Job is successfully running under new leader JobManager, but when I see the old leader JobManager log, it is trying to re-submit job and getting errors. ( fo

Re: fault tolerance: suspend and resume?

2016-07-27 Thread Ufuk Celebi
Yes, the back pressure behaviour you describe is correct. With checkpointing enable, the job should resume as soon as the sink can contact the backend service again (you would see that the job fails many times until the service is live again, but at the end it should work). You can control the rest

Re: No output when using event time with multiple Kafka partitions

2016-07-27 Thread Kostas Kloudas
Hi Yassine, Could you just remove the window and the apply, and just put a print() after the: > .assignTimestampsAndWatermarks(new AscendingTimestampExtractor() { > @Override > public long extractAscendingTimestamp(Request req) { > return req.ts; > } > }) This at least will

Re: If I chain two windows, what event-time would the second window have?

2016-07-27 Thread Kostas Kloudas
Hi Yassine, When the WindowFunction is applied to the content of a window, the timestamp of the resulting record is the window.maxTimestamp, which is the endOfWindow-1. You can imaging if you have a Tumbling window from 0 to 2000, the result will have a timestamp of 1999. Window boundaries are

AW: Performance issues with GroupBy?

2016-07-27 Thread Paschek, Robert
Hi Gábor, hi Ufuk, hi Greg, thank you for your very helpful responses! > You can try to make your `RichGroupReduceFunction` implement the > `GroupCombineFunction` interface, so that Flink can do combining > before the shuffle, which might significantly reduce the network load. > (How much the

[VOTE] Release Apache Flink 1.1.0 (RC1)

2016-07-27 Thread Ufuk Celebi
Dear Flink community, Please vote on releasing the following candidate as Apache Flink version 1.1.0. I've CC'd user@flink.apache.org as users are encouraged to help testing Flink 1.1.0 for their specific use cases. Please feel free to report issues and successful tests on d...@flink.apache.org.

Sporadic exceptions when checkpointing to S3

2016-07-27 Thread Gary Yao
Hi all, I am using the filesystem state backend with checkpointing to S3. From the JobManager logs, I can see that it works most of the time, e.g., 2016-07-26 17:49:07,311 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 3 @ 1469555347310 2016-07-26 17

fault tolerance: suspend and resume?

2016-07-27 Thread Istvan Soos
Hi, I was wondering how Flink's fault tolerance works, because this page is short on the details: https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html My environment has a backend service that may be out for a couple of hours (sad, but working on fixing that). I