Checkpoints and event ordering

2016-02-03 Thread shikhar
…streams, where some events might get buffered while later ones make it through? More concretely, when the Kafka consumer persists an offset to ZooKeeper upon receiving a checkpoint trigger, can I trust that all events from before that offset are not held in any windowing intermediate state? …

Kafka partition alignment for event time

2016-02-08 Thread shikhar
My Flink job is doing aggregations on top of event-time-based windowing across Kafka partitions. As I have been developing and restarting it, the state for the catch-up periods becomes unreliable -- lots of duplicate emits for time windows already seen before, which I have to discard since my sink …

Re: Kafka partition alignment for event time

2016-02-08 Thread shikhar
Things make more sense after coming across http://mail-archives.apache.org/mod_mbox/flink-user/201512.mbox/%3CCANC1h_vVUT3BkFFck5wJA2ju_sSenxmE=Fiizr=ds6tbasy...@mail.gmail.com%3E -- I need to ensure the parallelism is at least the number of partitions. This seems like a gotcha that could be better documented…

Re: Kafka partition alignment for event time

2016-02-08 Thread shikhar
Stephan explained in that thread that we're picking the min watermark when doing operations that join streams from multiple sources. If we have an m:n partition-to-source assignment where m>n, the source is going to end up with the max watermark. Having m<=n ensures that the lowest watermark is used. …
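The two behaviors being contrasted can be shown with a toy model (the object and method names here are mine, not Flink's API): a multi-input operator advances to the minimum of its input watermarks, while a single source task multiplexing several Kafka partitions with an ascending-timestamp extractor ends up tracking the maximum timestamp it has seen across all of them.

```scala
// Toy model of the watermark skew described above; not Flink's actual classes.
object WatermarkModel {
  // A downstream operator joining multiple inputs takes the minimum
  // watermark across its input channels.
  def operatorWatermark(inputWatermarks: Seq[Long]): Long = inputWatermarks.min

  // A single source task reading interleaved partitions tracks the max
  // timestamp seen so far, regardless of which partition produced it --
  // so a lagging partition cannot hold the watermark back.
  def sourceWatermark(timestampsSeen: Seq[Long]): Long = timestampsSeen.max
}
```

With per-partition watermarks 100, 40, and 70, a dedicated-source-per-partition setup (m<=n) would hold the joined watermark at 40, whereas one source multiplexing all three advances it to 100.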

Re: Kafka partition alignment for event time

2016-02-09 Thread shikhar
I am assigning timestamps using a threshold-based extractor <https://gist.github.com/shikhar/2d9306e2ebd8ca89728c> -- the static delta from the last timestamp is probably sufficient, and the PriorityQueue for allowing outliers not necessary; that is something I added while figuring out wh…
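A minimal sketch of the static-delta idea (class and member names are illustrative, not taken from the linked gist): the watermark trails the largest timestamp seen by a fixed delta, which tolerates modestly out-of-order events without any PriorityQueue.

```scala
// Sketch of a threshold-based timestamp extractor: the watermark is the
// max timestamp observed so far minus a fixed allowed delta.
class ThresholdTimestampExtractor(allowedDelta: Long) {
  private var maxTimestamp = Long.MinValue

  // Record the element's timestamp, keeping track of the max seen.
  def extractTimestamp(ts: Long): Long = {
    if (ts > maxTimestamp) maxTimestamp = ts
    ts
  }

  // The watermark trails the max timestamp by the allowed delta;
  // out-of-order elements within the delta never move it backwards.
  def currentWatermark: Long =
    if (maxTimestamp == Long.MinValue) Long.MinValue
    else maxTimestamp - allowedDelta
}
```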

Re: Kafka partition alignment for event time

2016-02-09 Thread shikhar
Hi Fabian, sorry, I should have been clearer. What I meant (or now know!) by duplicate emits is that, since the watermark progresses more rapidly than the state of the offsets on some partitions when the source multiplexes more than one partition, messages from the lagging partitions are …

Re: Kafka partition alignment for event time

2016-02-09 Thread shikhar
Yes, that approach seems perfect Stephan, thanks for creating the JIRA! It is not only when resetting to smallest: I have observed uneven progress on partitions skewing the watermark any time the source is not caught up to the head of each partition it is handling, like when stopping for a few minutes …

Flink packaging makes life hard for SBT fat jar's

2016-02-12 Thread shikhar
Repro at https://github.com/shikhar/flink-sbt-fatjar-troubles, run `sbt assembly`. A fat jar seems like the best way to provide jobs for Flink to execute. I am declaring deps like: ``` "org.apache.flink" %% "flink-clients" % "1.0-SNAPSHOT" % "provided"… ```

Re: Flink packaging makes life hard for SBT fat jar's

2016-02-15 Thread shikhar
Stephan Ewen wrote:
> Do you know why you are getting conflicts on the FashHashMap class, even though the core Flink dependencies are "provided"? Does adding the Kafka connector pull in all the core Flink dependencies?

Yes, the core Flink dependencies are being pulled in transitively from the Kafka connector…

Re: Flink packaging makes life hard for SBT fat jar's

2016-02-17 Thread shikhar
This seems to work to generate the assembly, hopefully not missing any required transitive deps:
```
"org.apache.flink" %% "flink-clients" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
"org.apache.kafka" %% "kafka" % "0.8.2.2",
("or…
```
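For context, a hedged sketch of what such a build.sbt layout can look like (the version strings and the exclusion rule are assumptions for illustration, not the original truncated snippet): core Flink artifacts are "provided" so sbt-assembly leaves them out of the fat jar, while the Kafka connector is bundled with its transitive pull of core Flink excluded.

```scala
// build.sbt sketch -- illustrative only; versions and exclusions are assumptions.
val flinkVersion = "1.0-SNAPSHOT"

libraryDependencies ++= Seq(
  // Provided by the Flink runtime at execution time; kept out of the fat jar.
  "org.apache.flink" %% "flink-clients"         % flinkVersion % "provided",
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
  // Bundled into the fat jar, minus the core Flink deps it pulls transitively.
  ("org.apache.flink" %% "flink-connector-kafka-0.8" % flinkVersion)
    .exclude("org.apache.flink", "flink-streaming-scala_2.11")
)
```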

Re: Flink packaging makes life hard for SBT fat jar's

2016-02-22 Thread shikhar
Hi Till, thanks so much for sorting this out! One suggestion: could the Flink template depend on a connector (Kafka?) -- this would verify that assembly works smoothly for a very common use case where you need to include connector JARs. Cheers, Shikhar

Re: Flink packaging makes life hard for SBT fat jar's

2016-03-02 Thread shikhar
Hi Till, I just tried creating an assembly with RC4:
```
"org.apache.flink" %% "flink-clients" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-connector-kafka-0.8" % flinkVersion,
```
It actually succeeds…

Periodic actions

2016-03-03 Thread shikhar
I am trying to have my job also run a periodic action by using a custom source that emits a dummy element periodically and a sink that executes the callback, as shown in the code below. However, as soon as I start the job and check the state in the JobManager UI, this particular Sink->Source combo is …
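The shape of the pattern can be sketched with plain JVM scheduling (this is not Flink's SourceFunction/SinkFunction API, just the periodic-emit-plus-callback idea in miniature): a scheduler emits on a fixed period and a callback plays the role of the sink executing the action.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch of a periodic action driver: fires `action` every `periodMillis`
// on a single-threaded scheduler, standing in for the dummy-source + sink pair.
class PeriodicAction(periodMillis: Long, action: () => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit =
    scheduler.scheduleAtFixedRate(
      new Runnable { def run(): Unit = action() },
      0L, periodMillis, TimeUnit.MILLISECONDS)

  def stop(): Unit = scheduler.shutdownNow()
}
```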

Re: Windows, watermarks, and late data

2016-03-03 Thread shikhar
In case this helps, this is a Scala helper I am using to filter out late data on a KeyedStream. The last-timestamp state is maintained at the key level.
```
implicit class StrictlyAscendingByTime[T, K](stream: KeyedStream[T, K]) {
  def filterStrictlyAscendingTime(timestampExtractor: T => Long…
```
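The core logic of such a helper can be shown self-contained, outside Flink (class and method names here are mine): track the last timestamp per key and admit an element only if its timestamp is strictly greater than the last one seen for that key.

```scala
import scala.collection.mutable

// Per-key strictly-ascending filter: the essence of the KeyedStream helper,
// without the Flink wrapper. Late or duplicate timestamps are dropped.
class StrictlyAscendingFilter[K] {
  private val lastSeen = mutable.Map.empty[K, Long]

  // Returns true (and records the timestamp) if `ts` is strictly greater
  // than the last timestamp accepted for `key`; false otherwise.
  def accept(key: K, ts: Long): Boolean = {
    val ok = lastSeen.get(key).forall(ts > _)
    if (ok) lastSeen(key) = ts
    ok
  }
}
```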

Re: Flink packaging makes life hard for SBT fat jar's

2016-03-03 Thread shikhar
Thanks Till. I can confirm that things are looking good with RC5. sbt-assembly works well with the flink-kafka connector dependency not marked as "provided".

Re: Periodic actions

2016-03-04 Thread shikhar
Wow, that's embarrassing :D That was indeed the issue.