Thanks for filling us in.

If the problem is that the timestamp difference between partitions
sometimes becomes large (e.g. when resetting to the smallest offset),
then it could probably be solved along the lines of what is suggested in
https://issues.apache.org/jira/browse/FLINK-3375: running a watermark
assigner (ascending, threshold-based, or whatever fits) per partition
inside the Kafka source.
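
Roughly, the idea would be something like the following. This is only a
minimal sketch to illustrate per-partition tracking, not FLINK-3375 or
Flink API code; the class name, method names and the threshold value are
made up for illustration:

import java.util.HashMap;
import java.util.Map;

// One watermark tracker per Kafka partition; the watermark the source
// instance actually emits is the minimum over all partitions it reads.
class PerPartitionWatermarks {

    // allowed out-of-orderness per partition (illustrative value)
    private static final long THRESHOLD_MS = 10_000;

    // highest timestamp seen so far, per partition
    private final Map<Integer, Long> maxTimestamps = new HashMap<>();

    // called for every record the consumer emits
    void onRecord(int partition, long timestamp) {
        maxTimestamps.merge(partition, timestamp, Math::max);
    }

    // watermark for the whole source instance: the minimum over the
    // per-partition watermarks, so a partition that is still catching up
    // (e.g. after an offset reset) holds the watermark back
    long currentWatermark() {
        if (maxTimestamps.isEmpty()) {
            return Long.MIN_VALUE;
        }
        long min = Long.MAX_VALUE;
        for (long maxTs : maxTimestamps.values()) {
            min = Math.min(min, maxTs - THRESHOLD_MS);
        }
        return min;
    }
}

The source would call onRecord() for every element it emits and
periodically forward currentWatermark() downstream, so a single source
subtask reading several partitions could not jump ahead just because one
partition is further along than the others.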

What do you think?


On Tue, Feb 9, 2016 at 3:01 PM, shikhar <shik...@schmizz.net> wrote:

> I am assigning timestamps using a threshold-based extractor
> (https://gist.github.com/shikhar/2d9306e2ebd8ca89728c) -- the static delta
> from the last timestamp is probably sufficient, and the PriorityQueue for
> allowing outliers is not necessary; that is something I added while
> figuring out what was going on.
>
> The timestamps across partitions don't differ that much in normal operation,
> when stream processing is caught up with the head of the partitions, so the
> thresholding works well. However, during catch-up (e.g. if I stop the job
> for a bit and start it again, or there is no offset in ZK and I'm using
> 'auto.offset.reset=smallest'), the source tends to emit messages with much
> larger deviations, and the timestamp extraction, which is not
> partition-aware, starts producing an incorrect watermark.
>
>
> Aljoscha Krettek wrote
> > Hi,
> > in general it should not be a problem if one parallel instance of a source
> > is responsible for several Kafka partitions. It can become a problem if
> > the timestamps in the different partitions differ by a lot and the
> > watermark assignment logic is not able to handle this.
> >
> > How are you assigning the timestamps/watermarks in your job?
> >
> > Cheers,
> > Aljoscha
> >> On 08 Feb 2016, at 21:51, shikhar <shikhar@...> wrote:
> >>
> >> Stephan explained in that thread that we're picking the min watermark
> >> when doing operations that join streams from multiple sources. If we have
> >> an m:n partition-source assignment where m>n, the source is going to end
> >> up with the max watermark. Having m<=n ensures that the lowest watermark
> >> is used.
> >>
> >> Re: automatic enforcement, perhaps allowing for more than 1 Kafka partition
> >> on a source should require opt-in, e.g. allowOversubscription()