Re: Watermarks as "process completion" flags

Anton Polyakov Sun, 29 Nov 2015 07:13:25 -0800

Hi Fabian

Defining a special flag for record seems like a checkpoint barrier. I think I 
will end up re-implementing checkpointing myself. I found the discussion in 
flink-dev: mail-archives.apache.org/mod_mbox/flink-dev/201511.mbox/… 
<http://mail-archives.apache.org/mod_mbox/flink-dev/201511.mbox/%3CCA+faj9xDFAUG_zi==e2h8s-8r4cn8zbdon_hf+1rud5pjqv...@mail.gmail.com%3E>
 which seems to solve my task. Essentially they want to have a mechanism which 
will mark record produced by job as “last” and then wait until it’s fully 
propagated through DAG. Similarly to what I need. Essentially my job which 
produces trades can also thought as being finished once it produced all trades, 
then I just need to wait till latest trade produced by this job is processed.


So although windows can probably also be applied, I think propagating barrier 
through DAG and checkpointing at final job is what I need.

Can I possibly utilize internal Flink’s checkpoint barriers (i.e. like 
triggering a custom checkoint or finishing streaming job)? 

> On 24 Nov 2015, at 21:53, Fabian Hueske <fhue...@gmail.com> wrote:
> 
> Hi Anton,
> 
> If I got your requirements right, you are looking for a solution that 
> continuously produces updated partial aggregates in a streaming fashion. When 
> a  special event (no more trades) is received, you would like to store the 
> last update as a final result. Is that correct?
> 
> You can compute continuous updates using a reduce() or fold() function. These 
> will produce a new update for each incoming event.
> For example:
> 
> val s: DataStream[(Int, Long)] = ...
> s.keyBy(_._1)
>   .reduce( (x,y) => (x._1, y._2 + y._2) )
> 
> would continuously compute a sum for every key (_._1) and produce an update 
> for each incoming record.
> 
> You could add a flag to the record and implement a ReduceFunction that marks 
> a record as final when the no-more-trades event is received.
> With a filter and a data sink you could emit such final records to a 
> persistent data store.
> 
> Btw.: You can also define custom trigger policies for windows. A custom 
> trigger is called for each element that is added to a window and when certain 
> timers expire. For example with a custom trigger, you can evaluate a window 
> for every second element that is added. You can also define whether the 
> elements in the window should be retained or removed after the evaluation.
> 
> Best, Fabian
> 
> 
> 
> 2015-11-24 21:32 GMT+01:00 Anton Polyakov <polyakov.an...@gmail.com 
> <mailto:polyakov.an...@gmail.com>>:
> Hi Max
> 
> thanks for reply. From what I understand window works in a way that it 
> buffers records while window is open, then apply transformation once window 
> close is triggered and pass transformed result. 
> In my case then window will be open for few hours, then the whole amount of 
> trades will be processed once window close is triggered. Actually I want to 
> process events as they are produced without buffering them. It is more like a 
> stream with some special mark versus windowing seems more like a batch (if I 
> understand it correctly).
> 
> In other words - buffering and waiting for window to close, then processing 
> will be equal to simply doing one-off processing when all events are 
> produced. I am looking for a solution when I am processing events as they are 
> produced and when source signals "done" my processing is also nearly done.
> 
> 
> On Tue, Nov 24, 2015 at 2:41 PM, Maximilian Michels <m...@apache.org 
> <mailto:m...@apache.org>> wrote:
> Hi Anton,
> 
> You should be able to model your problem using the Flink Streaming
> API. The actions you want to perform on the streamed records
> correspond to transformations on Windows. You can indeed use
> Watermarks to signal the window that a threshold for an action has
> been reached. Otherwise an eviction policy should also do it.
> 
> Without more details about what you want to do I can only refer you to
> the streaming API documentation:
> Please see 
> https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html
>  
> <https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html>
> 
> Thanks,
> Max
> 
> On Sun, Nov 22, 2015 at 8:53 PM, Anton Polyakov
> <polyakov.an...@gmail.com <mailto:polyakov.an...@gmail.com>> wrote:
> > Hi
> >
> > I am very new to Flink and in fact never used it. My task (which I 
> > currently solve using home grown Redis-based solution) is quite simple - I 
> > have a system which produces some events (trades, it is a financial system) 
> > and computational chain which computes some measure accumulatively over 
> > these events. Those events form a long but finite stream, they are produced 
> > as a result of end of day flow. Computational logic forms a processing DAG 
> > which computes some measure over these events (VaR). Each trade is 
> > processed through DAG and at different stages might produce different set 
> > of subsequent events (like return vectors), eventually they all arrive into 
> > some aggregator which computes accumulated measure (reducer).
> >
> > Ideally I would like to process trades as they appear (i.e. stream them) 
> > and once producer reaches end of portfolio (there will be no more trades), 
> > I need to write final resulting measure and mark it as “end of day record”. 
> > Of course I also could use a classical batch - i.e. wait until all trades 
> > are produced and then batch process them, but this will be too inefficient.
> >
> > If I use Flink, I will need a sort of watermark saying - “done, no more 
> > trades” and once this watermark reaches end of DAG, final measure can be 
> > saved. More generally would be cool to have an indication at the end of DAG 
> > telling to which input stream position current measure corresponds.
> >
> > I feel my problem is very typical yet I can’t find any solution. All 
> > examples operate either on infinite streams where nobody cares about 
> > completion or classical batch examples which rely on fact all input data is 
> > ready.
> >
> > Can you please hint me.
> >
> > Thank you vm
> > Anton
> 
>

Re: Watermarks as "process completion" flags

Reply via email to