How about writing out a lineage record to a different topic from every
Samza job?  Throw a little metadata along with the data so that every
job that touches a piece of data also writes out read/wrote records to a
separate tracking topic.
A read/wrote record would look something like:
{orig_id, # the first time a record is seen, you put its id in here
 new_id,  # for the one-to-many cases; whether you track by this id or just the
          # original id (I think orig is good enough) is up to you
 job,     # job name, so you know who saw it
 action,  # read/wrote/skipped/etc.
 time     # for stats/latency reasons
}
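
To make that concrete, here is a rough sketch of emitting such records with
Samza's low-level StreamTask API. The "lineage" and "enriched-trades" topic
names, the "trade-enrichment" job name, and the map-style message format are
all made up for illustration; serdes and config are left out:

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class TradeEnrichmentTask implements StreamTask {
  // hypothetical topic names; use whatever fits your setup
  private static final SystemStream LINEAGE = new SystemStream("kafka", "lineage");
  private static final SystemStream OUTPUT = new SystemStream("kafka", "enriched-trades");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    // assumes a map/JSON serde and that the source already stamped an orig_id
    Map<String, Object> trade = (Map<String, Object>) envelope.getMessage();
    String origId = (String) trade.get("orig_id");

    // the real processing; may fan out to zero-to-many descendant events
    collector.send(new OutgoingMessageEnvelope(OUTPUT, origId, trade));

    // tracking records, keyed by orig_id so they all land in one partition
    collector.send(new OutgoingMessageEnvelope(LINEAGE, origId,
        lineage(origId, "wrote")));   // one per descendant event produced
    collector.send(new OutgoingMessageEnvelope(LINEAGE, origId,
        lineage(origId, "read")));    // sent after the wrotes so the tracker
                                      // never sees a transient "all consumed"
  }

  private Map<String, Object> lineage(String origId, String action) {
    Map<String, Object> r = new HashMap<>();
    r.put("orig_id", origId);
    r.put("new_id", origId);                   // tracking by orig_id only here
    r.put("job", "trade-enrichment");          // job name, so you know who saw it
    r.put("action", action);                   // read/wrote/skipped/etc.
    r.put("time", System.currentTimeMillis()); // stats/latency
    return r;
  }
}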
The first job would write the first read/wrote record (whatever id the
original record is given becomes the key for all read/wrote records), and from
then on it's a simple summation game to tell how far every downstream process
has gotten with consuming the forked variations of the original record.
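
And a sketch of that summation game, again with made-up names and an
in-memory map standing in for what would really be a changelog-backed Samza
KeyValueStore. It counts wrotes minus reads per orig_id and assumes the source
also emits a "wrote" for the original trade; when the count drops back to
zero, nothing derived from that trade is still waiting to be consumed. A real
version would also have to cope with lineage records arriving out of order
across jobs:

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class LineageTrackerTask implements StreamTask {
  // in-memory for the sketch; a real job would keep this in a KeyValueStore
  private final Map<String, Integer> inFlight = new HashMap<>();

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    Map<String, Object> record = (Map<String, Object>) envelope.getMessage();
    String origId = (String) record.get("orig_id");
    String action = (String) record.get("action");

    int count = inFlight.getOrDefault(origId, 0);
    if ("wrote".equals(action)) {
      count++;   // another forked variation of the trade is now in flight
    } else if ("read".equals(action) || "skipped".equals(action)) {
      count--;   // some job consumed (or dropped) one of the variations
    }

    if (count <= 0) {
      // every wrote has a matching read: the trade is fully processed
      System.out.println("trade " + origId + " fully processed");
      inFlight.remove(origId);
    } else {
      inFlight.put(origId, count);
    }
  }
}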

 I have not done that in Samza yet, but it's pretty standard ops for other
processing I've done.

On Wed, Dec 2, 2015 at 10:37 AM, Rick Mangi <r...@chartbeat.com> wrote:

> Hi Anton,
>
> Samza doesn’t have the same concept of an ack as Storm does built in. This
> could be seen as a good or a bad thing.
> On one hand, the ack is very expensive in Storm; on the other hand, you can
> very easily do what you are describing.
> Samza topologies aren't restricted to DAGs; you can have jobs that feed back
> into themselves.
>
> In Samza, the checkpoint in Kafka for each system consumer is probably the
> closest equivalent. You can use the checkpointing to ensure that you have
> processed every message.
> In the case of fanning out to sub-tasks, I would suggest aggregating the
> results of each task downstream with another job that can signal when each
> sub-task has completed.
>
> Perhaps someone else has a better solution.
>
> HTH,
>
> Rick
>
>
>
> > On Dec 2, 2015, at 1:41 AM, Anton Polyakov <polyakov.an...@gmail.com> wrote:
> >
> > Hi Rick
> >
> > But the processing pipeline is more than one job, working in parallel, not
> > serially. This is why a single "done" flag is not enough. Imagine a
> > situation when 1 trade creates 3 sub-events and they are processed
> > concurrently by 3 instances of the job. Then I need to wait for at least 3
> > flags. And if the pipeline is more complex, there can be more flags, all of
> > which I have to know in advance - so I will be tied to some known topology,
> > and if it changes, the "finishing" condition also changes.
> >
> > On Wednesday, December 2, 2015, Rick Mangi <r...@chartbeat.com> wrote:
> >
> >> The simplest way would be to have your job send a message to another
> >> Kafka topic when it’s done.
> >>
> >> Rick
> >>
> >>
> >>
> >>> On Dec 1, 2015, at 3:44 PM, Anton Polyakov <polyakov.an...@gmail.com> wrote:
> >>>
> >>> Hi
> >>>
> >>> I am looking at Samza to process an incoming stream of trades. The
> >>> processing pipeline is a complex DAG where some nodes might create zero
> >>> to many descendant events. Ultimately they get to the end sink (and by
> >>> then these are completely different events, all originated by one source
> >>> trade).
> >>>
> >>> I am looking for an answer to a quite fundamental (in my view) question:
> >>> given a source trade, how can I know whether it was fully processed by
> >>> the DAG or is still in flight? It is called lineage tracking in Storm,
> >>> afaik.
> >>>
> >>> In my situation, when a trade event is fired I need a reliable way to
> >>> determine when processing is done in order to treat the results. I
> >>> googled a lot of algorithms - checkpointing in Flink, transactions in
> >>> Storm. Checkpointing in Flink could work, but unfortunately I can’t use
> >>> it from my source to mark a trade with a checkpoint barrier - the API is
> >>> not available. Looking at Samza, there’s nothing obvious either.
> >>>
> >>> I have a very uncomfortable feeling that the problem should be very
> >>> basic and fundamental; maybe I am using the wrong terms to explain it.
> >>> Take it to the extreme - one can send 1 event to Samza and processing
> >>> takes 15 minutes on some node. How, at the end of the pipeline, should I
> >>> know when it's done?
> >>>
> >>> Thank you
> >>
> >>
>
>
