How about writing out a lineage record to a different topic from every Samza job? Throw a little metadata along with the data, so that every job that touches a piece of data also writes read/wrote records to a separate tracking topic. A read/wrote record would look something like:

{
  orig_id,  # the first time a record is seen you put its id in here
  new_id,   # for the one-to-manys; whether you track by this id or just the original id (I think orig is good enough) is up to you
  job,      # job name, so you know who saw it
  action,   # read/wrote/skipped/etc
  time      # for stats/latency reasons
}

The first job would write the first read/wrote record (whatever id the original record is given becomes the key for all read/wrote records); after that it's a simple summation game to tell where every downstream process is with consuming however many forked variations of the original record exist.
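For concreteness, here's a minimal sketch of what emitting those records could look like from a Samza StreamTask (against the 0.10-era API). The tracking topic name, job name, and message layout are illustrative assumptions on my part, not anything Samza prescribes:

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Sketch: a task that emits a lineage record to a side tracking topic
// for every message it reads or writes. Topic/job names are hypothetical.
public class LineageTrackingTask implements StreamTask {
  // All lineage records are keyed by orig_id so they land in the same
  // partition and can be summed by a single downstream consumer.
  private static final SystemStream TRACKING =
      new SystemStream("kafka", "lineage-tracking");
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "trades-enriched");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    @SuppressWarnings("unchecked")
    Map<String, Object> trade = (Map<String, Object>) envelope.getMessage();
    String origId = (String) trade.get("orig_id");

    // Record that this job saw the message.
    collector.send(new OutgoingMessageEnvelope(
        TRACKING, origId, lineageRecord(origId, origId, "read")));

    // ... actual processing; may fan out 0..N descendant events ...
    String newId = origId + "-child-1"; // illustrative derived id
    collector.send(new OutgoingMessageEnvelope(OUTPUT, newId, trade));

    // Record that this job wrote a descendant, still keyed by orig_id.
    collector.send(new OutgoingMessageEnvelope(
        TRACKING, origId, lineageRecord(origId, newId, "wrote")));
  }

  private Map<String, Object> lineageRecord(String origId, String newId,
                                            String action) {
    Map<String, Object> rec = new HashMap<>();
    rec.put("orig_id", origId);
    rec.put("new_id", newId);
    rec.put("job", "my-enrichment-job"); // this job's name
    rec.put("action", action);           // read / wrote / skipped / etc
    rec.put("time", System.currentTimeMillis());
    return rec;
  }
}

Since everything is keyed by orig_id, all lineage records for one source trade land in the same partition of the tracking topic, so the downstream "summation" job can count reads vs. writes per trade with simple local state.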
I have not done that in Samza yet, but it's pretty standard ops for other processing I've done.

On Wed, Dec 2, 2015 at 10:37 AM, Rick Mangi <r...@chartbeat.com> wrote:

> Hi Anton,
>
> Samza doesn’t have the same concept of an ack as Storm does built in. This
> could be seen as a good or a bad thing.
> On one hand, the ack is very expensive in Storm; on the other hand, you can
> very easily do what you are describing.
> Samza topologies aren't DAGs; you can have jobs that feed back into
> themselves.
>
> In Samza, the checkpoint in Kafka for each system consumer is probably the
> closest equivalent. You can use the checkpointing to ensure that you have
> processed every message.
> In the case of fanning out to sub-tasks, I would suggest aggregating the
> results of each task downstream with another job that can signal when each
> sub-task has completed.
>
> Perhaps someone else has a better solution.
>
> HTH,
>
> Rick
>
>
> > On Dec 2, 2015, at 1:41 AM, Anton Polyakov <polyakov.an...@gmail.com> wrote:
> >
> > Hi Rick
> >
> > But the processing pipeline is more than one job working in parallel, not
> > serially. This is why a single "done" flag is not enough. Imagine a situation
> > where one trade creates 3 sub-events and they are processed concurrently by
> > 3 instances of the job. Then I need to wait for at least 3 flags. And if the
> > pipeline is more complex, there can be more flags, all of which I have to
> > know in advance - so I will stick to some known topology, and if it changes,
> > the "finishing" condition also changes.
> >
> > On Wednesday, December 2, 2015, Rick Mangi <r...@chartbeat.com> wrote:
> >
> >> The simplest way would be to have your job send a message to another Kafka
> >> topic when it’s done.
> >>
> >> Rick
> >>
> >>> On Dec 1, 2015, at 3:44 PM, Anton Polyakov <polyakov.an...@gmail.com> wrote:
> >>>
> >>> Hi
> >>>
> >>> I am looking at Samza to process an incoming stream of trades. The
> >>> processing pipeline is a complex DAG where some nodes might create zero
> >>> to many descendant events. Ultimately they get to the end sink (and by
> >>> then these are completely different events, all originated by one source
> >>> trade).
> >>>
> >>> I am looking for an answer to a quite fundamental (in my view) question:
> >>> given a source trade, how can I know whether it was fully processed by
> >>> the DAG or is still in flight? It is called lineage tracking in Storm,
> >>> afaik.
> >>>
> >>> In my situation, when a trade event is fired I need a reliable way to
> >>> determine when processing is done in order to treat the results. I googled
> >>> a lot of algorithms - checkpointing in Flink, transactions in Storm.
> >>> Checkpointing in Flink could work, but unfortunately I can’t use it from
> >>> my source to mark a trade with a checkpoint barrier - the API is not
> >>> available. Looking at Samza, there’s nothing obvious either.
> >>>
> >>> I have a very uncomfortable feeling that the problem should be very
> >>> basic and fundamental; maybe I am using the wrong terms to explain it.
> >>> Take it to an extreme - one can send 1 event to Samza and processing
> >>> takes 15 minutes on some node. How, at the end of the pipeline, should I
> >>> know when it's done?
> >>>
> >>> Thank you