Hi Aaron,

I've recently been looking at this topic and working on a prototype. The
approach I am trying is "backward tracing", or data provenance tracing,
where we try to explain what inputs and steps have affected the production
of an output record.

Arvid has summarized the most important aspects, and my approach to UIDs is
as he described. I would like to add a few thoughts.

- With this backward tracing approach, it is very difficult to do sampling,
as aggregations / multi-input operators can only be traced if all inputs are
also traced. So this is more useful if you need to be able to explain the
origins of all output records.

- As Arvid mentioned, the trace records can become large and negatively
impact the performance of the pipeline. I'd suggest an approach where each
operator writes its traces directly to some storage. Each trace record has a
UID. If each trace record contains a list/array of the trace UIDs of its
inputs, and you use an appropriate storage system, you can do recursive
lookups based on the trace UIDs to find the complete trace graph for an
output record. You may even want a separate Flink job that pre-processes and
pre-aggregates traces that belong together (although the lateness / ordering
might be difficult to handle).

- If you choose this direct-reporting approach, you still need to pass
along the trace UID in the main pipeline, so that the trace produced by the
next operator can list it among its inputs.

- If you leave the production of the trace records explicit (that is, you
construct and collect the trace record manually in each operator), you
can flexibly choose which inputs to include (e.g. for a large aggregation,
you may only want to list some of the aggregated elements as inputs). You
can then also add any additional metadata to help explain a certain step.

- I've looked into adapting this to OpenTracing, but it didn't seem
well-suited for this task. The span-based approach has a parent-child
relationship that doesn't fit the dataflow model too well. In Flink, with
the backward-tracing approach, the "root span" would logically be the output
record, and its children would need to be constructed earlier. I couldn't
find a way to nicely fit this view into the structure of OpenTracing
records.
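
To make the trace-record idea above more concrete, here is a minimal sketch
in plain Python (not Flink code). All names in it (TraceRecord, TraceStore,
emit_trace, trace_graph) are hypothetical, and the in-memory dict only
stands in for whatever external storage you would actually use. It shows an
operator writing its trace directly to storage, handing the trace UID to the
next operator, and a backward lookup that follows the input UIDs from an
output record.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    """One trace per produced record (hypothetical layout)."""
    uid: str
    operator: str
    inputs: tuple = ()                      # trace UIDs of the inputs used
    metadata: dict = field(default_factory=dict)


class TraceStore:
    """In-memory stand-in for an external trace storage (assumption)."""

    def __init__(self):
        self._records = {}

    def put(self, record):
        self._records[record.uid] = record

    def get(self, uid):
        return self._records.get(uid)


def emit_trace(store, operator, input_uids=(), **metadata):
    """Write a trace directly to storage and return its UID.

    The caller attaches the returned UID to the record it sends
    downstream, so the next operator can list it as an input.
    """
    record = TraceRecord(
        uid=str(uuid.uuid4()),
        operator=operator,
        inputs=tuple(input_uids),
        metadata=metadata,
    )
    store.put(record)
    return record.uid


def trace_graph(store, uid):
    """Collect all traces reachable backwards from `uid`.

    Uses an explicit stack instead of recursion; the result is the same
    as a recursive lookup over the `inputs` lists.
    """
    graph = {}
    pending = [uid]
    while pending:
        current = pending.pop()
        if current in graph:
            continue                        # already visited
        record = store.get(current)
        if record is None:
            continue                        # trace missing, e.g. not yet written
        graph[current] = record
        pending.extend(record.inputs)
    return graph


# Tiny example pipeline: two sources, a map, and a two-input aggregation.
store = TraceStore()
a = emit_trace(store, "source")
b = emit_trace(store, "source")
m = emit_trace(store, "map", [a])
out = emit_trace(store, "window-sum", [m, b], window="10s")

graph = trace_graph(store, out)
```

Because the trace is built explicitly in each step, the aggregation could
just as well pass only a subset of its input UIDs, or add more metadata than
the single window keyword used here.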

Let me know your thoughts; I'd be happy to discuss this further.

Regards,

Balazs Varga  



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
