Hi Aaron,

I've recently been looking at this topic and working on a prototype. The approach I am trying is "backward tracing", or data provenance tracing, where we try to explain what inputs and steps have affected the production of an output record.
Arvid has summarized the most important aspects, and my approach to UIDs is as he described. I would like to add a few thoughts.

- With this backward tracing approach, it is very difficult to do sampling, as aggregations / multi-input operators can only be traced if all of their inputs are also traced. So this is more useful if you need to be able to explain the origins of all output records.
- As Arvid mentioned, the trace records can become large and negatively impact the performance of the pipeline. I'd suggest an approach where each operator outputs its traces directly to some storage. Each trace record has a UID. If each trace record contains a list/array of the UIDs of its inputs, and you use an appropriate storage, you can do recursive lookups based on the trace UIDs to find the complete trace graph for an output record. You may even want a separate Flink job that pre-processes and pre-aggregates traces that belong together (although the lateness / ordering might be difficult to handle).
- If you choose this direct reporting approach, you still need to pass along the trace UID in the main pipeline, so that the trace produced by the next operator can list it among its inputs.
- If you keep the production of the trace records explicit (that is, you construct and collect the trace record manually in each operator), you can flexibly choose which inputs to include (e.g. for a large aggregation, you may only want to list some of the aggregated elements as inputs). You can then also add any additional metadata that helps explain a certain step.
- I've looked into adapting this to OpenTracing, but it didn't seem well-suited for this task. The span-based approach has a parent-child relationship that doesn't fit the dataflow model too well. In Flink, with the backward-tracing approach, the "root span" would logically be the output record, and its children would need to be constructed earlier. I couldn't find a way to nicely fit this view into the structure of OpenTracing records.
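To make the direct reporting idea above a bit more concrete, here is a minimal sketch in plain Python (not Flink code). All names in it (`TraceRecord`, `TraceStore`, `emit_trace`, `trace_graph`) are hypothetical, and the in-memory dict only stands in for whatever storage you would actually use. Each operator writes one trace record listing the UIDs of its inputs and passes only its own UID downstream; the full trace graph of an output is then recovered by following input UIDs backwards.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    uid: str                                         # UID of this trace record
    operator: str                                    # operator that produced it
    input_uids: list = field(default_factory=list)   # UIDs of the traced inputs
    metadata: dict = field(default_factory=dict)     # optional explanation of the step


class TraceStore:
    """In-memory stand-in for the external trace storage (an assumption)."""

    def __init__(self):
        self._records = {}

    def put(self, record):
        self._records[record.uid] = record

    def get(self, uid):
        return self._records.get(uid)


def emit_trace(store, operator, input_uids, metadata=None):
    """Write a trace record to the store and return its UID.

    The operator attaches the returned UID to its output record so that
    the next operator can list it among its own inputs.
    """
    record = TraceRecord(
        uid=str(uuid.uuid4()),
        operator=operator,
        input_uids=list(input_uids),
        metadata=dict(metadata or {}),
    )
    store.put(record)
    return record.uid


def trace_graph(store, output_uid):
    """Collect every trace record reachable backwards from output_uid."""
    graph = {}
    stack = [output_uid]
    while stack:
        uid = stack.pop()
        if uid in graph:
            continue  # already visited (shared upstream record)
        record = store.get(uid)
        if record is None:
            continue  # this input was not traced or was deliberately left out
        graph[uid] = record
        stack.extend(record.input_uids)
    return graph


# Usage: two source records, one map step, one aggregation over both branches.
store = TraceStore()
a = emit_trace(store, "source", [])
b = emit_trace(store, "source", [])
m = emit_trace(store, "map", [a])
out = emit_trace(store, "window-sum", [m, b], {"note": "all inputs listed"})
graph = trace_graph(store, out)
```

Because the trace is built explicitly, the aggregation step is free to list only some of its inputs; the lookup then simply returns a partial graph. In a real setup the recursive lookup would be a query against the chosen storage rather than a dict walk.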
Let me know your thoughts, I'd be happy to discuss this further.

Regards,
Balazs Varga