Hi Aaron,

I'm not too sure about tracing and Flink. It's the first time I heard about
it in this context and I'm not immediately seeing the benefit of it.

What is imho more interesting and a well-formed discipline in the science
of data quality is a concept called data lineage. [1]

I can go quite deep into this topic, but I just skim over the most
important points.

* Ideally, the data processing framework would do the data lineage for you.
Flink doesn't (like none of the obvious competitors afaik, I only know
research prototypes that do) so you have to add additional information to
each record (blow up of record size) or store the lineage information in an
external system (which will make the whole setup usually bottlenecked at
the external system).
* It's worth adding some kind of header to all of your record types with
the same structure. Your best bet is using a format like Avro where you can
reference an external schema.
* You usually want to have a UID on each record as it enters the system.
The best thing is a long id from a database, but you might also resort to
generated UUID (byte form with limited debuggability) or string (adds up to
36 bytes per record).
* Each 1:1 transformation or filtering can retain this UID, but it also
makes sense to generate a new one.
* Each 1:n fan-out transformate, generate new UID.
* Each aggregation, generate new UID.
* When new UID, add UIDs from all input records in some kind of
origins/sources field as an array.
* If the pipeline uses temporary records where you would need to trace
different origin UIDs (two aggregations in the same pipeline), you want to
have multiple layers of origins/sources in the form of a two-dimensional
array or map of arrays.
* If you want to keep the lineage across different pipelines, add producer
name and version to the final records (version is really useful for
monkey-patching errors anyways).
* If you want to trace latency across pipelines, add original timestamps.

As you can guess, for smaller records, the header may easily be larger than
the actual message. Thus, if it's just for debugging, I'd add some options
to the pipeline to skip header processing/generation. If it's for auditing
purposes, you probably have to live with it.

One nice alternative that is possible in Flink, is to work with auxillary
records going out in secondary outputs. So instead of embedding the header
in the record, generate it as a different record going into secondary
storages. Of course, that still requires all records to have UIDs.

Let me know if I misunderstood your original question or if you want to
delve deeper.

[1] https://en.wikipedia.org/wiki/Data_lineage

On Sat, Aug 15, 2020 at 12:26 AM Aaron Levin <aaronle...@stripe.com> wrote:

> Hello Flink Friends!
>
> This is a long-shot, but I'm wondering if anyone is thinking or working on
> applying tracing to Streaming systems and in particular Flink. As far as I
> understand this is a fairly open problem and so I'm curious how folks are
> thinking about it and if anyone has considered how they might apply tracing
> to Flink systems.
>
> Some patterns in Streaming systems fit into tracing fairly easily
> (consumer fanout-out, for example) but many patterns do not. For example,
> how do you trace when there is batching or aggregations? Nevertheless, I'm
> sure some folks have thought about this or even tried to implement
> solutions, and so I'd love to hear about this. Especially if there are any
> standards work in this direction (for example, within the OpenTracing
> project).
>
> If you've thought about this, implemented something, or are working on
> standards related to this, I'd love to hear from you! Thank you!
>
> Best,
>
> Aaron Levin
>


-- 

Arvid Heise | Senior Java Developer

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng

Reply via email to