+dev <d...@beam.apache.org>

I don't have a ton of time to dig into this, but I wanted to say that this is very cool and just drop a couple of pointers (which you may already know about), like Explaining Outputs in Modern Data Analytics [1], which was covered by The Morning Paper [2]. This just happens to be something I read a few years ago - following citations of the paper or pingbacks on the blog post yields a lot more work, some of which may be helpful.

There seems to be a slight difference of emphasis between tracing in an arbitrary distributed system versus explaining big data results. I would expect general tracing (which Jaeger is?) to be more complex and expensive to run, but that's just an intuition.
Kenn

[1] http://www.vldb.org/pvldb/vol9/p1137-chothia.pdf
[2] https://blog.acolyer.org/2017/02/01/explaining-outputs-in-modern-data-analytics/

On Fri, Apr 17, 2020 at 10:56 AM Rion Williams <rionmons...@gmail.com> wrote:

> Hi Alexey,
>
> I think you're right about the wrapper; it's likely unnecessary, as I think I'd have enough information in the headers to rehydrate my "tracer" that communicates the traces/spans to Jaeger as needed. I'd love to not have to touch those or muddy the waters with a wrapper class, additional conversion steps, custom coder, etc.
>
> Speaking of conversions, I agree entirely with the unified interface for reading/writing to Kafka. I'll openly admit I spent far too long fighting with it before discovering that `withoutMetadata()` existed. So if those were unified and writeRecords could accept a Kafka one, that'd be great.
>
> On Apr 17, 2020, at 12:47 PM, Alexey Romanenko <aromanenko....@gmail.com> wrote:
> >
> > Hi Rion,
> >
> > In general, yes, it sounds reasonable to me. I just do not see why you need to have an extra Traceable wrapper. Do you need to keep some temporary information there that you don't want to store in Kafka record headers?
> >
> > PS: Now I started to think that we probably have to change the interface of KafkaIO.writeRecords() from ProducerRecord to the same KafkaRecord as we use for read. In that case we won't expose the Kafka API and will use only our own wrapper. Also, users won't need to convert types between Read and Write (like in this topic's case).
> >
> >> On 17 Apr 2020, at 19:28, Rion Williams <rionmons...@gmail.com> wrote:
> >>
> >> Hi Alexey,
> >>
> >> So this is currently the approach that I'm taking: basically creating a wrapper Traceable<K,V> class that will contain all of my record information as well as the data necessary to update the traces for that record. It requires an extra step and will likely mean persisting something alongside each record as it comes in, but I'm not sure if there's another way around it.
> >>
> >> My current approach goes something like this:
> >> - Read records via KafkaIO (with metadata).
> >> - Apply a transform to convert all KafkaRecord<K,V> into Traceable<K,V> instances (which just contain a Tracer object as well as the original KV<K,V> record itself).
> >> - Pass this Traceable through all of the appropriate transforms, creating new spans for the trace as necessary via the tracer element on the Traceable<K,V> object.
> >> - Prior to output to a Kafka topic, transform the Traceable<K,V> object into a ProducerRecord<K,V> that contains the key, value, and headers (from the tracer) prior to writing back to Kafka.
> >>
> >> I think this will work, but it'll likely take quite a bit of experimentation to verify. Does this sound reasonable?
> >>
> >> Thanks,
> >>
> >> Rion
> >>
> >>> On 2020/04/17 17:14:58, Alexey Romanenko <aromanenko....@gmail.com> wrote:
> >>> Not sure if it will help, but KafkaIO allows you to keep all meta information while reading (using KafkaRecord) and writing (using ProducerRecord).
> >>> So you can keep your tracing id in the record headers as you did with Kafka Streams.
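To make that header-based flow concrete, here is a rough sketch in the Beam Java SDK: read without `withoutMetadata()` so elements are KafkaRecord<K,V> (with headers), carry the tracing header through a ParDo, and emit ProducerRecords to `writeRecords()`. The topic names, bootstrap servers, and the `CarryTraceHeaderFn` class are illustrative rather than anything from the thread; `uber-trace-id` is Jaeger's usual propagation header, and the actual Jaeger extract/inject calls are only indicated in comments.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.ProducerRecordCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TracedKafkaPipeline {

  /** Copies the Jaeger tracing header from the consumed record onto the record we produce. */
  static class CarryTraceHeaderFn
      extends DoFn<KafkaRecord<String, String>, ProducerRecord<String, String>> {
    @ProcessElement
    public void processElement(
        @Element KafkaRecord<String, String> record,
        OutputReceiver<ProducerRecord<String, String>> out) {
      // Jaeger propagates its span context in the "uber-trace-id" header by default.
      Header trace = record.getHeaders().lastHeader("uber-trace-id");
      List<Header> headers = new ArrayList<>();
      if (trace != null) {
        // A tracer.extract(...) / child-span / tracer.inject(...) round trip would go here;
        // for this sketch we simply forward the header unchanged.
        headers.add(trace);
      }
      out.output(new ProducerRecord<>(
          "enriched-events",          // destination topic (illustrative)
          null,                       // partition: let Kafka decide
          record.getKV().getKey(),
          record.getKV().getValue(),
          headers));
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    pipeline
        .apply("ReadWithMetadata", KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("raw-events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class))
        // No withoutMetadata() here, so elements stay KafkaRecord<K, V> and keep their headers.
        .apply("CarryTraceHeader", ParDo.of(new CarryTraceHeaderFn()))
        .setCoder(ProducerRecordCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
        .apply("WriteWithHeaders", KafkaIO.<String, String>writeRecords()
            .withBootstrapServers("kafka:9092")
            .withTopic("enriched-events")
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(StringSerializer.class));
    pipeline.run();
  }
}
```

Because a downstream pipeline reads the same header off its source topic, the trace can be continued across pipelines exactly as described below, without any wrapper type in the element itself.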
> >>>> On 17 Apr 2020, at 18:58, Rion Williams <rionmons...@gmail.com> wrote:
> >>>>
> >>>> Hi Alex,
> >>>>
> >>>> As mentioned before, I'm in the process of migrating a pipeline of several Kafka Streams applications over to Apache Beam, and I'm hoping to leverage the tracing infrastructure that I had established using Jaeger whenever I can, but specifically to trace an element as it flows through a pipeline or potentially multiple pipelines.
> >>>>
> >>>> An example might go something like this:
> >>>>
> >>>> - An event is produced from some service and sent to a Kafka topic (with a tracing id in the headers).
> >>>> - The event enters my pipeline (Beam reads from that topic), which begins applying a series of transforms that evaluate the element itself (e.g. does it have any users associated with it, IP addresses, or other interesting information).
> >>>> - When interesting information is encountered on the element (or errors), I'd like to be able to associate it with the trace (e.g. a user was found, this is some information about the user, this is the unique identifier associated with them, or there was an error because the user had a malformed e-mail address).
> >>>> - The traces themselves would be cumulative, so if an event was processed through one pipeline, it would contain all the necessary tracing headers in the message; if another pipeline picked it up from its destination topic (e.g. the destination of the first pipeline), the trace could be continued.
> >>>>
> >>>> I think that being able to pick up interactions with external systems would be a nice-to-have (e.g. this record is being sent to Parquet, Mongo, or some other topic), but I'm just trying to focus on being able to add to the trace at the ParDo/element level for now.
> >>>>
> >>>> I hope that helps.
> >>>>
> >>>> Rion
> >>>>
> >>>> On 2020/04/17 16:30:14, Alex Van Boxel <a...@vanboxel.be> wrote:
> >>>>> Can you explain a bit more of what you want to achieve here?
> >>>>>
> >>>>> Do you want to trace how your elements go through the pipeline, or do you want to see how every ParDo interacts with external systems?
> >>>>>
> >>>>> On Fri, Apr 17, 2020, 17:38 Rion Williams <rionmons...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I'm reaching out today to inquire whether Apache Beam has any support or mechanisms for some type of distributed tracing, similar to something like Jaeger (https://www.jaegertracing.io/). Jaeger itself has a Java SDK; however, due to the nature of Beam working with transforms that yield immutable collections, I wasn't sure what the best avenue would be to correlate various transforms against a particular element.
> >>>>>>
> >>>>>> Coming from a Kafka Streams background, this process was pretty trivial, as I'd simply store my correlation identifier within the message headers and those would be persisted as the element traveled through Kafka into various applications and topics. I'm hoping to still leverage some of that in Beam if at all possible, or see what, if any, recommended approaches are out there.
> >>>>>>
> >>>>>> My current approach involves the creation of a "Tracing Context", which would just be a wrapper for each of my elements that carries its associated trace with it: instead of just passing around a PCollection<X>, I would use a PCollection<Traceable<X>> that wraps the tracer and the underlying element, so that I could access the tracer during any element-wise operations in the pipeline.
> >>>>>>
> >>>>>> Any recommendations or suggestions are more than welcome! I'm very new to the Beam ecosystem, so I'd love to leverage anything out there that might keep me from reinventing the wheel.
> >>>>>>
> >>>>>> Thanks much!
> >>>>>>
> >>>>>> Rion
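As a companion to the earlier KafkaIO sketch, here is a minimal sketch of what the Traceable<X> wrapper described above might look like. It assumes the trace context is carried as a serializable map of header values (e.g. the "uber-trace-id" string) rather than as a live Tracer or Span, since those aren't serializable and would instead be rehydrated inside each DoFn (typically in @Setup). The class and method names are hypothetical.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical wrapper pairing an element with its propagated trace context.
 * The context holds header values as strings so the whole object stays serializable;
 * a DoFn would rebuild a span from it (tracer.extract(...)), add tags or child spans,
 * and inject the updated context back before emitting downstream. In a real pipeline
 * this type also needs a Coder (SerializableCoder works when T is Serializable).
 */
public class Traceable<T> implements Serializable {

  private final T element;
  private final Map<String, String> traceContext;

  public Traceable(T element, Map<String, String> traceContext) {
    this.element = element;
    this.traceContext = new HashMap<>(traceContext);
  }

  public T getElement() {
    return element;
  }

  public Map<String, String> getTraceContext() {
    return traceContext;
  }

  /** Returns a copy with an updated trace context, keeping elements effectively immutable. */
  public Traceable<T> withTraceContext(Map<String, String> updatedContext) {
    return new Traceable<>(element, updatedContext);
  }
}
```

A ParDo over PCollection<Traceable<KV<String, String>>> could then open a child span per step from getTraceContext(), emit withTraceContext(updated) so downstream transforms see the continued trace, and finally copy the context into ProducerRecord headers before KafkaIO.writeRecords(), which is the conversion step discussed above.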