+dev <d...@beam.apache.org>

I don't have a ton of time to dig into this, but I wanted to say that this is very cool and just drop a couple of pointers (which you may already know about), like Explaining Outputs in Modern Data Analytics [1], which was covered by The Morning Paper [2]. This just happens to be something I read a few years ago - following citations of the paper or pingbacks on the blog post yields a lot more work, some of which may be helpful.

There seems to be a slight difference of emphasis between tracing in an arbitrary distributed system versus explaining big data results. I would expect general tracing (which Jaeger is?) to be more complex and expensive to run, but that's just an intuition.
Kenn

[1] http://www.vldb.org/pvldb/vol9/p1137-chothia.pdf
[2] https://blog.acolyer.org/2017/02/01/explaining-outputs-in-modern-data-analytics/

On Fri, Apr 17, 2020 at 10:56 AM Rion Williams <rionmons...@gmail.com> wrote:

> Hi Alexey,
>
> I think you're right about the wrapper; it's likely unnecessary, as I think I'd have enough information in the headers to rehydrate my "tracer" that communicates the traces/spans to Jaeger as needed. I'd love to not have to touch those or muddy the waters with a wrapper class, additional conversion steps, custom coder, etc.
>
> Speaking of conversions, I agree entirely with the unified interface for reading/writing to Kafka. I'll openly admit I spent far too long fighting with it before discovering that `withoutMetadata()` existed. So if those were unified and writeRecords could accept a Kafka one, that'd be great.
>
> On Apr 17, 2020, at 12:47 PM, Alexey Romanenko <aromanenko....@gmail.com> wrote:
> >
> > Hi Rion,
> >
> > In general, yes, it sounds reasonable to me. I just do not see why you need to have an extra Traceable wrapper. Do you need to keep some temporary information there that you don't want to store in Kafka record headers?
> >
> > PS: Now I started to think that we probably have to change the interface of KafkaIO.writeRecords() from ProducerRecord to the same KafkaRecord as we use for read. In that case we won't expose the Kafka API and will use only our own wrapper. Also, users won't need to convert types between Read and Write (like in this topic's case).
> >
> >> On 17 Apr 2020, at 19:28, Rion Williams <rionmons...@gmail.com> wrote:
> >>
> >> Hi Alexey,
> >>
> >> So this is currently the approach that I'm taking: basically creating a wrapper Traceable<K,V> class that will contain all of my record information as well as the data necessary to update the traces for that record. It requires an extra step and will likely mean persisting something alongside each record as it comes in, but I'm not sure if there's another way around it.
> >>
> >> My current approach goes something like this:
> >> - Read records via KafkaIO (with metadata).
> >> - Apply a transform to convert all KafkaRecord<K,V> into Traceable<K,V> instances (which just contain a Tracer object as well as the original KV<K,V> record itself).
> >> - Pass this Traceable through all of the appropriate transforms, creating new spans for the trace as necessary via the tracer element on the Traceable<K,V> object.
> >> - Prior to output to a Kafka topic, transform the Traceable<K,V> object into a ProducerRecord<K,V> that contains the key, value, and headers (from the tracer) prior to writing back to Kafka.
> >>
> >> I think this will work, but it'll likely take quite a bit of experimentation to verify. Does this sound reasonable?
> >>
> >> Thanks,
> >>
> >> Rion
> >>
> >>> On 2020/04/17 17:14:58, Alexey Romanenko <aromanenko....@gmail.com> wrote:
> >>> Not sure if it will help, but KafkaIO allows you to keep all meta information while reading (using KafkaRecord) and writing (using ProducerRecord).
> >>> So you can keep your tracing id in the record headers as you did with Kafka Streams.
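To make that header-based flow concrete, here is a rough sketch in the Beam Java SDK: read without `withoutMetadata()` so elements are KafkaRecord<K,V> (with headers), carry the tracing header through a ParDo, and emit ProducerRecords to `writeRecords()`. The topic names, bootstrap servers, and the `CarryTraceHeaderFn` class are illustrative rather than anything from the thread; `uber-trace-id` is Jaeger's usual propagation header, and the actual Jaeger extract/inject calls are only indicated in comments.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.ProducerRecordCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TracedKafkaPipeline {

  /** Copies the Jaeger tracing header from the consumed record onto the record we produce. */
  static class CarryTraceHeaderFn
      extends DoFn<KafkaRecord<String, String>, ProducerRecord<String, String>> {
    @ProcessElement
    public void processElement(
        @Element KafkaRecord<String, String> record,
        OutputReceiver<ProducerRecord<String, String>> out) {
      // Jaeger propagates its span context in the "uber-trace-id" header by default.
      Header trace = record.getHeaders().lastHeader("uber-trace-id");
      List<Header> headers = new ArrayList<>();
      if (trace != null) {
        // A tracer.extract(...) / child-span / tracer.inject(...) round trip would go here;
        // for this sketch we simply forward the header unchanged.
        headers.add(trace);
      }
      out.output(new ProducerRecord<>(
          "enriched-events",          // destination topic (illustrative)
          null,                       // partition: let Kafka decide
          record.getKV().getKey(),
          record.getKV().getValue(),
          headers));
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    pipeline
        .apply("ReadWithMetadata", KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("raw-events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class))
        // No withoutMetadata() here, so elements stay KafkaRecord<K, V> and keep their headers.
        .apply("CarryTraceHeader", ParDo.of(new CarryTraceHeaderFn()))
        .setCoder(ProducerRecordCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
        .apply("WriteWithHeaders", KafkaIO.<String, String>writeRecords()
            .withBootstrapServers("kafka:9092")
            .withTopic("enriched-events")
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(StringSerializer.class));
    pipeline.run();
  }
}
```

Because a downstream pipeline reads the same header off its source topic, the trace can be continued across pipelines exactly as described below, without any wrapper type in the element itself.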
> >>>> On 17 Apr 2020, at 18:58, Rion Williams <rionmons...@gmail.com> wrote:
> >>>>
> >>>> Hi Alex,
> >>>>
> >>>> As mentioned before, I'm in the process of migrating a pipeline of several Kafka Streams applications over to Apache Beam, and I'm hoping to leverage the tracing infrastructure that I had established using Jaeger whenever I can, but specifically to trace an element as it flows through a pipeline or potentially multiple pipelines.
> >>>>
> >>>> An example might go something like this:
> >>>>
> >>>> - An event is produced from some service and sent to a Kafka topic (with a tracing id in the headers).
> >>>> - The event enters my pipeline (Beam reads from that topic), which begins applying a series of transforms that evaluate the element itself (e.g. does it have any users associated with it, IP addresses, or other interesting information).
> >>>> - When interesting information is encountered on the element (or errors), I'd like to be able to associate it with the trace (e.g. a user was found, this is some information about the user, this is the unique identifier associated with them, or there was an error because the user had a malformed e-mail address).
> >>>> - The traces themselves would be cumulative, so if an event was processed through one pipeline, it would contain all the necessary tracing headers in the message; if another pipeline picked it up from its destination topic (e.g. the destination of the first pipeline), the trace could be continued.
> >>>>
> >>>> I think that being able to pick up interactions with external systems would be a nice-to-have (e.g. this record is being sent to Parquet, Mongo, or some other topic), but I'm just trying to focus on being able to add to the trace at the ParDo/element level for now.
> >>>>
> >>>> I hope that helps.
> >>>>
> >>>> Rion
> >>>>
> >>>> On 2020/04/17 16:30:14, Alex Van Boxel <a...@vanboxel.be> wrote:
> >>>>> Can you explain a bit more of what you want to achieve here?
> >>>>>
> >>>>> Do you want to trace how your elements go through the pipeline, or do you want to see how every ParDo interacts with external systems?
> >>>>>
> >>>>> On Fri, Apr 17, 2020, 17:38 Rion Williams <rionmons...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I'm reaching out today to inquire whether Apache Beam has any support or mechanisms for some type of distributed tracing, similar to something like Jaeger (https://www.jaegertracing.io/). Jaeger itself has a Java SDK; however, due to the nature of Beam working with transforms that yield immutable collections, I wasn't sure what the best avenue would be to correlate various transforms against a particular element.
> >>>>>>
> >>>>>> Coming from a Kafka Streams background, this process was pretty trivial, as I'd simply store my correlation identifier within the message headers and those would be persisted as the element traveled through Kafka into various applications and topics. I'm hoping to still leverage some of that in Beam if at all possible, or see what, if any, recommended approaches are out there.
> >>>>>>
> >>>>>> My current approach involves the creation of a "Tracing Context", which would just be a wrapper for each of my elements that carries its associated trace with it: instead of just passing around a PCollection<X>, I would use a PCollection<Traceable<X>> that wraps the tracer and the underlying element, so that I could access the tracer during any element-wise operations in the pipeline.
> >>>>>>
> >>>>>> Any recommendations or suggestions are more than welcome! I'm very new to the Beam ecosystem, so I'd love to leverage anything out there that might keep me from reinventing the wheel.
> >>>>>>
> >>>>>> Thanks much!
> >>>>>>
> >>>>>> Rion
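As a companion to the earlier KafkaIO sketch, here is a minimal sketch of what the Traceable<X> wrapper described above might look like. It assumes the trace context is carried as a serializable map of header values (e.g. the "uber-trace-id" string) rather than as a live Tracer or Span, since those aren't serializable and would instead be rehydrated inside each DoFn (typically in @Setup). The class and method names are hypothetical.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical wrapper pairing an element with its propagated trace context.
 * The context holds header values as strings so the whole object stays serializable;
 * a DoFn would rebuild a span from it (tracer.extract(...)), add tags or child spans,
 * and inject the updated context back before emitting downstream. In a real pipeline
 * this type also needs a Coder (SerializableCoder works when T is Serializable).
 */
public class Traceable<T> implements Serializable {

  private final T element;
  private final Map<String, String> traceContext;

  public Traceable(T element, Map<String, String> traceContext) {
    this.element = element;
    this.traceContext = new HashMap<>(traceContext);
  }

  public T getElement() {
    return element;
  }

  public Map<String, String> getTraceContext() {
    return traceContext;
  }

  /** Returns a copy with an updated trace context, keeping elements effectively immutable. */
  public Traceable<T> withTraceContext(Map<String, String> updatedContext) {
    return new Traceable<>(element, updatedContext);
  }
}
```

A ParDo over PCollection<Traceable<KV<String, String>>> could then open a child span per step from getTraceContext(), emit withTraceContext(updated) so downstream transforms see the continued trace, and finally copy the context into ProducerRecord headers before KafkaIO.writeRecords(), which is the conversion step discussed above.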