This is an automated email from the ASF dual-hosted git repository.

pcongiusti pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/camel.git


The following commit(s) were added to refs/heads/main by this push:
     new ec86ef8aa3d feat: tracing redesign proposal
ec86ef8aa3d is described below

commit ec86ef8aa3d98831c9b7462faea61dad823c5163
Author: Pasquale Congiusti <[email protected]>
AuthorDate: Thu Jan 9 13:20:39 2025 +0100

    feat: tracing redesign proposal
---
 proposals/tracing.adoc | 119 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)

diff --git a/proposals/tracing.adoc b/proposals/tracing.adoc
new file mode 100644
index 00000000000..a79c8eed539
--- /dev/null
+++ b/proposals/tracing.adoc
@@ -0,0 +1,119 @@
+---
+title: Tracing redesign
+authors:
+  - "@squakez"
+reviewers: []
+approvers: []
+creation-date: 2025-01-08
+last-updated: 2025-01-08
+status: draft
+see-also: []
+replaces: []
+superseded-by: []
+---
+
+== Summary
+
+Tracing and telemetry features are a pillar of application observability, 
especially when applications are deployed in cloud environments or, more 
generally, in distributed systems. Over the last few years we have observed a 
growing demand for telemetry components, in particular the 
https://www.cncf.io/projects/opentelemetry/[CNCF project OpenTelemetry], when 
running Camel applications in cloud environments.
+
+== Motivation
+
+The growing usage of these components is also raising questions and exposing 
potential flaws in the current design of this feature. Recently we had to work 
on several issues to enhance the component or fix failing behaviors, and the 
code has become increasingly difficult to maintain. Most of the time we had to 
change the implementation, the abstraction and even the core dependencies to 
make things work: this is a symptom 
that we probably need to think on a  [...]
+
+== Goals
+
+The goal of this proposal is to analyze the current design and the challenges 
we are facing, and to propose an alternative design that is simpler to maintain 
in the long term.
+
+== Context
+
+The Camel framework originally had an abstract component, `camel-tracing`, 
whose goal was to define a generic tracing lifecycle to be implemented 
concretely by specific technologies, such as `camel-opentelemetry`. The 
abstraction should take care of generic concepts (such as when to create a new 
span according to the Camel eventing model). The implementation should take 
care of instantiating unique traces and provide the mechanics required to 
pull/push such traces to a 
trace collector system  [...]
+
+Any user who wants to enable the tracing feature is required to include the 
component dependency and any specific configuration. The framework then takes 
care of wiring the Camel activity to a collection of traces.
+
+I've performed a deep analysis over the last weeks, trying to figure out which 
major problems we need to tackle, and I came to the conclusion that the current 
design may require some review in order to lay the foundation for more robust 
long-term maintenance. What follows is a list of points that I think require 
attention when planning any future development.
+
+=== Unclear tracing scope specification
+
+We do not have a clear specification of what a **trace** or a **span** 
represents from the Camel point of view. We mostly think of it as a generic 
unit of work, without a clear definition of how it is bound to any Camel 
resource. There is no documentation on this, so users have to work out 
intuitively how a trace maps to the Camel domain model.
+
+=== Implementation details slipped into the abstraction
+
+In the past we introduced changes that required the abstraction to be aware 
of certain implementation details, such as `AutoCloseable` OpenTelemetry 
scopes. We also have features that lack the required abstraction, making them 
specific to the implementation (for example, OpenTelemetry processor traces).
+
+=== Ad hoc "side" features implementations
+
+The implementations we use offer their own way to expose certain "side" 
features, for example setting the trace IDs into MDC. However, we also have our 
own implementation, which is either conflicting or not working properly because 
it relies on context propagation, which is generally part of the 
tracing/telemetry implementation.
+
+=== Inconsistent context storage
+
+The abstraction (in `camel-tracing`) maintains a stack-based structure for 
each created span, which is stored in the *Exchange*. The data structure also 
maintains the hierarchical relationship between the different spans created 
during an Exchange execution. However, the implementation we have in 
`camel-opentelemetry` mixes this mechanism with its own storage mechanism, 
which is based on a Java ThreadLocal context. 
Additionally we have implemented a contex [...]
+
+=== Async exchange boundaries
+
+With the current design, Camel creates a new trace when it creates an Exchange 
and later adds a span for each process. However, when we create an asynchronous 
Exchange (e.g., with the wiretap EIP), it is considered part of the original 
Exchange, along with the whole execution of the new Exchange. The result in the 
trace collector tool is that the new Exchange overflows the execution of the 
source Exchange.
+
+== Proposal
+
+Before digging into the new design, we need to make an important consideration 
about how Camel works and how the main telemetry component we want to consider 
(OpenTelemetry) would require certain transformations. As mentioned in the 
"Inconsistent context storage" section, OpenTelemetry works on the assumption 
that any application can easily propagate the context across its threading 
model. This is not the case for Camel, 
above all because the system is very mu [...]
+
+The new design should not change how the core of the application works. We 
must be implementation agnostic, so the design should be flexible enough to 
adapt to any future implementation and avoid any major future refactoring.
+
+I advocate going back to the roots of the original abstract component, first 
of all by defining what a trace means for Camel (tracing **structure**). Then 
we should provide a clear and flexible **lifecycle** for the traces (creation, 
activation, ...): this is probably the abstract part we will need to delegate 
to **implementation specific** components. In order to avoid consistency 
problems, we should exclusively use the Exchange 
as a means to store and define t [...]
+
+=== Tracing structure
+
+Right now we have two main span levels: the root span, which is created for 
each execution of a Route, and a series of event spans, which are the various 
processes executed by the Route. We are missing a root span created when a new 
Exchange is created. With it we can easily trace the activity of each 
Exchange, identifying each trace by the Exchange ID. According to this 
proposal, the new trace structure will be composed of:
+
+1. a root trace for each Exchange (identified uniquely)
+2. one or more spans for each Route of that Exchange execution
+3. one or more spans for each event executed in the Route
+
+With this structure we would also be able to capture any Exchange generated 
asynchronously and trace its execution separately from the original parent 
trace (i.e., solving the "Async exchange boundaries" problem), since the newly 
generated Exchange has a different ID.
+
+=== Tracing lifecycle
+
+The `camel-tracing` component should be the one in charge of managing the 
trace lifecycle. Any implementation specific behavior has to adapt to this 
lifecycle, likely by implementing the required logic in the abstract methods 
exposed by the component. At this stage of the design, we can identify those 
functions as:
+
+* Span creation
+* Span activation
+* Span deactivation
+* Span closure
+
+The **creation** method would be in charge of creating a new root trace or a 
new span within an existing trace. The **activation** method is the one in 
charge of telling the tracing system that a given span is the active one at a 
given moment. The **deactivation** method should be the one used to turn a 
given span off. The **closure** method is finally the one in charge of 
finalizing a given span, and the trace when applicable.
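+
The four operations above could be sketched as a small Java contract. This is 
only an illustration under assumptions: `SketchSpan`, `SketchSpanLifecycle` and 
`UuidSpanLifecycle` are hypothetical names that exist neither in 
`camel-tracing` nor in the POC, and the implementation only generates UUIDs and 
flips flags.

```java
import java.util.UUID;

/** Hypothetical span handle for this sketch; not an existing Camel type. */
final class SketchSpan {
    final String name;
    final String traceId;
    final String spanId;
    final String parentSpanId; // null when the span is the root of its trace
    boolean active;
    boolean closed;

    SketchSpan(String name, String traceId, String spanId, String parentSpanId) {
        this.name = name;
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }
}

/** Hypothetical contract exposing the four lifecycle operations. */
interface SketchSpanLifecycle {
    SketchSpan create(String name, SketchSpan parent);
    void activate(SketchSpan span);
    void deactivate(SketchSpan span);
    void close(SketchSpan span);
}

/** Trivial implementation: no collector involved, only IDs and state flags. */
final class UuidSpanLifecycle implements SketchSpanLifecycle {
    @Override
    public SketchSpan create(String name, SketchSpan parent) {
        // A span without a parent starts a new trace; otherwise it joins the parent's trace.
        String traceId = parent == null ? UUID.randomUUID().toString() : parent.traceId;
        String parentId = parent == null ? null : parent.spanId;
        return new SketchSpan(name, traceId, UUID.randomUUID().toString(), parentId);
    }

    @Override
    public void activate(SketchSpan span) {
        span.active = true;
    }

    @Override
    public void deactivate(SketchSpan span) {
        span.active = false;
    }

    @Override
    public void close(SketchSpan span) {
        // Closing also deactivates, so a closed span can never remain active.
        span.active = false;
        span.closed = true;
    }
}
```

A technology specific component would provide its own implementation of the 
same four methods, while the decision about when to call them stays in the 
abstraction.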
+
+The above definition may feel redundant, as at the moment we probably only 
need a creation/activation method and a deactivation/closure method. However, 
in order to give more flexibility to the abstraction, we must make sure it can 
meet any future requirement of any tracing technology.
+
+This design is very similar to the original component design. However, we need 
to remove the implementation specific details from the abstraction entirely. 
It is also important that we rely entirely on the component storage to 
retrieve the current span and perform the required action on it. With this 
proposal we will also need to remove from the core components certain logic we 
introduced in the past in order to support some features (ie, 
`ExchangeAsyncProcessingStartedEvent` imple [...]
+
+Besides the span lifecycle, we will need to consider a few more aspects:
+
+* Span decoration
+* Context propagation
+
+The **span decoration** is a Camel specific way of enriching the traces of the 
different components we handle with component specific information. For 
example, when you use the Kafka component, the trace automatically includes 
useful information such as the offset or the partition. We already have this 
mechanism in place, and we should make sure to document this feature clearly.
+
+The **context propagation** is a way to correlate distributed traces with each 
other. It works by reading a `traceparent` header on the Exchange and using it 
to correlate to a chain of distributed requests. It's important to note that 
the specific propagation mechanism belongs to the implementation, so we will 
need to provide the required level of abstraction in the component.
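+
As an illustration of what reading a `traceparent` header involves, here is a 
minimal parser sketch. It assumes the W3C Trace Context layout for version `00` 
(`version-traceid-parentid-flags`, lowercase hex) and rejects anything else. 
`TraceParentSketch` is a hypothetical name; a real implementation would 
normally delegate this to the propagator of the tracing technology.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parsed traceparent value; the class name is made up for this sketch. */
final class TraceParentSketch {
    // Version "00": 32 hex chars of trace id, 16 hex chars of parent id, 2 hex chars of flags.
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    private TraceParentSketch(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or empty when it is missing or malformed. */
    static Optional<TraceParentSketch> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String parentSpanId = m.group(2);
        // All-zero identifiers are invalid according to the W3C specification.
        if (traceId.matches("0+") || parentSpanId.matches("0+")) {
            return Optional.empty();
        }
        // Bit 0 of the flags field is the "sampled" flag.
        int flags = Integer.parseInt(m.group(3), 16);
        return Optional.of(new TraceParentSketch(traceId, parentSpanId, (flags & 0x01) != 0));
    }
}
```

When the header is valid, the new span would be created in the trace 
identified by `traceId`, with `parentSpanId` as its parent. When it is missing 
or malformed, a new root trace would be started instead.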
+
+=== Tracing storage
+
+The Exchange stack storage already exists and it may suffice for the goals of 
this proposal. Again, we need to remove the implementation specific details 
from the abstraction and make sure, by design, that no implementation detail 
slips in in the future. One concern may be the correct handling of the opening 
and closure of spans, which may differ from one implementation to another. 
However, if the lifecycle we have in place takes care 
of consistency, this should [...]
+
+To clarify this aspect, let's take `camel-opentelemetry` as an example. When 
we call the *activation* method, we must make sure that the span passed is 
correctly activated, therefore calling the `span.makeCurrent()` method. The 
generated scope then has to be kept in the same span wrapper so that it can 
later be closed, via `scope.close()`, when the *closure* method is called. As 
each span wrapper is stored in the Exchange, then we can 
use this approach to maintain the stat [...]
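+
A minimal sketch of this idea follows, under stated assumptions: a plain 
`Map<String, Object>` stands in for the Exchange properties, a 
`Supplier<AutoCloseable>` stands in for `span.makeCurrent()`, and 
`SpanWrapper`, `ExchangeSpanStorage` and the property key are hypothetical 
names, not the ones used in `camel-tracing`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical wrapper keeping the scope produced on activation next to its span. */
final class SpanWrapper {
    final String name;
    private final Supplier<AutoCloseable> makeCurrent; // stand-in for span.makeCurrent()
    private AutoCloseable scope;

    SpanWrapper(String name, Supplier<AutoCloseable> makeCurrent) {
        this.name = name;
        this.makeCurrent = makeCurrent;
    }

    void activate() {
        scope = makeCurrent.get();
    }

    void close() {
        if (scope == null) {
            return; // never activated, or already closed
        }
        try {
            scope.close();
        } catch (Exception e) {
            throw new IllegalStateException("Could not close scope of span " + name, e);
        } finally {
            scope = null;
        }
    }
}

/** Hypothetical stack storage; the map stands in for the Exchange properties. */
final class ExchangeSpanStorage {
    private static final String KEY = "sketch.activeSpans"; // made-up property name

    @SuppressWarnings("unchecked")
    private static Deque<SpanWrapper> stack(Map<String, Object> exchangeProperties) {
        return (Deque<SpanWrapper>) exchangeProperties
                .computeIfAbsent(KEY, k -> new ArrayDeque<SpanWrapper>());
    }

    /** Activates the span and makes it the current one for this exchange. */
    static void push(Map<String, Object> exchangeProperties, SpanWrapper span) {
        span.activate();
        stack(exchangeProperties).push(span);
    }

    /** Returns the current span without removing it, or null when there is none. */
    static SpanWrapper current(Map<String, Object> exchangeProperties) {
        return stack(exchangeProperties).peek();
    }

    /** Removes the current span and closes the scope stored in its wrapper. */
    static SpanWrapper pop(Map<String, Object> exchangeProperties) {
        SpanWrapper span = stack(exchangeProperties).poll();
        if (span != null) {
            span.close();
        }
        return span;
    }
}
```

Because the scope lives in the wrapper and the wrapper lives in the Exchange, 
the abstraction never needs to know that a scope exists.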
+
+=== Tracing simple implementation (mock)
+
+If we move most of the logic into the abstraction, writing a simple 
implementation should be straightforward. We can expect this implementation to 
implement the abstract methods listed in the "Tracing lifecycle" section, which 
can be as simple as generating UUIDs and writing the trace information into MDC 
variables so that it is logged in the application log. 
No push/pull to any collector is expected and this implementation would serve 
more as a way to debug the abstracti [...]
+
+=== Tracing specific implementations
+
+The technology specific implementations should therefore be limited to 
implementing the abstract methods, as in the simple implementation. With this 
approach we reduce the maintenance of each specific technology to the bare 
minimum. With this proposal we will need to substantially reduce the code in 
the existing implementations (`camel-opentelemetry`).
+
+== Backward compatibility
+
+This design proposal may introduce certain breaking changes, which is why we 
must clarify the scope and plan the work in order to avoid introducing breaking 
changes in any non-major version. If we agree on this design, then we can work 
on an iterative development which has to be compatible with the existing 
specification.
+
+What will surely have to be developed in a major release is the "Tracing 
structure" part. Here we need to introduce a different trace organization than 
the one we have today. However, deferring this development to a major release 
does not block the rest of the work. The rest of the changes can probably be 
done as part of the regular minor release work.
+
+== Tracing refactoring POC
+
+To validate most of the above assumptions, I've developed a simple POC, which 
I used as a 
https://github.com/squakez/camel/tree/feat/tracing_refactoring[base for this 
proposal]. Testing it against some applications, we can see that traces are 
managed correctly and in line with the structure proposed in this document.
