[
https://issues.apache.org/jira/browse/IMPALA-13237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18015520#comment-18015520
]
ASF subversion and git services commented on IMPALA-13237:
----------------------------------------------------------
Commit 3910e924d406709419f654eb5af30e22da11f9c5 in impala's branch
refs/heads/master from jasonmfehr
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3910e924d ]
IMPALA-13237: [Patch 7] - Lock ClientRequestState during Opentelemetry Traces
Updates the SpanManager class so it takes the ClientRequestState lock
when reading from that object.
Updates startup flag otel_trace_span_processor to be hidden. Manual
testing revealed that setting this flag to "simple" (which uses
SimpleSpanProcessor when forwarding OpenTelemetry traces) causes the
SpanManager object to block until the destination OpenTelemetry
collector receives the request and responds. Thus, network slowness
or an overloaded OpenTelemetry collector will block the entire query
processing flow, since SpanManager holds the ClientRequestState lock
for the duration of its communication with the collector. Because the
SimpleSpanProcessor remains useful in testing, the flag was changed
to hidden to avoid incorrect usage in production.
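For context, the difference between the two processors can be seen in
how they are constructed with the opentelemetry-cpp SDK. The sketch
below is illustrative only; the helper (MakeSpanProcessor) and the
choice of the OTLP HTTP exporter are assumptions, not Impala's actual
wiring:

    #include <memory>
    #include <string>

    #include "opentelemetry/exporters/otlp/otlp_http_exporter_factory.h"
    #include "opentelemetry/sdk/trace/batch_span_processor_factory.h"
    #include "opentelemetry/sdk/trace/batch_span_processor_options.h"
    #include "opentelemetry/sdk/trace/processor.h"
    #include "opentelemetry/sdk/trace/simple_processor_factory.h"

    namespace sdktrace = opentelemetry::sdk::trace;
    namespace otlp = opentelemetry::exporter::otlp;

    // Hypothetical helper mirroring the otel_trace_span_processor flag.
    std::unique_ptr<sdktrace::SpanProcessor> MakeSpanProcessor(
        const std::string& mode) {
      auto exporter = otlp::OtlpHttpExporterFactory::Create();
      if (mode == "simple") {
        // SimpleSpanProcessor exports synchronously: ending a span
        // blocks until the collector responds, which is why "simple"
        // is unsafe in production.
        return sdktrace::SimpleSpanProcessorFactory::Create(
            std::move(exporter));
      }
      // BatchSpanProcessor queues spans and exports them from a
      // background thread, so span start/end never waits on the network.
      sdktrace::BatchSpanProcessorOptions options;
      return sdktrace::BatchSpanProcessorFactory::Create(
          std::move(exporter), options);
    }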
Previously, when generating span attribute values on OpenTelemetry
traces for queries, data was read from ClientRequestState without
holding its lock, even though the documentation in
client-request-state.h specifically states that reading most fields
requires holding it.
An examination of the opentelemetry-cpp SDK code revealed that the
ClientRequestState lock must be held until the StartSpan() and
EndSpan() functions complete, because span attribute keys and values
are deep copied from the source nostd::string_view objects during
these functions.
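As a concrete illustration of this locking pattern (the struct below
is a simplified stand-in for Impala's ClientRequestState, and its
fields and the span attributes are hypothetical):

    #include <mutex>
    #include <string>

    #include "opentelemetry/nostd/shared_ptr.h"
    #include "opentelemetry/trace/tracer.h"

    // Simplified stand-in for Impala's ClientRequestState; the real
    // class has many more fields, most guarded by its lock.
    struct ClientRequestStateSketch {
      std::mutex lock;
      std::string query_id;    // guarded by 'lock'
      std::string default_db;  // guarded by 'lock'
    };

    void EmitQuerySpan(
        ClientRequestStateSketch& crs,
        opentelemetry::nostd::shared_ptr<opentelemetry::trace::Tracer>
            tracer) {
      std::lock_guard<std::mutex> l(crs.lock);
      // The attribute values are nostd::string_views pointing into
      // fields guarded by the lock. StartSpan() deep-copies keys and
      // values before returning, so the lock must stay held across
      // the call.
      auto span = tracer->StartSpan(
          "QuerySubmitted",
          {{"QueryId", crs.query_id}, {"DefaultDb", crs.default_db}});
      span->End();
    }  // Lock released only after the span calls have completed.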
Testing was accomplished by running the test_otel_trace.py custom
cluster tests as regression tests. Additionally, manual testing with
intentionally delayed network communication to an OpenTelemetry
collector demonstrated that the StartSpan() and EndSpan() functions
do not block waiting on the OpenTelemetry collector when the batch
span processor is used, but do block when the simple span processor
is used.
Additionally, a cause of flaky tests was addressed. The custom
cluster tests wait until the JSON objects for all traces have been
written to the output file. Since each trace's JSON object is written
on its own line, this wait is implemented by checking the number of
lines in the output file. Occasionally, a trace was only partially
written when the check ran; the partial line satisfied the line count
check, but the assertion code then failed because a partial JSON
object cannot be loaded. The fix is to wait both for the expected
line count and for the last line to end with a newline character,
ensuring that the JSON representing each trace is fully written to
the file before the assert code loads it.
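The actual fix lives in the Python custom cluster tests; for
illustration, the readiness condition can be sketched in C++ as
follows (the function name and parameters are hypothetical):

    #include <fstream>
    #include <string>

    // A trace file is "ready" only when it contains the expected
    // number of newline-terminated lines. A partially written final
    // trace either fails the count check or lacks the trailing '\n'.
    bool TraceFileReady(const std::string& path, int expected_lines) {
      std::ifstream in(path, std::ios::binary);
      if (!in.is_open()) return false;
      int complete_lines = 0;
      char last = '\0';
      for (char c; in.get(c);) {
        if (c == '\n') ++complete_lines;
        last = c;
      }
      return complete_lines == expected_lines && last == '\n';
    }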
Generated-by: Github Copilot (Claude Sonnet 3.7)
Change-Id: I649bdb6f88176995d45f7d10db898188bbe0b609
Reviewed-on: http://gerrit.cloudera.org:8080/23294
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Send Query Lifecycle Traces to OTel
> -----------------------------------
>
> Key: IMPALA-13237
> URL: https://issues.apache.org/jira/browse/IMPALA-13237
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend, Frontend
> Reporter: Jason Fehr
> Assignee: Jason Fehr
> Priority: Critical
> Labels: observability
>
> Throughout the lifecycle of a query, several events happen. Implement OTel
> traces where each span is one step in the query lifecycle.
> These traces will be sent to OTel systems using the OTel SDK.