Elek, Marton created HDDS-1636:
----------------------------------

             Summary: Tracing id is not propagated via async datanode grpc call
                 Key: HDDS-1636
                 URL: https://issues.apache.org/jira/browse/HDDS-1636
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
            Reporter: Elek, Marton


Recently a new exception became visible in the datanode logs when running the 
standard freon test (STANDALONE):

{code}
datanode_2  | 2019-06-03 12:18:21 WARN  PropagationRegistry$ExceptionCatchingExtractorDecorator:60 - Error when extracting SpanContext from carrier. Handling gracefully.
datanode_2  | io.jaegertracing.internal.exceptions.MalformedTracerStateStringException: String does not match tracer state format: 7576cabf-37a4-4232-9729-939a3fdb68c4WriteChunk150a8a848a951784256ca0801f7d9cf8b_stream_ed583cee-9552-4f1a-8c77-63f7d07b755f_chunk_1
datanode_2  |   at org.apache.hadoop.hdds.tracing.StringCodec.extract(StringCodec.java:49)
datanode_2  |   at org.apache.hadoop.hdds.tracing.StringCodec.extract(StringCodec.java:34)
datanode_2  |   at io.jaegertracing.internal.PropagationRegistry$ExceptionCatchingExtractorDecorator.extract(PropagationRegistry.java:57)
datanode_2  |   at io.jaegertracing.internal.JaegerTracer.extract(JaegerTracer.java:208)
datanode_2  |   at io.jaegertracing.internal.JaegerTracer.extract(JaegerTracer.java:61)
datanode_2  |   at io.opentracing.util.GlobalTracer.extract(GlobalTracer.java:143)
datanode_2  |   at org.apache.hadoop.hdds.tracing.TracingUtil.importAndCreateScope(TracingUtil.java:102)
datanode_2  |   at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:148)
datanode_2  |   at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:73)
datanode_2  |   at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:61)
datanode_2  |   at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:248)
datanode_2  |   at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
datanode_2  |   at org.apache.ratis.thirdparty.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)
datanode_2  |   at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
datanode_2  |   at org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:46)
datanode_2  |   at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:263)
datanode_2  |   at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:686)
datanode_2  |   at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
datanode_2  |   at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
datanode_2  |   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
datanode_2  |   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
{code}
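
For context, the Jaeger text propagation format encodes a span context as four 
colon-separated fields (trace id, span id, parent span id, flags). The sketch 
below is not the actual StringCodec code; TracerStateCheck and 
looksLikeTracerState are hypothetical names, and the check is purely 
structural. It only illustrates why the UUID-based value from the log above is 
rejected: it contains no colon at all. The "valid" value is a made-up example.

```java
// Hypothetical helper, not part of HDDS or Jaeger.
// It performs a simplified structural check only (no hex or flag parsing).
final class TracerStateCheck {

  private TracerStateCheck() {
  }

  /**
   * Returns true if the value has the shape traceId:spanId:parentId:flags,
   * i.e. exactly four non-empty, colon-separated fields.
   */
  static boolean looksLikeTracerState(String value) {
    if (value == null || value.isEmpty()) {
      return false;
    }
    // Limit -1 keeps trailing empty fields, so "a:b:c:" still yields 4 parts.
    String[] parts = value.split(":", -1);
    if (parts.length != 4) {
      return false;
    }
    for (String part : parts) {
      if (part.isEmpty()) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // The value reported in the datanode log: a random UUID plus chunk info.
    String fromLog = "7576cabf-37a4-4232-9729-939a3fdb68c4WriteChunk"
        + "150a8a848a951784256ca0801f7d9cf8b_stream_"
        + "ed583cee-9552-4f1a-8c77-63f7d07b755f_chunk_1";
    // A made-up value in the four-field format, for comparison.
    String wellFormed = "5b8aa5a2d2c872e8:5b8aa5a2d2c872e8:0:1";

    System.out.println(looksLikeTracerState(fromLog));    // prints false
    System.out.println(looksLikeTracerState(wellFormed)); // prints true
  }
}
```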

It turned out that the tracing id propagation between the XceiverClient and 
the server does not work properly (for Standalone and async commands):

 1. There are many places (on the client side) where the traceId is filled 
with UUID.randomUUID().toString().
 2. This random id is propagated between the Output/InputStream and different 
parts of the client.
 3. This is unnecessary, because in XceiverClientGrpc and XceiverClientRatis 
the traceId field is overridden with the real opentracing id anyway 
(sendCommand/sendCommandAsync).
 4. The exception is XceiverClientGrpc.sendCommandAsync, where this override 
is accidentally missing.

Things to fix:

 1. Fix XceiverClientGrpc.sendCommandAsync (replace any existing traceId with 
the correct one).
 2. Remove the usage of the UUID-based traceId (it is not used).
 3. Improve the error logging on the server side in case of an invalid traceId.
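
A minimal sketch of the first fix, under stated assumptions: every type below 
(AsyncTraceIdSketch, Request, withTraceId, the Supplier used to export the 
current span) is a hypothetical stand-in, not the real XceiverClientGrpc or 
protobuf API. It only shows the idea that the async send path should replace 
whatever traceId the caller supplied with the exported tracing context before 
the request leaves the client.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// All types here are hypothetical stand-ins, not the real HDDS classes.
final class AsyncTraceIdSketch {

  /** Simplified, immutable stand-in for a container command request. */
  static final class Request {
    final String traceId;
    final String payload;

    Request(String traceId, String payload) {
      this.traceId = traceId;
      this.payload = payload;
    }

    /** Returns a copy of this request carrying the given traceId. */
    Request withTraceId(String newTraceId) {
      return new Request(newTraceId, payload);
    }
  }

  // Assumed to return the serialized context of the currently active span.
  private final Supplier<String> currentSpanExporter;

  AsyncTraceIdSketch(Supplier<String> currentSpanExporter) {
    this.currentSpanExporter = currentSpanExporter;
  }

  /**
   * Overrides any caller-supplied traceId, then hands the request off
   * asynchronously. The future completes with the request as it was sent.
   */
  CompletableFuture<Request> sendCommandAsync(Request request) {
    // The exporter is called here, on the calling thread, before the
    // hand-off, because the active span is typically tracked per thread.
    Request traced = request.withTraceId(currentSpanExporter.get());
    return CompletableFuture.supplyAsync(() -> traced);
  }

  public static void main(String[] args) {
    // Made-up span context in the four-field Jaeger text format.
    AsyncTraceIdSketch client = new AsyncTraceIdSketch(() -> "1a:2b:0:1");
    Request original = new Request("some-random-uuid", "WriteChunk");
    Request sent = client.sendCommandAsync(original).join();
    System.out.println(sent.traceId); // prints 1a:2b:0:1
  }
}
```

With the override in place on every send path, the client-side UUID value 
never reaches the server, which is why the second fix (removing it) is safe.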


