I have not tried to do this, but as we (Cloudera) deal with more and more performance-related problems, I feel something like this is needed.

It is a tricky problem due to the number of requests the NN handles and how performance sensitive it is. At the IPC Server level, we should be able to know the request queue time, processing time, response queue time and the type of request. If we sampled X% of requests and then emitted one log line per interval (e.g. per minute), we could perhaps build a histogram of queue size, queue times and processing times per request type.

From JMX, we can get the request counts and queue length, but I am not sure whether we can get something like percentiles or queue time and processing time over the previous minute, for example.

Even given the above details, if we see a long queue length, the cause of that queue may still remain a mystery. Often it is due to a long-running request (e.g. contentSummary, snapshotDiff etc.) holding the NN lock in write mode for too long. What would be very useful is a way to see the percentage of time the NN lock is held in exclusive mode (write), in shared mode (read), or not held at all (rare on a busy cluster). Even better if we can somehow bubble up the top requests holding the lock in exclusive mode. Perhaps sampling the time spent waiting to acquire the lock could be useful too.

I also think it would be useful to expose response times from the client perspective. https://issues.apache.org/jira/browse/HDFS-14084 seemed interesting and could be worth finishing. I also found https://issues.apache.org/jira/browse/HDFS-12861 some time back, which gets the client to log data read speeds.

Have you made any attempts in this area so far, and did you have any success?

Thanks,
Stephen.

On Thu, Mar 18, 2021 at 5:41 AM Fengnan Li <loyal...@gmail.com> wrote:

> Hi community,
>
> Has someone ever tried to implement sampling logging for ipc Server? We
> would like to gain more observability for all of the traffics to the
> Namenode.
>
> Thanks,
> Fengnan
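
To make the sampling idea in the reply above more concrete (sample X% of requests, then emit one summary line per interval and per request type), here is a minimal Java sketch. It is only an illustration under stated assumptions: the class `SampledRpcStats` and its methods are hypothetical and not part of Hadoop, it assumes the IPC Server can hand over the request type, queue time and processing time when a call completes, and it assumes some timer calls `flush()` once per interval and writes the returned lines to the log.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper, NOT a Hadoop class: samples RPCs and summarizes them per interval. */
class SampledRpcStats {
  private final double sampleRate; // fraction of requests to keep, 0.0 to 1.0
  // request type -> recorded samples; each sample is {queueTimeMs, processingTimeMs}
  private final Map<String, List<long[]>> samples = new TreeMap<>();

  SampledRpcStats(double sampleRate) {
    this.sampleRate = sampleRate;
  }

  /** Call once per completed request; roughly sampleRate of the calls are recorded. */
  void record(String requestType, long queueTimeMs, long processingTimeMs) {
    if (ThreadLocalRandom.current().nextDouble() >= sampleRate) {
      return; // request not sampled, so no lock is taken on this path
    }
    synchronized (this) {
      samples.computeIfAbsent(requestType, k -> new ArrayList<>())
          .add(new long[] {queueTimeMs, processingTimeMs});
    }
  }

  /** Call once per interval: returns one summary line per request type, then resets. */
  synchronized List<String> flush() {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, List<long[]>> e : samples.entrySet()) {
      List<Long> queue = new ArrayList<>();
      List<Long> proc = new ArrayList<>();
      for (long[] s : e.getValue()) {
        queue.add(s[0]);
        proc.add(s[1]);
      }
      lines.add(String.format(Locale.ROOT,
          "type=%s sampled=%d queueMs[p50=%d p99=%d] procMs[p50=%d p99=%d]",
          e.getKey(), queue.size(),
          percentile(queue, 50), percentile(queue, 99),
          percentile(proc, 50), percentile(proc, 99)));
    }
    samples.clear();
    return lines;
  }

  /** Nearest-rank percentile over a non-empty list of values. */
  static long percentile(List<Long> values, int pct) {
    List<Long> sorted = new ArrayList<>(values);
    Collections.sort(sorted);
    int rank = (pct * sorted.size() + 99) / 100; // integer ceiling of pct% of n
    return sorted.get(Math.max(rank, 1) - 1);
  }
}
```

Keeping raw samples and sorting them at flush time is the simplest option for a sketch; on a busy NN a fixed-bucket histogram per request type would probably be cheaper, and the lock hold and lock wait times mentioned above could be recorded in the same way as extra fields.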