Elek, Marton created HDDS-1935:
----------------------------------

             Summary: Improve the visibility with Ozone Insight tool
                 Key: HDDS-1935
                 URL: https://issues.apache.org/jira/browse/HDDS-1935
             Project: Hadoop Distributed Data Store
          Issue Type: New Feature
            Reporter: Elek, Marton
            Assignee: Elek, Marton




Visibility is a key aspect of operating any Ozone cluster. We need better visibility to improve both correctness and performance. While distributed tracing is a good tool for improving the visibility of performance, we have no powerful tool to check the internal state of the Ozone cluster and debug correctness issues.

To improve the visibility of the internal components, I propose to introduce a new command line application, `ozone insight`.

The new tool will show the selected metrics, logs and configuration for any of the internal components (like replication-manager, pipeline, etc.).

For each insight point we can define the required logs and log levels, metrics and configuration, and the tool can display only the component-specific information during debugging.

h2. Usage

First, we can list the available insight points:

{code}
bash-4.2$ ozone insight list
Available insight points:


  scm.node-manager                     SCM Datanode management related information.
  scm.replica-manager                  SCM closed container replication manager
  scm.event-queue                      Information about the internal async event delivery
  scm.protocol.block-location          SCM Block location protocol endpoint
  scm.protocol.container-location      Planned insight point which is not yet implemented.
  scm.protocol.datanode                Planned insight point which is not yet implemented.
  scm.protocol.security                Planned insight point which is not yet implemented.
  scm.http                             Planned insight point which is not yet implemented.
  om.key-manager                       OM Key Manager
  om.protocol.client                   Ozone Manager RPC endpoint
  om.http                              Planned insight point which is not yet implemented.
  datanode.pipeline[id]                More information about one Ratis datanode ring.
  datanode.rocksdb                     More information about the RocksDB state of one datanode.
  s3g.http                             Planned insight point which is not yet implemented.
{code}

Insight points can define configuration, metrics and/or logs. Configuration can 
be displayed based on the configuration objects:

{code}
ozone insight config scm.protocol.block-location
Configuration for `scm.protocol.block-location` (SCM Block location protocol endpoint)

>>> ozone.scm.block.client.bind.host
       default: 0.0.0.0
       current: 0.0.0.0

The hostname or IP address used by the SCM block client  endpoint to bind


>>> ozone.scm.block.client.port
       default: 9863
       current: 9863

The port number of the Ozone SCM block client service.


>>> ozone.scm.block.client.address
       default: ${ozone.scm.client.address}
       current: scm

The address of the Ozone SCM block client service. If not defined value of 
ozone.scm.client.address is used

{code}
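
As an illustration of the configuration part, here is a minimal sketch of how the tool could resolve such values through the existing OzoneConfiguration class (the key list and the raw/effective distinction are illustrative only; the real tool would also show the documented defaults):

{code}
// Hedged sketch: print the raw (unexpanded) and effective values of the
// configuration keys that belong to one insight point. OzoneConfiguration
// is the existing hdds configuration class; the key list is illustrative.
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public class PrintInsightConfig {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();
    String[] keys = {
        "ozone.scm.block.client.bind.host",
        "ozone.scm.block.client.port",
        "ozone.scm.block.client.address"
    };
    for (String key : keys) {
      System.out.println(">>> " + key);
      System.out.println("       raw    : " + conf.getRaw(key));
      System.out.println("       current: " + conf.get(key));
    }
  }
}
{code}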

Metrics can be retrieved from the Prometheus endpoint:

{code}
ozone insight metrics scm.protocol.block-location
Metrics for `scm.protocol.block-location` (SCM Block location protocol endpoint)

RPC connections

  Open connections: 0
  Dropped connections: 0
  Received bytes: 0
  Sent bytes: 0


RPC queue

  RPC average queue time: 0.0
  RPC call queue length: 0


RPC performance

  RPC processing time average: 0.0
  Number of slow calls: 0


Message type counters

  Number of AllocateScmBlock: 0
  Number of DeleteScmKeyBlocks: 0
  Number of GetScmInfo: 2
  Number of SortDatanodes: 0
{code}
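
Under the hood the tool only needs to scrape and filter the Prometheus text output; a rough sketch of that step follows (the SCM HTTP address and the metric name prefix are assumptions for illustration, not the final implementation):

{code}
// Hedged sketch: scrape the /prom endpoint and keep only the metric lines
// that belong to the selected insight point. Host, port and the filtered
// prefix are assumptions for illustration.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ScrapePrometheusEndpoint {
  public static void main(String[] args) throws Exception {
    URL prom = new URL("http://scm:9876/prom");    // assumed SCM HTTP endpoint
    String prefix = "rpc_";                        // illustrative metric prefix
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(prom.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith(prefix)) {
          System.out.println(line);
        }
      }
    }
  }
}
{code}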

Log levels can be adjusted with the existing logLevel servlet, and the log lines can be collected / streamed via a simple new logstream servlet:

{code}
ozone insight log scm.node-manager
[SCM] 2019-08-08 12:42:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:43:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:44:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:45:37,393 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:46:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
{code}
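
For reference, a hedged sketch of the log level adjustment step: before streaming, the tool can raise the level of the component's logger through the existing /logLevel servlet from hadoop-common (the SCM HTTP address below is an assumption for illustration):

{code}
// Hedged sketch: set the log level of one component via the existing
// /logLevel servlet before streaming its output. The SCM HTTP address is
// an assumption for illustration.
import java.net.HttpURLConnection;
import java.net.URL;

public class AdjustLogLevel {
  public static void main(String[] args) throws Exception {
    String logger = "org.apache.hadoop.hdds.scm.node.SCMNodeManager";
    URL url = new URL(
        "http://scm:9876/logLevel?log=" + logger + "&level=DEBUG");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    System.out.println("logLevel servlet returned HTTP " + conn.getResponseCode());
    conn.disconnect();
  }
}
{code}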

The verbose mode can display the raw messages as well:

{code}
[SCM] 2019-08-08 13:16:37,398 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 13:16:37,400 [TRACE|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] HB is received from [datanode=ozone_datanode_1.ozone_default]:
storageReport {
  storageUuid: "DS-bffe6bee-1166-4502-acf5-57fc16c5aa98"
  storageLocation: "/data/hdds"
  capacity: 470282264576
  scmUsed: 16384
  remaining: 205695963136
  storageType: DISK
  failed: false
}

{code}

h2. Use cases

Ozone Insight can be used for any kind of debugging. Some example problems from yesterday:

 1. Due to a cache problem, volumes were created twice, without any error the second time. With this tool I can check the state of the internal cache, or check whether the volume has been added to RocksDB itself.

 2. After fixing this problem we found a DNS caching issue. The OM responded with an error, but it was not clear where the error was propagated from (it was created in OzoneManagerProtocolClientSideTranslatorPB.handleError). By checking the traffic between SCM and OM it is easy to track the origin of a specific error.

 3. After fixing this problem we found a pipeline problem (reported later as HDDS-1933). With this tool I could check the content of the reports and messages sent to the pipeline manager.

 


h2. Implementation

We can implement the tool without any significant code changes, as it uses existing features:

 * Metrics can be downloaded from the `/prom` endpoint
 * Log levels can be set with the existing `/logLevel` servlet endpoint (from hadoop-common)
 * Log lines can be streamed with a very simple new servlet
 * Configuration can be displayed based on the configuration objects

A new interface can be introduced for `InsightPoint`s. 
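
As a rough sketch (names and methods are assumptions, not a final API), such an interface could let each insight point declare the loggers, metric prefixes and configuration classes that belong to it:

{code}
// Hedged sketch of the proposed InsightPoint abstraction. Method names are
// assumptions for illustration; the point is that each component declares
// which logs, metrics and configuration the CLI should collect for it.
import java.util.List;

public interface InsightPoint {

  /** Human readable description shown by `ozone insight list`. */
  String getDescription();

  /** Loggers whose output should be streamed for this insight point. */
  List<String> getRelatedLoggers();

  /** Prometheus metric name prefixes to display for this insight point. */
  List<String> getMetricPrefixes();

  /** Configuration classes whose values should be printed. */
  List<Class<?>> getConfigurationClasses();
}
{code}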


