[ 
https://issues.apache.org/jira/browse/FLINK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Gao updated FLINK-10981:
----------------------------
    Description: 
The network layer currently provides two metrics, 
_InputBufferPoolUsageGauge_ and _OutputBufferPoolUsageGauge_, which show the 
usage of the input and output buffer pools. When a task has multiple 
inputs (SingleInputGate) or outputs (ResultPartition), each metric reports the 
average usage across them.

 

However, we found that the maximum usage across all input buffer pools or 
output buffer pools of a task is also useful when debugging back pressure. 
Suppose we have a job with the following job graph:

 
{code:java}
          F     
           \
            \
            _\/      
A ---> B ----> C ---> D
       \
        \
         \-> E 
         {code}
Suppose also that D is very slow and therefore causes back pressure, while E 
is very fast and F outputs few records, so the usage of the buffer pools on 
the B -> E and F -> C connections is almost 0.

 

Then the average input/output buffer usage of each task will be:

 
{code:java}
A(100%) --> (100%) B (50%) --> (50%) C (100%) --> (100%) D
{code}
 

 

But the maximum input/output buffer usage of each task will be:

 
{code:java}
A(100%) --> (100%) B (100%) --> (100%) C (100%) --> (100%) D
{code}
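With the maximum usage, users can find the slowest task by looking for the 
first task whose input buffer usage is 100% but whose output usage is less than 
100%. In the example this is D; with the average usage, B would match the rule 
instead, although it is not the slow task.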
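
The example above can be replayed with a minimal sketch. The class and method names below (BufferUsageExample, TaskUsage, firstBottleneck) are hypothetical and not part of Flink. The sketch also assumes that a source has an input usage of 0 and a sink has an output usage of 0, and it only walks the A -> B -> C -> D path:

```java
import java.util.Arrays;
import java.util.List;

public class BufferUsageExample {

    /** Average usage across a task's buffer pools; 0 if the task has none. */
    static double average(double... usages) {
        return Arrays.stream(usages).average().orElse(0.0);
    }

    /** Maximum usage across a task's buffer pools; 0 if the task has none. */
    static double max(double... usages) {
        return Arrays.stream(usages).max().orElse(0.0);
    }

    /** Hypothetical per-task summary: aggregated input/output usage in [0, 1]. */
    static final class TaskUsage {
        final String name;
        final double input;
        final double output;

        TaskUsage(String name, double input, double output) {
            this.name = name;
            this.input = input;
            this.output = output;
        }
    }

    /** First task (in upstream-to-downstream order) with full input but non-full output. */
    static String firstBottleneck(List<TaskUsage> tasks) {
        for (TaskUsage t : tasks) {
            if (t.input >= 1.0 && t.output < 1.0) {
                return t.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // B writes to C (full) and E (idle); C reads from B (full) and F (idle).
        double[] bOut = {1.0, 0.0};
        double[] cIn = {1.0, 0.0};

        List<TaskUsage> byAverage = Arrays.asList(
                new TaskUsage("A", 0.0, 1.0),
                new TaskUsage("B", 1.0, average(bOut)),
                new TaskUsage("C", average(cIn), 1.0),
                new TaskUsage("D", 1.0, 0.0));

        List<TaskUsage> byMax = Arrays.asList(
                new TaskUsage("A", 0.0, 1.0),
                new TaskUsage("B", 1.0, max(bOut)),
                new TaskUsage("C", max(cIn), 1.0),
                new TaskUsage("D", 1.0, 0.0));

        System.out.println(firstBottleneck(byAverage)); // B (misleading)
        System.out.println(firstBottleneck(byMax));     // D (the slow task)
    }
}
```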

 

 

If it is reasonable to show the maximum input/output buffer usage, I think 
there are three options:
 # Modify the current computation logic of _InputBufferPoolUsageGauge_ and 
_OutputBufferPoolUsageGauge_.
 # Add two new metrics items, _InputBufferPoolMaxUsageGauge_ and 
_OutputBufferPoolMaxUsageGauge_.
 # Try to show distinct usage for each input/output buffer pool.

I think the second option is probably the preferable one.
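
As a rough illustration of the second option, a maximum-usage gauge could look like the sketch below. This is only a sketch under assumptions: the local Gauge interface is a stand-in with the same shape as Flink's Gauge<T> (a single getValue() method), and PoolStats, MaxUsageGauge and pool(...) are hypothetical names rather than existing Flink classes.

```java
import java.util.Arrays;
import java.util.List;

public class MaxUsageGaugeSketch {

    /** Stand-in with the same shape as Flink's Gauge<T>. */
    interface Gauge<T> {
        T getValue();
    }

    /** Hypothetical view of one buffer pool; not Flink's BufferPool API. */
    interface PoolStats {
        int usedBuffers();
        int totalBuffers();
    }

    /** Reports the highest usage ratio among the given pools. */
    static final class MaxUsageGauge implements Gauge<Float> {
        private final List<PoolStats> pools;

        MaxUsageGauge(List<PoolStats> pools) {
            this.pools = pools;
        }

        @Override
        public Float getValue() {
            float max = 0.0f;
            for (PoolStats pool : pools) {
                int total = pool.totalBuffers();
                // Skip empty pools to avoid dividing by zero.
                if (total > 0) {
                    max = Math.max(max, (float) pool.usedBuffers() / total);
                }
            }
            return max;
        }
    }

    /** Helper that builds a fixed PoolStats for the example. */
    static PoolStats pool(int used, int total) {
        return new PoolStats() {
            @Override
            public int usedBuffers() {
                return used;
            }

            @Override
            public int totalBuffers() {
                return total;
            }
        };
    }

    public static void main(String[] args) {
        // One full pool and one idle pool: the average would be 0.5, the maximum is 1.0.
        Gauge<Float> gauge = new MaxUsageGauge(Arrays.asList(pool(8, 8), pool(0, 8)));
        System.out.println(gauge.getValue()); // 1.0
    }
}
```

Keeping the existing average gauges untouched and registering the new gauges next to them would avoid changing the meaning of metrics that users may already rely on.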

 

What do you think?

 

 

 


> Add or modify metrics to show the maximum usage of 
> InputBufferPool/OutputBufferPool to help debugging back pressure
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10981
>                 URL: https://issues.apache.org/jira/browse/FLINK-10981
>             Project: Flink
>          Issue Type: Improvement
>          Components: Metrics, Network
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)