Clarification: point 5 is about buffer pool and disk access statistics, not network.
On Tue, Nov 27, 2018 at 6:02 PM Vladimir Ozerov <voze...@gridgain.com> wrote:

> Alexey,
>
> I would say that to implement this feature successfully, we first need to
> clearly understand the specific scenarios we want to target, and only then
> plan the implementation in small iterations. Please take into account the
> following points:
>
> 1) Splitting by message type might be too fine-grained for the first
> iteration. For example, we have N different message types processed in the
> system pool. Add several messages causing long-running tasks (e.g. invoke),
> and dequeue time will grow for all other message types in the same queue. I
> think we can start with simpler metrics - per-pool queueing time, pool
> throughput, pool task latencies. This would be enough for a wide range of
> use cases. We may expand it in the future if needed.
> 2) Some IO statistics were already implemented as part of IGNITE-6868 [1].
> See TcpCommunicationSpiMBeanImpl. You may find it useful for your task.
> 3) The public API should be designed upfront. Normally it should include
> JMX and system SQL views (e.g. like [2]). JMX is useful for external tools;
> SQL views allow access to performance data from all platforms at once. This
> is critical.
> 4) It should be possible to enable/disable such metrics at runtime.
> Probably the infrastructure from IGNITE-369 [3] could be reused for this
> purpose (see GridCacheProcessor.EnableStatisticsFuture).
> 5) Page IO access statistics are already under development [4]. The first
> part will be merged very soon.
> 6) Performance effects must be measured extremely carefully in all modes
> (in-memory, persistence, background/log_only/fsync), because you are likely
> to change very performance-sensitive code pieces. We had a lot of
> performance issues when implementing IGNITE-6868 [1].
>
> Vladimir.
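[Editorial aside: the per-pool metrics suggested in point 1 (queueing time, throughput, task latency) can be sketched by wrapping tasks submitted to an executor. This is a minimal illustration against a plain java.util.concurrent pool; the class and field names are hypothetical, not Ignite API.]

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: per-pool queue-wait and run-time accounting. */
public class PoolMetricsSketch {
    public static final AtomicLong totalQueueWaitNanos = new AtomicLong();
    public static final AtomicLong totalRunNanos = new AtomicLong();
    public static final AtomicLong completedTasks = new AtomicLong();

    /** Wraps a task so enqueue-to-start (queue wait) and start-to-finish (run) times are recorded. */
    public static Runnable instrument(Runnable task) {
        long enqueuedAt = System.nanoTime(); // captured at submit time
        return () -> {
            long startedAt = System.nanoTime();
            totalQueueWaitNanos.addAndGet(startedAt - enqueuedAt);
            try {
                task.run();
            }
            finally {
                totalRunNanos.addAndGet(System.nanoTime() - startedAt);
                completedTasks.incrementAndGet();
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        for (int i = 0; i < 10; i++)
            pool.submit(instrument(() -> { /* simulated message processing */ }));

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        System.out.println("tasks=" + completedTasks.get());
    }
}
```

In real code the wrapping would live inside the pool itself (e.g. ThreadPoolExecutor's beforeExecute/afterExecute hooks) so callers need no changes.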
> [1] https://issues.apache.org/jira/browse/IGNITE-6868
> [2] https://issues.apache.org/jira/browse/IGNITE-7700
> [3] https://issues.apache.org/jira/browse/IGNITE-369
> [4] https://cwiki.apache.org/confluence/display/IGNITE/IEP-27%3A+Page+IO+statistics
>
> On Tue, Nov 27, 2018 at 5:04 PM Alexey Kukushkin <kukushkinale...@gmail.com> wrote:
>
>> Hi Alexei,
>>
>> Did you consider general-purpose APM
>> <https://en.wikipedia.org/wiki/Application_performance_management> tools
>> like the free InspectIT <https://github.com/inspectIT/inspectIT> or the
>> commercial DynaTrace <https://www.dynatrace.com/> or AppDynamics
>> <https://www.appdynamics.com/>?
>>
>> Java APM tools do not require writing code or even instrumenting Ignite
>> binaries: you just attach them as a javaagent
>> <https://zeroturnaround.com/rebellabs/how-to-inspect-classes-in-your-jvm/>
>> to the Ignite JVM and then configure "sensors" to track whatever API you
>> want. APM tools collect metrics from the sensors and provide sophisticated
>> analysis tools.
>>
>> DynaTrace claims 2%-7% overhead (depending on the application), but you
>> can always detach the tool if you do not need it in production all the
>> time.
>>
>> I have not tried APM for Ignite myself, but it might work well.
>>
>> On Tue, Nov 27, 2018 at 4:37 PM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>>
>> > Igniters,
>> >
>> > At work I often have to solve performance issues with an Ignite cluster
>> > without having access to the source code of the running user application.
>> >
>> > Ignite seems to have limited capabilities for identifying bottlenecks
>> > without extensive profiling on the server and client side (JFR
>> > recordings, sampling profilers, regular thread dumps, etc.), which is
>> > not always possible.
>> >
>> > Even when profiling data is available, it is not always helpful for
>> > determining several types of bottlenecks, for example, whether there is
>> > contention on a single key/partition.
>> >
>> > I propose to implement a new feature: lightweight profiling of message
>> > processing.
>> >
>> > The feature will provide a view of message processing statistics for
>> > each node and for all grid nodes.
>> >
>> > In short, it is necessary to track each message execution in the
>> > executors and record execution statistics such as synchronous execution
>> > time in the executor thread and waiting time in the queue.
>> >
>> > Full description:
>> >
>> > 1. Implement detailed tracking of message waiting in the queue and of
>> > actual processing by the executors, split into several time bins.
>> > Example of detailed statistics for each processed message:
>> >
>> > Processing time (message count per bin)
>> > Message                                       Total  Avg(ms)       <1ms     <10ms  <30ms  <50ms  <100ms  <250ms  <500ms  <750ms  <1000ms  >1000ms
>> > -----------------------------------------------------------------------------------------------------------------------------------------------
>> > GridNearSingleGetRequest                  904311372    0.023  904240521     57394   7242   3961    1932     229      61      24        4        4
>> > GridNearSingleGetResponse                 340134416    0.041  340118791     11660   1167    729     901    1001     158       8        1        0
>> > GridNearLockRequest                        77088689    0.079   77073458     11945   2299    643     311      31       2       0        0        0
>> > GridNearAtomicSingleUpdateInvokeRequest    39645752    0.298   39580914     28222   6469   4638    9870   13414    2087     137        1        0
>> > GridDhtAtomicSingleUpdateRequest           37636829    0.277   37579375     23247   5915   4210    8954   12917    2048     162        1        0
>> > GridDhtAtomicDeferredUpdateResponse        33580198    0.002   33579805       337     51      3       1       1       0       0        0        0
>> > GridNearTxPrepareRequest                   21667790    0.238   21078069    580126   1622   1261    2531    3631     496      40        0       14
>> > GridDhtTxPrepareResponse                   20949873    0.316   17130161   3803105   4615   3162    4489    3721     577      34        1        8
>> > GridDhtTxPrepareRequest                    20949838    0.501   16158732   4750217  16183   5735    8472    8994    1353      88       11       53
>> > GridDhtTxFinishResponse                    13835065    0.007   13832519      2476     27     28      14       1       0       0        0        0
>> > GridDhtTxFinishRequest                     13835028    0.547   12084106   1736789   8971   2340    1792     807     118      41        4       60
>> > GridNearTxFinishRequest                    13762197    0.725   11811828   1942499   4441   1400    1201     524      89      34       19      162
>> > GridDhtAtomicNearResponse                   2784422    0.122    2783393      1022      5      2       0       0       0       0        0        0
>> > GridNearGetRequest                          2360483    0.484    2345937     14129    244    101      64       8       0       0        0        0
>> > GridNearGetResponse                         1984243    0.054    1981905      2327      8      1       1       1       0       0        0        0
>> > GridNearTxPrepareResponse                    192856    0.153     192660       188      1      5       1       1       0       0        0        0
>> > GridNearLockResponse                         192780    0.091     192667       107      3      0       3       0       0       0        0        0
>> > GridNearTxFinishResponse                        177    0.822        129        47      1      0       0       0       0       0        0        0
>> > GridNearAtomicSingleUpdateRequest               124    4.803         52        53     19      0       0       0       0       0        0        0
>> > GridNearAtomicUpdateResponse                    120    0.448        110        10      0      0       0       0       0       0        0        0
>> > -----------------------------------------------------------------------------------------------------------------------------------------------
>> > Total                                    1544912252          1531765132  12965900  59283  28219   40537   45281    6989     568       42      301
>> >
>> > Queue waiting time (message count per bin)
>> > Message                                       Total  Avg(ms)       <1ms     <10ms  <30ms  <50ms  <100ms  <250ms  <500ms  <750ms  <1000ms  >1000ms
>> > -----------------------------------------------------------------------------------------------------------------------------------------------
>> > GridNearSingleGetRequest                  904311372    0.023  903130400   1051499  33248  18049   34044   38809    4714     212        8      389
>> > GridNearSingleGetResponse                 340134416    0.041  339708055    419479   3732   1004    1222     737     117       2        1       67
>> > GridNearLockRequest                        77088689    0.079   76949792    129666   2003   1624    2585    2531     390      18        2       78
>> > GridNearAtomicSingleUpdateInvokeRequest    39645752    0.298   39572801     53913   3949   2960    5487    5583     963      45        5       46
>> > GridDhtAtomicSingleUpdateRequest           37636829    0.277   37561043     57294   3721   2736    5339    5716     879      54        6       41
>> > GridDhtAtomicDeferredUpdateResponse        33580198    0.002   33512734     45873   4427   3340    6342    6363    1033      50       10       26
>> > GridNearTxPrepareRequest                   21667790    0.238   21584416     76283   1681   1026    1948    2027     330      31        2       46
>> > GridDhtTxPrepareResponse                   20949873    0.316   20862632     83578   1214    633     821     817     118      26        8       26
>> > GridDhtTxPrepareRequest                    20949838    0.501   20867551     72264   2265   1516    2773    2905     479      38        6       41
>> > GridDhtTxFinishResponse                    13835065    0.007   13808274     26487    150     68      55      21       3       1        0        6
>> > GridDhtTxFinishRequest                     13835028    0.547   13809380     25142    188    127     102      59       9       2        4       15
>> > GridNearTxFinishRequest                    13762197    0.725   13743925     17365    241    171     278     191      16       4        0        6
>> > GridDhtAtomicNearResponse                   2784422    0.122    2780600      3767     29      9       7       9       1       0        0        0
>> > GridNearGetRequest                          2360483    0.484    2355963      3762    194    114     184     228      38       0        0        0
>> > GridNearGetResponse                         1984243    0.054    1979404      4721     31      9      49      28       1       0        0        0
>> > GridNearTxPrepareResponse                    192856    0.153     192852         2      0      1       1       0       0       0        0        0
>> > GridNearLockResponse                         192780    0.091     192777         3      0      0       0       0       0       0        0        0
>> > GridNearTxFinishResponse                        177    0.822        177         0      0      0       0       0       0       0        0        0
>> > GridNearAtomicSingleUpdateRequest               124    4.803        124         0      0      0       0       0       0       0        0        0
>> > GridNearAtomicUpdateResponse                    120    0.448        119         1      0      0       0       0       0       0        0        0
>> > -----------------------------------------------------------------------------------------------------------------------------------------------
>> > Total                                    1544912252          1542613019   2071099  57073  33387   61237   66024    9091     483       52      787
>> >
>> > This will let us understand the cluster load type and spot problems
>> > with increased latency in message processing. In general, all messages
>> > should be processed in under 30 ms.
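[Editorial aside: the time bins in the tables above (< 1 ms up to > 1000 ms) amount to a small fixed histogram. A minimal sketch of the bucketing logic, with hypothetical names, not the actual IGNITE-10418 code:]

```java
/** Hypothetical sketch of the fixed latency bins used in the statistics tables. */
public class TimeBins {
    /** Upper bounds in milliseconds for bins 0..8; bin 9 is "> 1000 ms". */
    static final long[] BOUNDS_MS = {1, 10, 30, 50, 100, 250, 500, 750, 1000};

    /** Returns the bin index (0..9) for a duration in milliseconds. */
    public static int binFor(long durationMs) {
        for (int i = 0; i < BOUNDS_MS.length; i++)
            if (durationMs < BOUNDS_MS[i])
                return i;

        return BOUNDS_MS.length; // Overflow bin: > 1000 ms.
    }
}
```

Per message type, one long[10] counter array per node is then enough to produce a table row, which keeps the hot-path cost to one compare-and-increment.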
>> >
>> > 2. Log messages that have waited or been processed for too long
>> > (against a configurable threshold).
>> >
>> > Track the following: message enqueue time, time spent waiting in the
>> > queue until processing starts, the head of the queue before waiting
>> > starts, the queue size before waiting starts, total waiting time, total
>> > processing time, the queue size after processing, and the message
>> > content. Example:
>> >
>> > Slow message: *enqueueTs*=2018-11-27 15:10:22.241, *waitTime*=0.048,
>> > *procTime*=305.186, *messageId*=3a3064a9, *queueSzBefore*=0,
>> > *headMessageId*=null, *queueSzAfter*=0, *message*=GridNearTxFinishRequest
>> > [miniId=1, mvccSnapshot=null, super=GridDistributedTxFinishRequest
>> > [topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0],
>> > futId=199a3155761-f379f312-ad4b-4181-acc5-0aacb3391f07, threadId=296,
>> > commitVer=null, invalidate=false, commit=true, baseVer=null, txSize=0,
>> > sys=false, plc=2, subjId=dda703a0-69ee-47cf-9b9a-bf3dc9309feb,
>> > taskNameHash=0, flags=32, syncMode=FULL_SYNC, txState=IgniteTxStateImpl
>> > [activeCacheIds=[644280847], recovery=false, mvccEnabled=false,
>> > txMap=HashSet [IgniteTxEntry [key=KeyCacheObjectImpl [part=8, val=8,
>> > hasValBytes=true], cacheId=644280847, txKey=IgniteTxKey
>> > [key=KeyCacheObjectImpl [part=8, val=8, hasValBytes=true],
>> > cacheId=644280847], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
>> > oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
>> > conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
>> > filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=false,
>> > entry=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8,
>> > hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0,
>> > nodeOrder=0], hash=8, extras=GridCacheObsoleteEntryExtras
>> > [obsoleteVer=GridCacheVersion [topVer=2147483647, order=0, nodeOrder=0]],
>> > flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
>> > [rdrs=ReaderId[] [], part=8, super=], prepared=0, locked=false,
>> > nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
>> > flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
>> > [topVer=154800617, order=1543320621588, nodeOrder=1]], IgniteTxEntry
>> > [key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true],
>> > cacheId=644280847, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=11,
>> > val=11, hasValBytes=true], cacheId=644280847], val=[op=READ, val=null],
>> > prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null],
>> > entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
>> > explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [],
>> > filtersPassed=false, filtersSet=false, entry=GridCacheMapEntry
>> > [key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true], val=null,
>> > ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=11,
>> > extras=GridCacheObsoleteEntryExtras [obsoleteVer=GridCacheVersion
>> > [topVer=2147483647, order=0, nodeOrder=0]],
>> > flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
>> > [rdrs=ReaderId[] [], part=11, super=], prepared=0, locked=false,
>> > nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
>> > flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
>> > [topVer=154800617, order=1543320621588, nodeOrder=1]]]],
>> > super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=154800617,
>> > order=1543320621587, nodeOrder=4], committedVers=null,
>> > rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=0]]]]
>> >
>> > This will let us understand the type of activity that produces latency
>> > issues and possibly identify the culprit code.
>> >
>> > 3. Track all synchronous waits during message processing: checkpoint
>> > locks, segment locks, page locks, cold page reads from disk, etc.
>> >
>> > 4. Allow extensions that collect additional statistics for a specific
>> > message type.
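[Editorial aside: the slow-message log line in point 2 boils down to a threshold check on wait and processing time plus a formatted record. A minimal sketch, with a hypothetical threshold and class name, not the actual patch:]

```java
import java.util.Locale;

/** Hypothetical sketch: detect and format messages exceeding a latency threshold. */
public class SlowMessageLog {
    /** Assumed configurable threshold; 100 ms is an arbitrary example value. */
    static final double THRESHOLD_MS = 100.0;

    /** A message is "slow" if either its queue wait or its processing time exceeds the threshold. */
    public static boolean isSlow(double waitTimeMs, double procTimeMs) {
        return waitTimeMs > THRESHOLD_MS || procTimeMs > THRESHOLD_MS;
    }

    /** Formats a record in the shape of the example log line above. */
    public static String format(String enqueueTs, double waitMs, double procMs,
        String messageId, int queueSzBefore, int queueSzAfter, String message) {
        return String.format(Locale.US,
            "Slow message: enqueueTs=%s, waitTime=%.3f, procTime=%.3f, messageId=%s, "
                + "queueSzBefore=%d, queueSzAfter=%d, message=%s",
            enqueueTs, waitMs, procMs, messageId, queueSzBefore, queueSzAfter, message);
    }
}
```

In the example above, waitTime=0.048 is below any sensible threshold while procTime=305.186 is not, so the message is logged for its processing time.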
For example, for *TxFinishRequest we could track the total number of
>> > entries written for each cache and their size in bytes. For
>> > GridNearLockRequest we could track the keys with the longest lock
>> > queues to identify contention.
>> >
>> > 5. Implement additional logging, JMX metrics and a control.sh command
>> > for printing the currently collected profiling data.
>> >
>> > 6. The implementation should be as lightweight as possible.
>> >
>> > Maybe something else.
>> >
>> > BTW, I have already started working on this and want to contribute my
>> > work to the community. Ticket with a (not yet finished) PR [1]; check
>> > CacheMessageStatsTest and CacheMessageStatsIndexingTest.
>> >
>> > [1] https://issues.apache.org/jira/browse/IGNITE-10418
>> >
>> > Thoughts?
>> >
>> > --
>> > Best regards,
>> > Alexei Scherbakov
>>
>> --
>> Best regards,
>> Alexey