Hi Alexei,

Did you consider general-purpose APM
<https://en.wikipedia.org/wiki/Application_performance_management> tools
like free InspectIT <https://github.com/inspectIT/inspectIT> or commercial
DynaTrace <https://www.dynatrace.com/> or AppDynamics
<https://www.appdynamics.com/>?

Java APM tools do not require writing code or even instrumenting Ignite
binaries: you just attach them as javaagent
<https://zeroturnaround.com/rebellabs/how-to-inspect-classes-in-your-jvm/> to
Ignite JVM and then you can configure "sensors" to track whatever API you
want. APM tools collect metrics from the sensors and provide sophisticated
analysis tools.

DynaTrace claims they have 2%-7% overhead (depending on application) but
you can always detach the tool if you do not always need it in production.

I did not try APM for Ignite myself but it might work well.

On Tue, Nov 27, 2018 at 4:37 PM Alexei Scherbakov <
[email protected]> wrote:

> Igniters,
>
> At work I often have to solve performance issues with Ignite cluster
> without having access to source code of running user application.
>
> Looks like Ignite have limited capabilities to identify bottlenecks without
> extensive profiling on server and client side (JFR recording , sampling
> profilers, regular thread dumps,  etc), which is not always possible.
>
> Even having profiling data not always helpful for determining several types
> of bottlenecks, on example, if where is a contention on single
> key/partition.
>
> I propose to implement new feature, like lightweight profiling of message
> processing.
>
> The feature will allow to have view on message processing statistics for
> each node and for all grid nodes.
>
> In short, it's necessary to track each message execution in executors and
> record execution statistics like synchronous execution time in executor
> thread and waiting time in queue.
>
> Full description:
>
> 1. Implement detailed tracking of message waiting in queue and actual
> processing by executors with splitting to several time bins. Example of
> detailed statistics for each processed message:
>
>
> Processing time(%)
>                                  Message       Total     Average(ms)
>      < 1 ms   < 10 ms    < 30ms    < 50ms   < 100ms   < 250ms   <
> 500ms   < 750ms  < 1000ms  > 1000ms
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                 GridNearSingleGetRequest   904311372        0.023000
>   904240521     57394      7242      3961      1932       229
> 61        24         4         4
>                GridNearSingleGetResponse   340134416        0.041000
>   340118791     11660      1167       729       901      1001
> 158         8         1         0
>                      GridNearLockRequest    77088689        0.079000
>    77073458     11945      2299       643       311        31
> 2         0         0         0
>  GridNearAtomicSingleUpdateInvokeRequest    39645752        0.298000
>    39580914     28222      6469      4638      9870     13414
> 2087       137         1         0
>         GridDhtAtomicSingleUpdateRequest    37636829        0.277000
>    37579375     23247      5915      4210      8954     12917
> 2048       162         1         0
>      GridDhtAtomicDeferredUpdateResponse    33580198        0.002000
>    33579805       337        51         3         1         1
> 0         0         0         0
>                 GridNearTxPrepareRequest    21667790        0.238000
>    21078069    580126      1622      1261      2531      3631
> 496        40         0        14
>                 GridDhtTxPrepareResponse    20949873        0.316000
>    17130161   3803105      4615      3162      4489      3721
> 577        34         1         8
>                  GridDhtTxPrepareRequest    20949838        0.501000
>    16158732   4750217     16183      5735      8472      8994
> 1353        88        11        53
>                  GridDhtTxFinishResponse    13835065        0.007000
>    13832519      2476        27        28        14         1
> 0         0         0         0
>                   GridDhtTxFinishRequest    13835028        0.547000
>    12084106   1736789      8971      2340      1792       807
> 118        41         4        60
>                  GridNearTxFinishRequest    13762197        0.725000
>    11811828   1942499      4441      1400      1201       524
> 89        34        19       162
>                GridDhtAtomicNearResponse     2784422        0.122000
>     2783393      1022         5         2         0         0
> 0         0         0         0
>                       GridNearGetRequest     2360483        0.484000
>     2345937     14129       244       101        64         8
> 0         0         0         0
>                      GridNearGetResponse     1984243        0.054000
>     1981905      2327         8         1         1         1
> 0         0         0         0
>                GridNearTxPrepareResponse      192856        0.153000
>      192660       188         1         5         1         1
> 0         0         0         0
>                     GridNearLockResponse      192780        0.091000
>      192667       107         3         0         3         0
> 0         0         0         0
>                 GridNearTxFinishResponse         177        0.822000
>         129        47         1         0         0         0
> 0         0         0         0
>        GridNearAtomicSingleUpdateRequest         124        4.803000
>          52        53        19         0         0         0
> 0         0         0         0
>             GridNearAtomicUpdateResponse         120        0.448000
>         110        10         0         0         0         0
> 0         0         0         0
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                                           1544912252
>  1531765132  12965900     59283     28219     40537     45281
> 6989       568        42       301
>
>
> Queue waiting time(%)
>                                  Message       Total     Average(ms)
>      < 1 ms   < 10 ms    < 30ms    < 50ms   < 100ms   < 250ms   <
> 500ms   < 750ms  < 1000ms  > 1000ms
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                 GridNearSingleGetRequest   904311372        0.023000
>   903130400   1051499     33248     18049     34044     38809
> 4714       212         8       389
>                GridNearSingleGetResponse   340134416        0.041000
>   339708055    419479      3732      1004      1222       737
> 117         2         1        67
>                      GridNearLockRequest    77088689        0.079000
>    76949792    129666      2003      1624      2585      2531
> 390        18         2        78
>  GridNearAtomicSingleUpdateInvokeRequest    39645752        0.298000
>    39572801     53913      3949      2960      5487      5583
> 963        45         5        46
>         GridDhtAtomicSingleUpdateRequest    37636829        0.277000
>    37561043     57294      3721      2736      5339      5716
> 879        54         6        41
>      GridDhtAtomicDeferredUpdateResponse    33580198        0.002000
>    33512734     45873      4427      3340      6342      6363
> 1033        50        10        26
>                 GridNearTxPrepareRequest    21667790        0.238000
>    21584416     76283      1681      1026      1948      2027
> 330        31         2        46
>                 GridDhtTxPrepareResponse    20949873        0.316000
>    20862632     83578      1214       633       821       817
> 118        26         8        26
>                  GridDhtTxPrepareRequest    20949838        0.501000
>    20867551     72264      2265      1516      2773      2905
> 479        38         6        41
>                  GridDhtTxFinishResponse    13835065        0.007000
>    13808274     26487       150        68        55        21
> 3         1         0         6
>                   GridDhtTxFinishRequest    13835028        0.547000
>    13809380     25142       188       127       102        59
> 9         2         4        15
>                  GridNearTxFinishRequest    13762197        0.725000
>    13743925     17365       241       171       278       191
> 16         4         0         6
>                GridDhtAtomicNearResponse     2784422        0.122000
>     2780600      3767        29         9         7         9
> 1         0         0         0
>                       GridNearGetRequest     2360483        0.484000
>     2355963      3762       194       114       184       228
> 38         0         0         0
>                      GridNearGetResponse     1984243        0.054000
>     1979404      4721        31         9        49        28
> 1         0         0         0
>                GridNearTxPrepareResponse      192856        0.153000
>      192852         2         0         1         1         0
> 0         0         0         0
>                     GridNearLockResponse      192780        0.091000
>      192777         3         0         0         0         0
> 0         0         0         0
>                 GridNearTxFinishResponse         177        0.822000
>         177         0         0         0         0         0
> 0         0         0         0
>        GridNearAtomicSingleUpdateRequest         124        4.803000
>         124         0         0         0         0         0
> 0         0         0         0
>             GridNearAtomicUpdateResponse         120        0.448000
>         119         1         0         0         0         0
> 0         0         0         0
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                                           1544912252
>  1542613019   2071099     57073     33387     61237     66024
> 9091       483        52       787
>
>
> This will allow to understand cluster load type and presence of problems
> with increased latency in message processing. In general all messages
> should be processed within time < 30ms.
>
> 2. Log messages which are waiting too long or have been processed for too
> long (this is configured threshold).
>
> Track the following: message enqueue time, waiting in queue until
> processing is started, head of queue before starting to wait, queue size
> before starting to wait, total waiting time, total processing time, queue
> size after processing, message content. Example:
>
> Slow message: *enqueueTs*=2018-11-27 15:10:22.241, *waitTime*=0.048,
> *procTime*=305.186, *messageId*=3a3064a9, *queueSzBefore*=0,
> *headMessageId*=null,
> *queueSzAfter*=0, *message*=GridNearTxFinishRequest [miniId=1,
> mvccSnapshot=null, super=GridDistributedTxFinishRequest
> [topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0],
> futId=199a3155761-f379f312-ad4b-4181-acc5-0aacb3391f07, threadId=296,
> commitVer=null, invalidate=false, commit=true, baseVer=null, txSize=0,
> sys=false, plc=2, subjId=dda703a0-69ee-47cf-9b9a-bf3dc9309feb,
> taskNameHash=0, flags=32, syncMode=FULL_SYNC, txState=IgniteTxStateImpl
> [activeCacheIds=[644280847], recovery=false, mvccEnabled=false,
> txMap=HashSet [IgniteTxEntry [key=KeyCacheObjectImpl [part=8, val=8,
> hasValBytes=true], cacheId=644280847, txKey=IgniteTxKey
> [key=KeyCacheObjectImpl [part=8, val=8, hasValBytes=true],
> cacheId=644280847], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
> oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
> conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
> filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=false,
> entry=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8,
> hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0,
> nodeOrder=0], hash=8, extras=GridCacheObsoleteEntryExtras
> [obsoleteVer=GridCacheVersion [topVer=2147483647, order=0, nodeOrder=0]],
> flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
> [rdrs=ReaderId[] [], part=8, super=], prepared=0, locked=false,
> nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
> flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
> [topVer=154800617, order=1543320621588, nodeOrder=1]], IgniteTxEntry
> [key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true],
> cacheId=644280847, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=11,
> val=11, hasValBytes=true], cacheId=644280847], val=[op=READ, val=null],
> prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null],
> entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
> explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [],
> filtersPassed=false, filtersSet=false, entry=GridCacheMapEntry
> [key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true], val=null,
> ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=11,
> extras=GridCacheObsoleteEntryExtras [obsoleteVer=GridCacheVersion
> [topVer=2147483647, order=0, nodeOrder=0]],
> flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
> [rdrs=ReaderId[] [], part=11, super=], prepared=0, locked=false,
> nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
> flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
> [topVer=154800617, order=1543320621588, nodeOrder=1]]]],
> super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=154800617,
> order=1543320621587, nodeOrder=4], committedVers=null, rolledbackVers=null,
> cnt=0, super=GridCacheIdMessage [cacheId=0]]]]
>
> This will allow to understand type of activity which produces latency
> issues and possibly identify culprit code.
>
> 3. Track all synchronous waits while processing message: checkpoint locks,
> segment locks, page locks, cold page reads from disk, etc.
>
> 4. Allow extensions to collection additional statistics for specific
> message type. On example, for *TxFinishRequest we could track total
> entities written for each cache and size in bytes. For GridNearLockRequest
> we could track keys with max lock queue length for identifying contention.
>
> 5. Implement additional logging, JMX metrics and control.sh command for
> printing currently collected profiling data.
>
> 6. Implementation should be as lightweight as possible.
>
> Maybe something else.
>
> BTW, I already started working on it and want to contribute my work to
> community. Ticket with PR (not finished) [1], check for
> CacheMessageStatsTest and CacheMessageStatsIndexingTest
>
> [1] https://issues.apache.org/jira/browse/IGNITE-10418
>
> Thoughts ?
>
> --
>
> Best regards,
> Alexei Scherbakov
>


-- 
Best regards,
Alexey

Reply via email to