Hi Alexei,

Did you consider general-purpose APM <https://en.wikipedia.org/wiki/Application_performance_management> tools, such as the free InspectIT <https://github.com/inspectIT/inspectIT> or the commercial DynaTrace <https://www.dynatrace.com/> and AppDynamics <https://www.appdynamics.com/>?

Java APM tools do not require writing code or even instrumenting the Ignite binaries: you just attach them as a javaagent <https://zeroturnaround.com/rebellabs/how-to-inspect-classes-in-your-jvm/> to the Ignite JVM and then configure "sensors" to track whatever API you want. The APM tool collects metrics from the sensors and provides sophisticated analysis tools. DynaTrace claims a 2%-7% overhead (depending on the application), and you can always detach the tool when you do not need it in production. I have not tried APM with Ignite myself, but it might work well.
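Attaching an agent is a single JVM option. A minimal sketch, assuming a hypothetical agent jar path and the standard ignite.sh startup script picking up JVM_OPTS:

    # Agent jar path is hypothetical; the exact jar and options depend on the APM tool.
    export JVM_OPTS="$JVM_OPTS -javaagent:/opt/apm/agent.jar"
    bin/ignite.sh config/default-config.xml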

On Tue, Nov 27, 2018 at 4:37 PM Alexei Scherbakov <[email protected]> wrote:

> Igniters,
>
> At work I often have to solve performance issues with an Ignite cluster
> without having access to the source code of the running user application.
>
> It looks like Ignite has limited capabilities for identifying bottlenecks
> without extensive profiling on the server and client side (JFR recordings,
> sampling profilers, regular thread dumps, etc.), which is not always
> possible.
>
> Even having profiling data is not always helpful for determining several
> types of bottlenecks, for example, if there is contention on a single
> key/partition.
>
> I propose to implement a new feature: lightweight profiling of message
> processing.
>
> The feature will provide a view of message processing statistics for each
> node and for all grid nodes.
>
> In short, it's necessary to track each message execution in the executors
> and record execution statistics such as synchronous execution time in the
> executor thread and waiting time in the queue.
>
> Full description:
>
> 1. Implement detailed tracking of message queue waiting and actual
> processing by executors, split into several time bins. Example of detailed
> statistics for each processed message (each time bin below contains the
> number of messages that fell into it):
>
> Processing time
> Message                                          Total   Avg(ms)       <1ms    <10ms  <30ms  <50ms  <100ms  <250ms  <500ms  <750ms  <1000ms  >1000ms
> ------------------------------------------------------------------------------------------------------------------------------------------------
> GridNearSingleGetRequest                 904311372  0.023000  904240521    57394   7242   3961    1932     229      61      24        4        4
> GridNearSingleGetResponse                340134416  0.041000  340118791    11660   1167    729     901    1001     158       8        1        0
> GridNearLockRequest                       77088689  0.079000   77073458    11945   2299    643     311      31       2       0        0        0
> GridNearAtomicSingleUpdateInvokeRequest   39645752  0.298000   39580914    28222   6469   4638    9870   13414    2087     137        1        0
> GridDhtAtomicSingleUpdateRequest          37636829  0.277000   37579375    23247   5915   4210    8954   12917    2048     162        1        0
> GridDhtAtomicDeferredUpdateResponse       33580198  0.002000   33579805      337     51      3       1       1       0       0        0        0
> GridNearTxPrepareRequest                  21667790  0.238000   21078069   580126   1622   1261    2531    3631     496      40        0       14
> GridDhtTxPrepareResponse                  20949873  0.316000   17130161  3803105   4615   3162    4489    3721     577      34        1        8
> GridDhtTxPrepareRequest                   20949838  0.501000   16158732  4750217  16183   5735    8472    8994    1353      88       11       53
> GridDhtTxFinishResponse                   13835065  0.007000   13832519     2476     27     28      14       1       0       0        0        0
> GridDhtTxFinishRequest                    13835028  0.547000   12084106  1736789   8971   2340    1792     807     118      41        4       60
> GridNearTxFinishRequest                   13762197  0.725000   11811828  1942499   4441   1400    1201     524      89      34       19      162
> GridDhtAtomicNearResponse                  2784422  0.122000    2783393     1022      5      2       0       0       0       0        0        0
> GridNearGetRequest                         2360483  0.484000    2345937    14129    244    101      64       8       0       0        0        0
> GridNearGetResponse                        1984243  0.054000    1981905     2327      8      1       1       1       0       0        0        0
> GridNearTxPrepareResponse                   192856  0.153000     192660      188      1      5       1       1       0       0        0        0
> GridNearLockResponse                        192780  0.091000     192667      107      3      0       3       0       0       0        0        0
> GridNearTxFinishResponse                       177  0.822000        129       47      1      0       0       0       0       0        0        0
> GridNearAtomicSingleUpdateRequest              124  4.803000         52       53     19      0       0       0       0       0        0        0
> GridNearAtomicUpdateResponse                   120  0.448000        110       10      0      0       0       0       0       0        0        0
> ------------------------------------------------------------------------------------------------------------------------------------------------
>                                         1544912252            1531765132  12965900  59283  28219   40537   45281    6989     568       42      301
>
> Queue waiting time
> Message                                          Total   Avg(ms)       <1ms    <10ms  <30ms  <50ms  <100ms  <250ms  <500ms  <750ms  <1000ms  >1000ms
> ------------------------------------------------------------------------------------------------------------------------------------------------
> GridNearSingleGetRequest                 904311372  0.023000  903130400  1051499  33248  18049   34044   38809    4714     212        8      389
> GridNearSingleGetResponse                340134416  0.041000  339708055   419479   3732   1004    1222     737     117       2        1       67
> GridNearLockRequest                       77088689  0.079000   76949792   129666   2003   1624    2585    2531     390      18        2       78
> GridNearAtomicSingleUpdateInvokeRequest   39645752  0.298000   39572801    53913   3949   2960    5487    5583     963      45        5       46
> GridDhtAtomicSingleUpdateRequest          37636829  0.277000   37561043    57294   3721   2736    5339    5716     879      54        6       41
> GridDhtAtomicDeferredUpdateResponse       33580198  0.002000   33512734    45873   4427   3340    6342    6363    1033      50       10       26
> GridNearTxPrepareRequest                  21667790  0.238000   21584416    76283   1681   1026    1948    2027     330      31        2       46
> GridDhtTxPrepareResponse                  20949873  0.316000   20862632    83578   1214    633     821     817     118      26        8       26
> GridDhtTxPrepareRequest                   20949838  0.501000   20867551    72264   2265   1516    2773    2905     479      38        6       41
> GridDhtTxFinishResponse                   13835065  0.007000   13808274    26487    150     68      55      21       3       1        0        6
> GridDhtTxFinishRequest                    13835028  0.547000   13809380    25142    188    127     102      59       9       2        4       15
> GridNearTxFinishRequest                   13762197  0.725000   13743925    17365    241    171     278     191      16       4        0        6
> GridDhtAtomicNearResponse                  2784422  0.122000    2780600     3767     29      9       7       9       1       0        0        0
> GridNearGetRequest                         2360483  0.484000    2355963     3762    194    114     184     228      38       0        0        0
> GridNearGetResponse                        1984243  0.054000    1979404     4721     31      9      49      28       1       0        0        0
> GridNearTxPrepareResponse                   192856  0.153000     192852        2      0      1       1       0       0       0        0        0
> GridNearLockResponse                        192780  0.091000     192777        3      0      0       0       0       0       0        0        0
> GridNearTxFinishResponse                       177  0.822000        177        0      0      0       0       0       0       0        0        0
> GridNearAtomicSingleUpdateRequest              124  4.803000        124        0      0      0       0       0       0       0        0        0
> GridNearAtomicUpdateResponse                   120  0.448000        119        1      0      0       0       0       0       0        0        0
> ------------------------------------------------------------------------------------------------------------------------------------------------
>                                         1544912252            1542613019   2071099  57073  33387   61237   66024    9091     483       52      787
>
> This will make it possible to understand the cluster load profile and to
> spot increased latency in message processing. In general, all messages
> should be processed in under 30 ms.
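>
> To keep this cheap on the hot path, the bins can be plain lock-free
> counters. A rough sketch (the class name and bin bounds are illustrative,
> not the code from the PR):
>
>     import java.util.concurrent.ConcurrentHashMap;
>     import java.util.concurrent.ConcurrentMap;
>     import java.util.concurrent.atomic.LongAdder;
>
>     /** Illustrative per-message-class time-bin statistics (not the PR code). */
>     class MessageTimeHistogram {
>         /** Upper bin bounds in ms; the last implicit bin catches everything above 1000 ms. */
>         private static final long[] BOUNDS_MS = {1, 10, 30, 50, 100, 250, 500, 750, 1000};
>
>         private final ConcurrentMap<Class<?>, LongAdder[]> bins = new ConcurrentHashMap<>();
>
>         /** Records one observed duration (processing or queue-wait time) for a message class. */
>         void record(Class<?> msgCls, long durationNanos) {
>             LongAdder[] b = bins.computeIfAbsent(msgCls, c -> {
>                 LongAdder[] arr = new LongAdder[BOUNDS_MS.length + 1];
>                 for (int i = 0; i < arr.length; i++)
>                     arr[i] = new LongAdder();
>                 return arr;
>             });
>
>             long ms = durationNanos / 1_000_000;
>
>             // Find the first bin whose upper bound the duration is below.
>             int i = 0;
>             while (i < BOUNDS_MS.length && ms >= BOUNDS_MS[i])
>                 i++;
>
>             b[i].increment();
>         }
>     }
>
> Per message this is one map lookup and one lock-free counter increment,
> which should keep the overhead negligible.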
>
> 2. Log messages which wait too long or are processed for too long (the
> threshold is configurable).
>
> Track the following: message enqueue time, time waited in the queue until
> processing started, the head of the queue before waiting started, the
> queue size before waiting started, total waiting time, total processing
> time, the queue size after processing, and the message content.
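>
> A minimal sketch of such a wrapper around an executor task (the class and
> field names and the threshold are illustrative, mirroring the fields
> above):
>
>     /** Illustrative wrapper measuring queue-wait and processing time of a message task. */
>     class TrackedMessageTask implements Runnable {
>         /** Configurable threshold; 100 ms here just as an example. */
>         private static final long THRESHOLD_NANOS = 100 * 1_000_000L;
>
>         private final Object msg;        // Message being processed.
>         private final Runnable delegate; // Actual message handler.
>         private final long enqueueTs = System.nanoTime(); // Set when the task is enqueued.
>
>         TrackedMessageTask(Object msg, Runnable delegate) {
>             this.msg = msg;
>             this.delegate = delegate;
>         }
>
>         @Override public void run() {
>             long startTs = System.nanoTime();
>
>             delegate.run();
>
>             long endTs = System.nanoTime();
>             long waitTime = startTs - enqueueTs;
>             long procTime = endTs - startTs;
>
>             if (waitTime > THRESHOLD_NANOS || procTime > THRESHOLD_NANOS)
>                 System.out.printf("Slow message: waitTime=%.3f ms, procTime=%.3f ms, message=%s%n",
>                     waitTime / 1e6, procTime / 1e6, msg);
>         }
>     }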
>
> Example of a logged slow message:
>
> Slow message: *enqueueTs*=2018-11-27 15:10:22.241, *waitTime*=0.048,
> *procTime*=305.186, *messageId*=3a3064a9, *queueSzBefore*=0,
> *headMessageId*=null, *queueSzAfter*=0, *message*=GridNearTxFinishRequest
> [miniId=1, mvccSnapshot=null, super=GridDistributedTxFinishRequest
> [topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0],
> futId=199a3155761-f379f312-ad4b-4181-acc5-0aacb3391f07, threadId=296,
> commitVer=null, invalidate=false, commit=true, baseVer=null, txSize=0,
> sys=false, plc=2, subjId=dda703a0-69ee-47cf-9b9a-bf3dc9309feb,
> taskNameHash=0, flags=32, syncMode=FULL_SYNC, txState=IgniteTxStateImpl
> [activeCacheIds=[644280847], recovery=false, mvccEnabled=false,
> txMap=HashSet [IgniteTxEntry [key=KeyCacheObjectImpl [part=8, val=8,
> hasValBytes=true], cacheId=644280847, txKey=IgniteTxKey
> [key=KeyCacheObjectImpl [part=8, val=8, hasValBytes=true],
> cacheId=644280847], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
> oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
> conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
> filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=false,
> entry=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8,
> hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0,
> nodeOrder=0], hash=8, extras=GridCacheObsoleteEntryExtras
> [obsoleteVer=GridCacheVersion [topVer=2147483647, order=0, nodeOrder=0]],
> flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
> [rdrs=ReaderId[] [], part=8, super=], prepared=0, locked=false,
> nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
> flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
> [topVer=154800617, order=1543320621588, nodeOrder=1]], IgniteTxEntry
> [key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true],
> cacheId=644280847, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=11,
> val=11, hasValBytes=true], cacheId=644280847], val=[op=READ, val=null],
> prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null],
> entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
> explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [],
> filtersPassed=false, filtersSet=false, entry=GridCacheMapEntry
> [key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true], val=null,
> ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=11,
> extras=GridCacheObsoleteEntryExtras [obsoleteVer=GridCacheVersion
> [topVer=2147483647, order=0, nodeOrder=0]],
> flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
> [rdrs=ReaderId[] [], part=11, super=], prepared=0, locked=false,
> nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
> flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
> [topVer=154800617, order=1543320621588, nodeOrder=1]]]],
> super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=154800617,
> order=1543320621587, nodeOrder=4], committedVers=null, rolledbackVers=null,
> cnt=0, super=GridCacheIdMessage [cacheId=0]]]]
>
> This will make it possible to understand the type of activity that
> produces latency issues and possibly to identify the culprit code.
>
> 3. Track all synchronous waits while processing a message: checkpoint
> locks, segment locks, page locks, cold page reads from disk, etc.
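>
> For instance, a blocking call could be timed and attributed to the message
> currently being processed on the thread. A sketch (names are illustrative;
> the real integration points live in Ignite internals):
>
>     import java.util.concurrent.locks.Lock;
>
>     /** Illustrative helper attributing blocking waits to the current thread's message. */
>     class SyncWaitTracker {
>         /** Per-thread accumulator of synchronous wait time, reset per message. */
>         private static final ThreadLocal<long[]> WAIT_NANOS =
>             ThreadLocal.withInitial(() -> new long[1]);
>
>         /** Acquires the lock, adding the blocked time to the current message's statistics. */
>         static void lockTracked(Lock lock) {
>             long start = System.nanoTime();
>
>             lock.lock();
>
>             WAIT_NANOS.get()[0] += System.nanoTime() - start;
>         }
>
>         /** Returns and resets the wait time accumulated while processing the current message. */
>         static long resetAndGet() {
>             long[] acc = WAIT_NANOS.get();
>
>             long res = acc[0];
>             acc[0] = 0;
>
>             return res;
>         }
>     }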
>
> 4. Allow extensions to collect additional statistics for specific message
> types. For example, for *TxFinishRequest we could track the total number
> of entries written per cache and their size in bytes. For
> GridNearLockRequest we could track the keys with the longest lock queues
> to identify contention.
>
> 5. Implement additional logging, JMX metrics, and a control.sh command for
> printing the currently collected profiling data.
>
> 6. The implementation should be as lightweight as possible.
>
> Maybe something else.
>
> BTW, I have already started working on this and want to contribute my work
> to the community. The ticket with the (not yet finished) PR is [1]; check
> CacheMessageStatsTest and CacheMessageStatsIndexingTest.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-10418
>
> Thoughts?
>
> --
>
> Best regards,
> Alexei Scherbakov

--
Best regards,
Alexey
