> Sorry for the slow reply, it's been crunch time on the 1.1 freeze...

Not a problem--thanks for the response!

> What's a good starting point to get a feel for what you've added? Is
> it PBSTracker?

PBSTracker is indeed a good place to start. The class stores and processes the latencies we care about for PBS. nodetool simply calls into the get*latencies() methods, while the ResponseHandlers call the startOperation and log{Read/Write}Response methods. There's nothing too magical. The PBS analysis code is in pbs/analyze_pbs.py and pbs/pbs_utils.py, which we kept separate for patch readability but could easily rewrite in Java as part of nodetool or similar.
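In case it helps, the call pattern looks roughly like the following. This is a simplified, illustrative sketch: the method names follow the patch, but the signatures and internals here are approximations rather than the actual code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified sketch of the tracker; signatures are illustrative only.
    public class PBSTracker
    {
        // operation id -> coordinator-side start time, in nanoseconds
        private final Map<String, Long> startTimes = new ConcurrentHashMap<>();

        private final List<Long> writeLatencies = new ArrayList<>();
        private final List<Long> readLatencies = new ArrayList<>();

        // Called by the ResponseHandlers when the coordinator starts an operation.
        public void startOperation(String operationId)
        {
            startTimes.put(operationId, System.nanoTime());
        }

        // Called when a replica's write acknowledgement arrives.
        public synchronized void logWriteResponse(String operationId)
        {
            Long start = startTimes.get(operationId);
            if (start != null)
                writeLatencies.add(System.nanoTime() - start);
        }

        // Called when a replica's read response arrives.
        public synchronized void logReadResponse(String operationId)
        {
            Long start = startTimes.get(operationId);
            if (start != null)
                readLatencies.add(System.nanoTime() - start);
        }

        // nodetool pulls the recorded latencies back out through getters like these.
        public synchronized List<Long> getWriteLatencies()
        {
            return new ArrayList<>(writeLatencies);
        }

        public synchronized List<Long> getReadLatencies()
        {
            return new ArrayList<>(readLatencies);
        }
    }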
> Is this different conceptually from something like
> https://issues.apache.org/jira/browse/CASSANDRA-1123, other than that
> obviously you're specifically concerned with PBS-related metrics?

It doesn't appear that the Cassandra-specific tweaks we've made are conceptually different from the patch you link to. Our patch performs coarser-granularity measurements than the CASSANDRA-1123 patch, splitting each per-replica operation time into (time spent sending the message and processing it at the replica) and (time spent waiting for a response). An important difference between the two patches is that we determine the latter latency at the coordinator by having the replica store the acknowledgement creation time in the acknowledgement itself; it looks like the patch you linked logs this creation time locally, requiring some distributed log parsing to reconstruct the latencies. This reconstruction is definitely doable. The trade-off is between the space each message needs for the timestamp and the complexity of reconstructing latencies from logs.

Thanks!
Peter
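P.S. To make the timestamp scheme concrete, the coordinator-side arithmetic is roughly the following. The names are made up for illustration and clock skew between coordinator and replica is ignored here, so treat this as a sketch of the idea rather than the patch itself.

    // Sketch only: illustrative names, clock skew between nodes ignored.
    public final class ReplicaLatencySplit
    {
        // operationStartNanos: recorded by the coordinator when it sends the message.
        // ackCreationNanos:    stamped by the replica inside the acknowledgement.
        // ackReceivedNanos:    recorded by the coordinator when the ack arrives.
        // Returns { send + replica processing, time spent waiting for the response }.
        public static long[] split(long operationStartNanos,
                                   long ackCreationNanos,
                                   long ackReceivedNanos)
        {
            long sendAndProcess = ackCreationNanos - operationStartNanos;
            long responseWait = ackReceivedNanos - ackCreationNanos;
            return new long[] { sendAndProcess, responseWait };
        }
    }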