In one of our load tests, we're incrementing a single counter column and also appending columns to a single row (essentially a timeline). You can think of it as counting the instances of an event and keeping a timeline of those events. The ratio of increments to "appends" is 1:1.
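To make the workload shape concrete, here is a minimal sketch of the per-event pattern. It uses an in-memory stand-in rather than a real Cassandra client, and all names (`FakeStore`, `record_event`, `event_count`, `row-1`) are hypothetical, chosen only for illustration. It shows the 1:1 ratio of operations, not how the cluster executes them.

```python
from collections import defaultdict


class FakeStore:
    """In-memory stand-in for the cluster (hypothetical; not a Cassandra client)."""

    def __init__(self):
        self.counters = defaultdict(int)    # (row_key, column) -> count
        self.timelines = defaultdict(list)  # row_key -> [(timestamp, event), ...]

    def increment(self, row_key, column, delta=1):
        # Stands in for a counter-column increment.
        self.counters[(row_key, column)] += delta

    def append(self, row_key, timestamp, event):
        # Stands in for adding one new column to the timeline row.
        self.timelines[row_key].append((timestamp, event))


def record_event(store, row_key, timestamp, event):
    """One event = one counter increment plus one timeline append (1:1)."""
    store.increment(row_key, "event_count")
    store.append(row_key, timestamp, event)


store = FakeStore()
for ts in range(1000):
    record_event(store, "row-1", ts, f"event-{ts}")
```

Splitting the test then amounts to calling only `increment` or only `append` in the loop and comparing how the backlog behaves in each case.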
When we run this on a test cluster with RF = 3, one node gets backed up with a lot of pending replicate-on-write tasks, eventually maxing out at 4128. We think a disk I/O issue (a lot of reads) is causing the slowdown, but we're still investigating.

A few questions that might speed up understanding the issue:

1. Is there any way to see metadata about the pending replicate-on-write tasks? We're splitting the load test apart to pinpoint which of those operations is causing the problem, but if there's a way to inspect that queue, it might save us some work.
2. I'm assuming that in our case the cause is the counter increments, because disk reads are part of the write path for counters but not for appending columns to a row. Does that logic make sense?

Thanks in advance,
Andrew