Unfortunately, I don’t think anyone is going to be able to answer this with the data provided.

Are the disk IOPS from Cassandra reads? Or compaction? Or repair? Do they ramp with client reads (does that curve match your customer traffic)? Are they from client data reads or from internal reads (e.g. schema and auth from client reconnects)? Are they the first reads, or read repair?

If this were my cluster, I’d be looking at the rest of the graphs to try to tell what “else” was happening beyond high read IOPS. If nothing stood out, I would have taken a stack trace to try to see what those nodes were doing at the time, vs what they’re doing “normally”.

> On Sep 6, 2024, at 12:29 PM, Pradeep Badiger <pradeepbadi...@fico.com> wrote:
>
> Thanks, Jeff. We use QUORUM consistency for reads and writes. Even we are
> clueless as to why such an issue could occur. Do you think restarting again
> and running the full repair on the node would help?
>
> From: Jeff Jirsa <jji...@gmail.com>
> Sent: Friday, September 6, 2024 2:03 PM
> To: cassandra <user@cassandra.apache.org>
> Cc: Pradeep Badiger <pradeepbadi...@fico.com>
> Subject: [EXTERNAL] Re: Cassandra 3.11 - below normal disk read after restart
>
> If they went up by 1/7th, could potentially assume it was something related
> to the snitch not choosing the restarted host. They went up by a lot (2-3x?).
> What consistency level do you use for reads and writes, and do you have
> graphs for local reads / hint delivery? (I’m GUESSING that you’re seeing
> extra read repair or some other multiplier kick in, but it doesn’t make a lot
> of sense to be honest).
>
> On Sep 6, 2024, at 9:47 AM, Pradeep Badiger via user
> <user@cassandra.apache.org> wrote:
>
> Hi,
>
> We are using Cassandra 3.11 with a cluster of 7 nodes and replication of 6
> with most of the default configurations. During a recent maintenance window,
> one of the nodes was restarted. The node came back up normal, with no errors
> of any sort. But when the application started using the cluster, we found
> below-normal disk io read rates on the node that was restarted, and other
> nodes in the cluster reported above-normal disk io read rates. This
> difference became significant causing alerts to get reported by the
> monitoring system. As a measure to resolve the issue the application was
> stopped and the entire cluster was restarted after which all 7 nodes reported
> almost the same read rates.
>
> <image001.png>
> Figure 1 - After the node 53 was restarted.
>
> <image002.png>
> Figure 2 - After the entire cluster restart.
>
> The node in question was not down for a very long time. Is there any specific
> reason the read rates would differ like this? Is there a way to resolve this
> without restarting the entire cluster?
>
> Thanks,
> Pradeep V.B.
>
> This email and any files transmitted with it are confidential, proprietary
> and intended solely for the individual or entity to whom they are addressed.
> If you have received this email in error please delete it immediately.
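
As a rough sanity check on the "went up by 1/7th" reasoning quoted in the thread, here is a minimal Python sketch of the arithmetic. It is an illustration, not a model of Cassandra internals. It assumes evenly distributed token ownership, that every QUORUM read touches exactly a quorum of replicas, and that coordinators avoid the restarted node entirely while it is deprioritized. The function names are hypothetical and exist only for this example.

```python
from fractions import Fraction


def quorum(rf: int) -> int:
    """Replicas contacted by a QUORUM read: floor(rf / 2) + 1."""
    return rf // 2 + 1


def per_node_read_share(nodes: int, rf: int, avoided: int = 0) -> Fraction:
    """Average replica reads per client read landing on each node that
    still receives reads, assuming reads spread evenly across them."""
    return Fraction(quorum(rf), nodes - avoided)


# Values taken from the thread: 7 nodes, replication factor 6.
NODES, RF = 7, 6

normal = per_node_read_share(NODES, RF)              # all 7 nodes serve reads
degraded = per_node_read_share(NODES, RF, avoided=1)  # restarted node skipped
increase = degraded / normal - 1

print("QUORUM size:", quorum(RF))
print("Per-node share, normal:", normal)
print("Per-node share, one node avoided:", degraded)
print("Relative increase on the other nodes:", increase)
```

Under these assumptions the other six nodes would see their read load rise by about one sixth, which is the same order as the 1/7th mentioned above and far short of the 2-3x seen in the graphs. That gap is why something beyond replica selection (read repair, compaction, or another multiplier) looks like the more likely explanation.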