RF 6??? Well, the traffic is routing to the other 6 nodes, which likely can serve it given the super high RF, and the newly restarted node is not seeing the traffic until gossip settles?
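
For what it's worth, here is a rough back-of-the-envelope sketch of how much extra read load the other nodes should see if coordinators simply stop picking the restarted node. It is only a model, and everything in it is an assumption: uniform traffic across token ranges, every QUORUM read touching exactly a quorum of replicas (no speculative retry, no read repair), and the restarted node being skipped entirely. The function names are made up for illustration and are not Cassandra APIs.

```python
from fractions import Fraction


def quorum(rf: int) -> int:
    """Replicas a QUORUM read must contact: floor(RF / 2) + 1."""
    return rf // 2 + 1


def per_node_read_share(nodes: int, rf: int, excluded: int = 0) -> Fraction:
    """Average replica reads landing on each serving node per client read.

    Assumes (hypothetically) uniform traffic and that each client read
    touches exactly quorum(rf) replicas, spread evenly over the nodes
    that are still being chosen by coordinators.
    """
    serving = nodes - excluded
    if rf - excluded < quorum(rf):
        raise ValueError("not enough replicas left to satisfy QUORUM")
    return Fraction(quorum(rf), serving)


# Cluster from this thread: 7 nodes, RF 6, QUORUM reads.
baseline = per_node_read_share(nodes=7, rf=6)              # all 7 nodes serving
degraded = per_node_read_share(nodes=7, rf=6, excluded=1)  # restarted node skipped
increase = degraded / baseline - 1

print(f"QUORUM size:          {quorum(6)}")
print(f"baseline share/node:  {baseline}")
print(f"degraded share/node:  {degraded}")
print(f"relative increase:    {increase} (~{float(increase):.0%})")
```

Under those assumptions the remaining six nodes each pick up about one sixth more reads (the restarted node's roughly one-seventh share of the work spread over six nodes). That is far short of the 2-3x reported, so routing around one node would not explain the graphs on its own.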

On Fri, Sep 6, 2024, 2:54 PM Jeff Jirsa <jji...@gmail.com> wrote:

> The unfortunate reality here is I don’t think anyone is going to be able to answer with the data provided.
>
> Are the disk IOPS from cassandra reads? Or compaction? Or repair? Do they ramp with client reads (is that curve matching your customer traffic?)?
> Are they from client data reads or from internal reads (e.g. schema and auth from client reconnects)?
> Are they the first reads, or read repair?
>
> If this were my cluster, I’d be looking at the rest of the graphs to try to tell what “else” was happening beyond high read IOPS. If nothing stood out, I would have taken a stack trace to try to see what those nodes were doing at the time, vs what they’re doing “normally”.
>
> On Sep 6, 2024, at 12:29 PM, Pradeep Badiger <pradeepbadi...@fico.com> wrote:
>
> Thanks, Jeff. We use QUORUM consistency for reads and writes. Even we are clueless as to why such an issue could occur. Do you think restarting again and running the full repair on the node would help?
>
> *From:* Jeff Jirsa <jji...@gmail.com>
> *Sent:* Friday, September 6, 2024 2:03 PM
> *To:* cassandra <user@cassandra.apache.org>
> *Cc:* Pradeep Badiger <pradeepbadi...@fico.com>
> *Subject:* [EXTERNAL] Re: Cassandra 3.11 - below normal disk read after restart
>
> *CAUTION:* This email originated from outside the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
> If they went up by 1/7th, could potentially assume it was something related to the snitch not choosing the restarted host. They went up by a lot (2-3x?). What consistency level do you use for reads and writes, and do you have graphs for local reads / hint delivery? (I’m GUESSING that you’re seeing extra read repair or some other multiplier kick in, but it doesn’t make a lot of sense to be honest).
>
> On Sep 6, 2024, at 9:47 AM, Pradeep Badiger via user <user@cassandra.apache.org> wrote:
>
> Hi,
>
> We are using Cassandra 3.11 with a cluster of 7 nodes and replication of 6 with most of the default configurations. During a recent maintenance window, one of the nodes was restarted. The node came back up normal, with no errors of any sort. But when the application started using the cluster, we found *below-normal* disk io read rates on the node that was restarted, and other nodes in the cluster reported *above-normal* disk io read rates. This difference became significant causing alerts to get reported by the monitoring system. As a measure to resolve the issue the application was stopped and the entire cluster was restarted after which all 7 nodes reported almost the same read rates.
>
> <image001.png>
> *Figure 1 - After the node 53 was restarted.*
>
> <image002.png>
> *Figure 2 - After the entire cluster restart.*
>
> The node in question was not down for a very long time. Is there any specific reason the read rates would differ like this? Is there a way to resolve this without restarting the entire cluster?
>
> Thanks,
> Pradeep V.B.
>
> This email and any files transmitted with it are confidential, proprietary and intended solely for the individual or entity to whom they are addressed. If you have received this email in error please delete it immediately.