Dear Cassandra development team,

We are computer science researchers at the University of Chicago. Our research is about the reliability of cloud-scale distributed systems. Samples of our work can be found here: http://ucare.cs.uchicago.edu

We are reaching out to you because we are interested in reproducing any unsolved scalability bugs in Cassandra. We define scalability bugs as latent bugs that are scale-dependent: they do not arise in small-scale deployments but do arise in large-scale production runs. For example, everything is fine in a 100-node deployment, but the bug appears in a 500-node deployment.

We have created a scale-check methodology (SLCK) that can unearth scalability bugs on a single machine. With SLCK, we can run hundreds of nodes on a single machine and reproduce some old scalability bugs. For example, we have reproduced the following bugs on one machine:

- https://issues.apache.org/jira/browse/CASSANDRA-6127 (a customer observed node flapping when bootstrapping 1000 nodes)
- https://issues.apache.org/jira/browse/CASSANDRA-3831

We are submitting SLCK for publication soon, and we can send you a draft a month from now if you are interested. To make a stronger submission, beyond reproducing old bugs, we thought it would be great if SLCK could reproduce new scalability bugs (if any) that you are still trying to resolve.

We hope you find our work interesting, and we would really appreciate it if you could point us to any new scalability bugs that we could hopefully help you reproduce.

Thank you very much for your attention!

Best,
Tanakorn L.