Great addition! Thanks, Marcus. +1 for cassandra-compare, as Jeremy suggested.

We could also consider other features, such as:

- Comparing only the row counts of two tables. In some cases that is enough to confirm that a copy is OK.
- Diffing only a set of partitions. This avoids comparing the full data set when volumes are large and a subset is enough to be confident in the copy.

Thanks

On Thu, Aug 22, 2019 at 09:49, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:

> It’s great to contribute such a tool. The change between 2.x and 3.0
> brought a translation layer from thrift to cql that is hard to validate on
> real clusters without something like this. Thank you.
>
> As for naming, perhaps cassandra-compare might be clearer as diff is an
> overloaded word but that’s a bikeshed sort of argument.
>
> > On Aug 22, 2019, at 12:32 AM, Vinay Chella <vinaykumar...@gmail.com> wrote:
> >
> > This is a great addition to our Cassandra validation framework/tools. I can
> > see many teams in the community get benefited from tooling like this.
> >
> > I like the idea of the generic repo (repos/asf/cassandra-contrib.git
> > or *whatever the name is*) for tools like this, for the following 2 main reasons.
> >
> > 1. Easily accessible/ reachable/ searchable
> > 2. Welcomes community in Cassandra ecosystem to contribute more easily
> >
> > Thanks,
> > Vinay Chella
> >
> >> On Wed, Aug 21, 2019 at 11:39 PM Marcus Eriksson <marc...@apache.org> wrote:
> >>
> >> Hi, we are about to open source our tooling for comparing two cassandra
> >> clusters and want to get some feedback where to push it. I think the
> >> options are: (name bike-shedding welcome)
> >>
> >> 1. create repos/asf/cassandra-diff.git
> >> 2. create a generic repos/asf/cassandra-contrib.git where we can add more
> >>    contributed tools in the future
> >>
> >> Temporary location: https://github.com/krummas/cassandra-diff
> >>
> >> Cassandra-diff is a spark job that compares the data in two clusters - it
> >> pages through all partitions and reads all rows for those partitions in
> >> both clusters to make sure they are identical. Based on the configuration
> >> variable “reverse_read_probability” the rows are either read forward or in
> >> reverse order.
> >>
> >> Our main use case for cassandra-diff has been to set up two identical
> >> clusters, transfer a snapshot from the cluster we want to test to these
> >> clusters and upgrade one side. When that is done we run this tool to make
> >> sure that 2.1 and 3.0 gives the same results. A few examples of the bugs we
> >> have found using this tool:
> >>
> >> * CASSANDRA-14823: Legacy sstables with range tombstones spanning multiple
> >>   index blocks create invalid bound sequences on 3.0+
> >> * CASSANDRA-14803: Rows that cross index block boundaries can cause
> >>   incomplete reverse reads in some cases
> >> * CASSANDRA-15178: Skipping illegal legacy cells can break reverse
> >>   iteration of indexed partitions
> >>
> >> /Marcus
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> >> For additional commands, e-mail: dev-h...@cassandra.apache.org

--
Regards,
Ahmed ELJAMI
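
PS: here is a rough sketch of the two ideas from my message (a count-only check and a diff restricted to a set of partitions). It is plain Python over in-memory dictionaries, not code from cassandra-diff. The names `Table`, `count_matches`, `diff_partitions` and `sample_keys` are made up for illustration, and a real implementation would read from the two clusters instead.

```python
import random
from typing import Dict, Iterable, List, Tuple

# Hypothetical in-memory stand-in for a table:
# partition key -> ordered list of (clustering key, value) rows.
Table = Dict[str, List[Tuple[int, str]]]


def count_matches(source: Table, target: Table) -> bool:
    """Cheap check: same number of partitions and same total number of rows."""
    def row_count(table: Table) -> int:
        return sum(len(rows) for rows in table.values())

    return len(source) == len(target) and row_count(source) == row_count(target)


def diff_partitions(source: Table, target: Table, keys: Iterable[str]) -> List[str]:
    """Return the keys from `keys` whose rows differ between the two tables."""
    return [key for key in keys if source.get(key) != target.get(key)]


def sample_keys(table: Table, fraction: float, seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of partition keys to diff."""
    keys = sorted(table)
    if not keys:
        return []
    size = max(1, int(len(keys) * fraction))
    return random.Random(seed).sample(keys, size)


# Two small tables with identical counts but one differing row.
source: Table = {"a": [(1, "x")], "b": [(1, "y"), (2, "z")]}
target: Table = {"a": [(1, "x")], "b": [(1, "y"), (2, "CHANGED")]}

print(count_matches(source, target))               # True
print(diff_partitions(source, target, ["a", "b"]))  # ['b']
```

In this example the counts match even though partition "b" differs, so the count check only works as a quick first pass. The partition-subset diff catches the difference as long as the affected partition is in the chosen set.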