BTW, one other thing that I have not been able to debug today; maybe someone here can help me with it:
I am using a three-node Cassandra cluster with Vagrant. The nodes in my
cluster are 192.168.200.11, 192.168.200.12, and 192.168.200.13. If I use
cqlsh to connect to 192.168.200.11, I see a unique set of tokens from each
of the following three queries:

    select tokens from system.local;
    select tokens from system.peers where peer = '192.168.200.12';
    select tokens from system.peers where peer = '192.168.200.13';

This is what I expect. However, I then wrote an application with the Java
driver that does the following:

- Creates a Session by connecting to 192.168.200.11
- From that session, runs "select tokens from system.local"
- From that session, runs "select tokens, peer from system.peers"

Now I get the exact same set of tokens from system.local and from the row
in system.peers in which peer = 192.168.200.13. Does anyone have any idea
why this would happen? I'm not sure how to debug it. I see the following
log messages from the Java driver:

    14/03/30 19:05:24 DEBUG com.datastax.driver.core.Cluster: Starting new cluster with contact points [/192.168.200.11]
    14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra host /192.168.200.13 added
    14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra host /192.168.200.12 added

I'm running Cassandra 2.0.6 in the virtual machines, and I built my
application with version 2.0.1 of the driver.
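In case it helps anyone reproduce this, here is a trimmed-down sketch of
what the application does. This is not my exact code (the class name is
made up and error handling is omitted), but the driver calls are the same:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class TokenCheck {
      public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("192.168.200.11")
            .build();
        Session session = cluster.connect();

        // Tokens that the contact point reports for itself.
        Row local = session.execute("select tokens from system.local").one();
        System.out.println("local: " + local.getSet("tokens", String.class));

        // Tokens that the contact point reports for the other nodes.
        for (Row row : session.execute("select tokens, peer from system.peers")) {
          System.out.println(row.getInet("peer") + ": "
              + row.getSet("tokens", String.class));
        }

        cluster.shutdown();
      }
    }

The loop over system.peers is where I see 192.168.200.13 come back with
the same token set that system.local reported.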
Best regards,
Clint

On Sun, Mar 30, 2014 at 4:51 PM, Clint Kelly <clint.ke...@gmail.com> wrote:
> Hi all,
>
> I am working on a Hadoop InputFormat implementation that uses only the
> native-protocol Java driver and not the Thrift API. I am currently
> trying to replicate some of the behavior of
> *Cassandra.client.describe_ring(myKeyspace)* from the Thrift API. I
> would like to do the following:
>
> - Get a list of all of the token ranges for the cluster
> - For every token range, determine the replica nodes on which the data
>   in the token range resides
> - Estimate the number of rows for every range of tokens
> - Group ranges of tokens on common replica nodes such that we can
>   create a set of input splits for Hadoop with total estimated row
>   counts that are reasonably close to the requested split size
>
> Last week I received some much-appreciated help on this list that
> pointed me to using the system.peers table to get the list of token
> ranges for the cluster and the corresponding hosts. Today I created a
> three-node C* cluster in Vagrant
> (https://github.com/dholbrook/vagrant-cassandra) and tried inspecting
> some of the system tables. I have a couple of questions now:
>
> 1. *How many total unique tokens should I expect to see in my cluster?*
> If I have three nodes, and each node has a cassandra.yaml with
> num_tokens = 256, then should I expect a total of 256 * 3 = 768
> distinct vnodes?
>
> 2. *How does the creation of vnodes and their assignment to nodes
> relate to the replication factor for a given keyspace?* I never thought
> about this until today, and I tried to reread the documentation on
> virtual nodes, replication in Cassandra, etc., and now I am sadly still
> confused. Here is what I think I understand. :)
>
> - Given a row with a partition key, any client request for an operation
>   on that row will go to a coordinator node in the cluster.
> - The coordinator node will compute the token value for the row and
>   from that determine a set of replica nodes for that token.
> - One of the replica nodes, I assume, is the node that "owns" the vnode
>   whose token range encompasses that token.
> - The identity of the "owner" of this virtual node is a cross-keyspace
>   property.
> - The other replicas were originally chosen based on the
>   replica-placement strategy.
> - Therefore the other replicas will be different for each keyspace
>   (because replication factors and replica-placement strategies are
>   properties of a keyspace).
>
> 3. What do the values in the "tokens" column in system.peers and
> system.local refer to, then?
>
> - Since these tables appear to be global, and not per-keyspace, I
>   assume that they don't contain any information about replication; is
>   that correct?
> - If I have three nodes in my cluster, 256 vnodes per node, and I'm
>   using the Murmur3 partitioner, should I then expect the 768 values of
>   "tokens" across system.peers and system.local to be roughly evenly
>   distributed between -2^63 and 2^63 - 1?
>
> 4. Is there any other way, without using Thrift, to get as much
> information as possible about which nodes hold replicas of the data in
> each token range of a given cluster?
>
> I really appreciate any help, thanks!
>
> Best regards,
> Clint
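P.S. To make question 4 in my quoted message concrete: here is a rough,
hypothetical sketch of how I imagine turning the token sets from
system.local and system.peers into a sorted ring that maps each token
range to its primary replica under the Murmur3 partitioner. None of these
names come from the driver; they are just for illustration. As far as I
can tell, this only identifies the node that owns each range; the full
replica sets still depend on each keyspace's replication strategy, which
these tables don't describe. Corrections welcome if my understanding is
off.

    import java.net.InetAddress;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.TreeMap;

    public class RingSketch {
      // Build a sorted token -> endpoint map from the token sets read out
      // of system.local (for the contact point) and system.peers (for the
      // other nodes).
      public static NavigableMap<Long, InetAddress> buildRing(
          Map<InetAddress, Set<String>> tokensByHost) {
        NavigableMap<Long, InetAddress> ring =
            new TreeMap<Long, InetAddress>();
        for (Map.Entry<InetAddress, Set<String>> entry
            : tokensByHost.entrySet()) {
          for (String token : entry.getValue()) {
            // Murmur3 tokens are signed 64-bit values, so a long holds them.
            ring.put(Long.parseLong(token), entry.getKey());
          }
        }
        return ring;
      }

      // In sorted order, each token t_i is the upper bound of the range
      // (t_{i-1}, t_i], and its owner is the primary replica for that
      // range; tokens past the highest entry wrap around to the lowest.
      public static InetAddress primaryReplicaFor(
          NavigableMap<Long, InetAddress> ring, long token) {
        Map.Entry<Long, InetAddress> entry = ring.ceilingEntry(token);
        return entry != null ? entry.getValue()
                             : ring.firstEntry().getValue();
      }
    }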