Hi!

> ...
> Again, it's probably a bad idea.

I agree on that, now.
Thank you.

> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 11/01/2012, at 4:56 AM, Roland Gude wrote:
>
>> Each node in the cluster is assigned a token (this can be done
>> automatically, but usually should not be).
>> The token of a node is the start token of the partition it is responsible
>> for (and the token of the next node is the end token of the current
>> token's partition).
>>
>> Assume you have the following nodes/tokens (tokens are usually numbers,
>> but for this example I will use letters):
>>
>> N1/A
>> N2/D
>> N3/M
>> N4/X
>>
>> This means that N1 is responsible (primary) for [A-D),
>> N2 for [D-M),
>> N3 for [M-X),
>> and N4 for [X-A).
>>
>> If you have a replication factor of 1, data will go on the nodes like this:
>>
>> B -> N1
>> E -> N2
>> X -> N4
>>
>> and so on.
>> If you have a higher replication factor, the placement strategy decides
>> which nodes take replicas of which partition (becoming secondary nodes
>> for that partition). SimpleStrategy will just put each replica on the
>> next node in the ring. So, same example as above, but with RF 2 and
>> SimpleStrategy:
>>
>> B -> N1 and N2
>> E -> N2 and N3
>> X -> N4 and N1
>>
>> Other strategies can factor in things like "put data in another
>> datacenter" or "put data in another rack".
>>
>> Even though the terms primary and secondary imply some measure of quality
>> or consistency, this is not the case. If a node is responsible for a
>> piece of data, it will store it.
>>
>> Placement of the replicas is usually only relevant for availability
>> reasons (i.e. disaster recovery etc.).
>> The actual location should mean nothing to most applications, as you can
>> ask any node for the data you want and it will provide it to you
>> (fetching it from the responsible nodes).
>> This should be sufficient in almost all cases.
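[Editor's note: the ring arithmetic Roland describes can be sketched in a few lines of Python. This is only a model of the example above, not Cassandra's implementation; the tokens and node names come from his mail, while the bisect-based lookup is an assumption about how to model the ring.]

```python
from bisect import bisect_right

# Example ring from the mail: (start token, node), sorted by token.
ring = [("A", "N1"), ("D", "N2"), ("M", "N3"), ("X", "N4")]
tokens = [t for t, _ in ring]

def replicas(key_token, rf):
    """Return the nodes holding `key_token` under SimpleStrategy:
    the primary owner plus the next rf-1 nodes walking the ring."""
    # The primary is the node whose start token is the largest one <= key_token.
    # Tokens before the first node wrap around to the last node (range [X-A)),
    # which the -1 index handles naturally in Python.
    i = bisect_right(tokens, key_token) - 1
    return [ring[(i + k) % len(ring)][1] for k in range(rf)]

print(replicas("B", 1))  # ['N1']
print(replicas("E", 2))  # ['N2', 'N3']
print(replicas("X", 2))  # ['N4', 'N1']
```

Running it reproduces the placements in the mail for both the RF 1 and the RF 2 examples.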
>> So, in the above example again, you can ask N3 "what data is available"
>> and it will tell you: B, E and X. Or you could ask it "give me X" and it
>> will fetch X from N4 or N1 or both of them, depending on the consistency
>> configuration, and return the data to you.
>>
>> So if you use Cassandra, the actual storage location of the data should
>> not matter to the application. The data will be available anywhere in the
>> cluster as long as it is stored on some reachable node.
>>
>> From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
>> Sent: Tuesday, 10 January 2012 15:06
>> To: user@cassandra.apache.org
>> Subject: Re: AW: How to control location of data?
>>
>> Hi!
>>
>> Thank you for your last reply. I'm still wondering if I got you right...
>>
>>> A partitioner decides into which partition a piece of data belongs.
>>
>> Does your statement imply that the partitioner does not take any
>> decisions at all on the (physical) storage location? Or, put another way:
>> what do you mean by "partition"?
>>
>> To quote http://wiki.apache.org/cassandra/ArchitectureInternals: "...
>> AbstractReplicationStrategy controls what nodes get secondary, tertiary,
>> etc. replicas of each key range. Primary replica is always determined by
>> the token ring (...)"
>>
>>> You can select different placement strategies and partitioners for
>>> different keyspaces, thereby choosing known data to be stored on known
>>> hosts. This is however discouraged for various reasons, e.g. you need a
>>> lot of knowledge about your data to keep the cluster balanced. What is
>>> your use case for this requirement? There is probably a more suitable
>>> solution.
>>
>> What we want is to partition the cluster with respect to keyspaces.
>> That is, we want to establish an association between nodes and keyspaces
>> so that a node of the cluster holds data from a keyspace if and only if
>> that node is a *member* of that keyspace.
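[Editor's note: Roland's "ask any node" point, i.e. any node can act as coordinator for a request, can be illustrated with a toy model. Everything here is hypothetical, continuing the letter-token example; it is not the Cassandra API.]

```python
# Toy model of coordinated reads in the RF 2 example from the mail.
RF = 2
REPLICAS = {"B": ["N1", "N2"], "E": ["N2", "N3"], "X": ["N4", "N1"]}
STORED = {"N1": {"B", "X"}, "N2": {"B", "E"}, "N3": {"E"}, "N4": {"X"}}

def coordinate_read(coordinator, key, consistency):
    """Any node can coordinate a read: it contacts the replicas of `key`
    and succeeds once `consistency` of them have answered.
    (`coordinator` is deliberately unused here: the point is that any
    node, whether or not it stores the key, can fill the role.)"""
    needed = {"ONE": 1, "QUORUM": RF // 2 + 1, "ALL": RF}[consistency]
    answers = [n for n in REPLICAS[key] if key in STORED[n]][:needed]
    if len(answers) < needed:
        raise RuntimeError("not enough replicas responded")
    return answers  # the nodes the coordinator actually read from

# N3 holds no copy of X itself, yet can serve the request:
print(coordinate_read("N3", "X", "ONE"))     # ['N4']
print(coordinate_read("N3", "X", "QUORUM"))  # ['N4', 'N1']
```

With consistency ONE a single replica answer suffices; QUORUM with RF 2 requires both, matching the "N4 or N1 or both of them" in the mail.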
>> To our knowledge Cassandra has no built-in way to specify such a
>> membership relation. We therefore thought of implementing our own replica
>> placement strategy, until we started to assume that the partitioner would
>> have to be replaced, too, to accomplish the task.
>>
>> Do you have any ideas?
>>
>> From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
>> Sent: Tuesday, 10 January 2012 09:53
>> To: user@cassandra.apache.org
>> Subject: How to control location of data?
>>
>> Hi!
>>
>> We're evaluating Cassandra for our storage needs. One of the key benefits
>> we see is the online replication of the data, that is, an easy way to
>> share data across nodes. But we need to precisely control which group of
>> nodes specific parts of a keyspace (columns/column families) are stored
>> on. Now we're having trouble understanding the documentation. Could
>> anyone help us find answers to our questions?
>>
>> · What does the term "replica" mean: if a key is stored on exactly three
>> nodes in a cluster, is it correct to say that there are three replicas of
>> that key, or are there just two replicas (copies) and one original?
>> · What is the relation between the Cassandra concepts "partitioner" and
>> "replica placement strategy"? According to documentation found on the
>> DataStax web site and the architecture internals page of the Cassandra
>> wiki, the first storage location of a key (and its associated data) is
>> determined by the partitioner, whereas additional storage locations are
>> defined by the replica placement strategy. I'm wondering if I could
>> completely redefine the way nodes are selected to store a key by just
>> implementing my own subclass of AbstractReplicationStrategy and
>> configuring that subclass into the keyspace.
>> · How can I prevent the partitioner from being consulted at all to
>> determine which node stores a key first?
>> · Is a keyspace always distributed across the whole cluster? Is it
>> possible to configure Cassandra in such a way that more or less freely
>> chosen parts of a keyspace (columns) are stored on arbitrarily chosen
>> nodes?
>>
>> Any tips would be very appreciated :-)
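[Editor's note: the keyspace-membership idea Andreas describes can be sketched as follows. This is a conceptual sketch in Python, not Cassandra's Java AbstractReplicationStrategy API; all names are hypothetical, and a real implementation would subclass AbstractReplicationStrategy as discussed above.]

```python
# Hypothetical keyspace-aware placement: replicas for a keyspace are chosen
# only from that keyspace's member nodes.
MEMBERS = {"ks_orders": ["N1", "N2"], "ks_logs": ["N3", "N4"]}

def place(keyspace, key, rf):
    """Pick rf replicas for `key` from the keyspace's member nodes only.
    Keys are spread over members by a toy hash; in real Cassandra the
    primary position would come from the partitioner/token ring, which is
    why the thread suggests the partitioner seems involved as well."""
    members = MEMBERS[keyspace]
    if rf > len(members):
        raise ValueError("replication factor exceeds member count")
    start = sum(map(ord, key)) % len(members)  # toy stand-in for a partitioner
    return [members[(start + k) % len(members)] for k in range(rf)]

print(place("ks_orders", "B", 2))  # ['N1', 'N2']
```

The sketch makes the thread's tension concrete: membership can be enforced entirely inside the placement function, but something still has to map each key to a starting position, which is the partitioner's job.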