On Mon, Feb 24, 2014 at 11:47 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>> I still have some questions regarding the mapping. Please bear with me
>>>> if these are stupid questions. I am quite new to Cassandra.
>>>>
>>>> The basic Cassandra data model for a keyspace is something like this,
>>>> right?
>>>>
>>>> SortedMap<byte[], SortedMap<byte[], Pair<Long, byte[]>>>
>>>>           ^ row key: determines which server(s) the rest is stored on
>>>>                               ^ column key
>>>>                                            ^ timestamp (latest one wins)
>>>>                                                  ^ value (can be size 0)
>>>
>>> It's a reasonable way to think of how things are stored internally, yes.
>>> Though as DuyHai mentioned, the first map is really sorted by token, and
>>> in practice that means you mostly use the sorting of the second map.
>>
>> Yes, understood.
>>
>> So the first SortedMap is sorted on some kind of hash of the actual key
>> to make sure the data gets evenly distributed across the nodes? What if my
>> key is already a good hash: is there a way to use an identity function as
>> a hash function (in CQL)?
>
> It's possible, yes. The hash function we're talking about is what
> Cassandra calls "the partitioner". You configure the partitioner in the
> yaml config file, and there is one partitioner, ByteOrderedPartitioner,
> that is basically the identity function.
> However, we usually discourage users from using it, because the
> partitioner is global to a cluster and cannot be changed (you basically
> pick it at cluster creation time and are stuck with it until the end of
> time), and because ByteOrderedPartitioner can easily lead to hotspots in
> the data distribution if you're not careful. For those reasons, the
> default partitioner is also much better tested, and I can't remember
> anyone mentioning that the partitioner has been a bottleneck.

Thanks for the info. I thought that this might be possible to adjust on a
per-keyspace level. But if it can only be set globally, then I will leave
it alone.
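For anyone following along, the conceptual model quoted at the top of the thread can be sketched in plain Java. This is just a toy in-memory illustration of the mental model (the names ToyColumnFamily and Cell are my own invention, not Cassandra internals); the one behavior worth noting is per-cell conflict resolution, where the latest timestamp wins:

```java
import java.util.*;

// Toy sketch of the model from above:
// SortedMap<rowKey, SortedMap<columnKey, (timestamp, value)>>.
// Illustrative only -- this is NOT Cassandra's actual storage engine.
public class ToyColumnFamily {
    // Cassandra compares keys as unsigned bytes; this comparator stands in for that.
    static final Comparator<byte[]> BYTES = Arrays::compareUnsigned;

    // A cell holds a write timestamp plus a value (which may be empty).
    record Cell(long timestamp, byte[] value) {}

    final SortedMap<byte[], SortedMap<byte[], Cell>> rows = new TreeMap<>(BYTES);

    // Writes resolve per column: the cell with the latest timestamp wins.
    void put(byte[] rowKey, byte[] columnKey, long timestamp, byte[] value) {
        SortedMap<byte[], Cell> row =
            rows.computeIfAbsent(rowKey, k -> new TreeMap<>(BYTES));
        Cell existing = row.get(columnKey);
        if (existing == null || existing.timestamp() < timestamp) {
            row.put(columnKey, new Cell(timestamp, value));
        }
    }
}
```

In the real system the outer map is keyed by the partitioner's token rather than the raw row key, which is exactly why the inner (column) ordering is the one you get to exploit.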
Other than the (probably negligible) performance impact of hashing the hash
again, there is nothing wrong with doing so. Hashing a SHA-1 hash will give
a good distribution.

Anyway, this is getting a bit off-topic.

Cheers,
Rüdiger
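A quick sanity check of that claim, for the curious: the sketch below takes keys that are already SHA-1 digests, hashes them a second time, and buckets the result. The bucket function (folding the leading digest bytes, mod 16) is an illustrative stand-in of my own, not Cassandra's Murmur3 partitioner; the point is only that re-hashing an already-good hash still spreads keys evenly.

```java
import java.security.MessageDigest;

// Demo: re-hashing keys that are already SHA-1 hashes still yields an
// even spread across buckets. The bucketing scheme here is made up for
// illustration and is not what Cassandra's partitioner does.
public class RehashDemo {
    public static int[] bucketCounts(int keys, int buckets) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        int[] counts = new int[buckets];
        for (int i = 0; i < keys; i++) {
            // Pretend the application key is itself a SHA-1 hash...
            byte[] alreadyHashedKey = sha1.digest(("key-" + i).getBytes());
            // ...then hash that hash again, as discussed above.
            byte[] rehashed = sha1.digest(alreadyHashedKey);
            // Fold the leading bytes into a bucket index.
            int h = ((rehashed[0] & 0xFF) << 24) | ((rehashed[1] & 0xFF) << 16)
                  | ((rehashed[2] & 0xFF) << 8) | (rehashed[3] & 0xFF);
            counts[Math.floorMod(h, buckets)]++;
        }
        return counts;
    }
}
```

With, say, 10000 keys over 16 buckets, every bucket ends up close to the expected 625 entries.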