I'm just becoming aware of the restrictions of using the
order-preserving partitioner (OPP) as compared to the random
partitioner. Please let me know if I understand this correctly.
First off, if you use OPP only to speed up range queries, it will
probably be very hard to predict whether you will end up with hotspots,
and thus where and how the data will be clustered on a particular node.
This is because the keys of the various CFs may or may not have any
correlation with one another. So, in effect, you just have a big mess
of keys of various ranges and formats, all partitioned according to one
global set of tokens that applies to ALL CFs of ALL keyspaces.
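To make that concern concrete, here is a toy sketch in Python (not
Cassandra code). It assumes a hypothetical four-node ring whose tokens
are plain strings, and that a key belongs to the first node whose token
sorts at or after the key, wrapping around at the end. The node names,
tokens, CF names, and keys are all made up for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical 4-node ring: one global, sorted list of string tokens
# shared by every CF in every keyspace.
TOKENS = ["9", "M", "f", "z"]
NODES = ["node-1", "node-2", "node-3", "node-4"]

def owner(key):
    """Return the node whose token is the first one >= key (with wraparound)."""
    i = bisect_left(TOKENS, key)
    return NODES[i % len(NODES)]

# Made-up row keys from three unrelated column families, each with its
# own key format.
keys_by_cf = {
    "Users": ["alice", "bob", "carol", "dave"],
    "Events": ["20100601-0001", "20100601-0002", "20100602-0001"],
    "Orders": ["1001", "1002", "1003"],
}

# Count how many rows each node ends up holding across all CFs.
load = Counter(owner(k) for keys in keys_by_cf.values() for k in keys)
```

In this toy ring, every numeric-looking key sorts before "9" and every
lowercase username sorts between "M" and "f", so two nodes hold all the
rows and the other two hold none.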
[main reason for post below...]
OTOH, if you want to use OPP to purposely cluster certain data together
on specific nodes, such as for geographic partitioning, then you have to
choose a prefix for all of the keys of ALL CFs and ALL keyspaces! This
is because they will all be partitioned based on the tokens assigned to
the nodes. IOW, if I had two datacenters, one in the US and another in
Europe, then for all rows in all KSs and in all CFs, I would need to
prepend a prefix to the keys, such as "US:" and "EU:". The problem is I
may not want ALL of my CFs to be partitioned this way; only specific
ones. Also, it may be very difficult if not impossible for all keys of
all keyspaces and CFs to use keys of this form. I'm not sure if Cass is
designed for this.
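Here is a small Python sketch of the prefix idea, again a toy model
rather than Cassandra code. It assumes a hypothetical two-node ring
with string tokens "EU;" and "US;" (";" sorts just after ":" in ASCII,
so every "EU:..." key sorts before "EU;"), and the same "first token at
or after the key" ownership rule. The keys are invented.

```python
from bisect import bisect_left

# Hypothetical ring: one node per datacenter. Tokens are chosen so that
# all "EU:"-prefixed keys sort at or before "EU;" and all "US:"-prefixed
# keys sort at or before "US;".
RING = [("EU;", "node-eu"), ("US;", "node-us")]
TOKENS = [token for token, _ in RING]

def owner_opp(key):
    """Order-preserving lookup: first node whose token is >= key, wrapping."""
    i = bisect_left(TOKENS, key)
    return RING[i % len(RING)][1]

# Prefixed keys land where intended.
eu_row = owner_opp("EU:alice")   # node-eu
us_row = owner_opp("US:bob")     # node-us

# Unprefixed keys from some other CF are still placed by the same tokens.
plain_name = owner_opp("alice")  # sorts after "US;", wraps to node-eu
plain_id = owner_opp("12345")    # sorts before "EU;", so node-eu
```

The last two lookups show the problem I am describing: keys from CFs
that were never meant to be partitioned geographically are still placed
by the same tokens, and in this toy ring they all pile up on one node.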
However, with the random partitioner there is no such problem. You can
use keys of any type you want (UTF8, Long, etc.) since every key is
hashed before deciding which node gets the row.
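For comparison, a rough Python sketch of hash-based placement. The
random partitioner derives tokens from an MD5 hash of the key; the
exact token arithmetic below (reducing the digest modulo 2**127) and
the evenly spaced four-node ring are simplifying assumptions of mine,
not Cassandra's actual code.

```python
import hashlib
import struct
from bisect import bisect_left

# Hypothetical ring of four nodes with evenly spaced integer tokens.
NODES = ["node-a", "node-b", "node-c", "node-d"]
TOKEN_SPACE = 2 ** 127
RING_TOKENS = [(i + 1) * TOKEN_SPACE // len(NODES) for i in range(len(NODES))]

def token_for(key_bytes):
    """Map raw key bytes to an integer token via MD5 (simplified)."""
    digest = hashlib.md5(key_bytes).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE

def owner_random(key_bytes):
    """First node whose token is >= the key's hashed token, wrapping."""
    i = bisect_left(RING_TOKENS, token_for(key_bytes))
    return NODES[i % len(NODES)]

# The key's type and format do not matter, only its bytes.
utf8_key = "US:alice".encode("utf-8")
long_key = struct.pack(">q", 42)  # a 64-bit big-endian integer key

utf8_owner = owner_random(utf8_key)
long_owner = owner_random(long_key)
```

Because placement depends only on the hash, a key prefix like "US:" has
no influence on which node gets the row, which is exactly why this
approach cannot be used for geographic clustering.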
Do I understand things correctly or am I missing something? Is Cass
designed to use OPP this way or am I hacking it? If it is a hack, is
there an accepted way to do geographic partitioning?
Also, what is OPP really good for?
Thanks!