> and compare them, but at this point I need to focus on one to get
> things working, so I'm trying to make a best initial guess.
I would go for RP then. BOP may look like less work to start with, but it *will* 
bite you later. If you use an increasing version number as a key you will get a 
hot spot. Get it working with RP and standard CFs, accept the extra lookups, 
and then see where you are performance / complexity wise. Cassandra can be 
pretty fast.

I still don't really understand the problem, but I think you have many lists of 
names and when each list is updated you consider it a version. 

You then want to answer a query such as "get all the names between foo and bar 
that were written to between version 100 and 200". Can this query be 
re-written as "get all the names between foo and bar that existed at version 
200 and were created on or after version 100" ?

Could you re-write the entire list every version update?

CF: VersionedList
row: <list_name:version>
col_name: name
col_value: last updated version

So you slice one row at the upper version and discard all the columns where the 
value is less than the lower version ? 
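As a rough sketch of that idea (plain Python dicts standing in for rows, no 
real client API, names all illustrative):

```python
# VersionedList sketch: row key = (list_name, version),
# columns = name -> last-updated version. The whole list is
# re-written on every version update.

def write_version(cf, list_name, version, names, previous_version=None):
    """Re-write the entire list for a new version: copy the prior row,
    then bump the last-updated version for names changed in this update."""
    row = dict(cf.get((list_name, previous_version), {}))
    for name in names:
        row[name] = version  # col_value = last updated version
    cf[(list_name, version)] = row

def query(cf, list_name, upper_version, lower_version, start_name, end_name):
    """Slice one row at the upper version, keep names in [start, end],
    and discard columns whose value is below the lower version."""
    row = cf.get((list_name, upper_version), {})
    return sorted(
        name for name, updated in row.items()
        if start_name <= name <= end_name and updated >= lower_version
    )

cf = {}
write_version(cf, "mylist", 100, ["alpha", "bar", "foo"])
write_version(cf, "mylist", 150, ["bar"], previous_version=100)
write_version(cf, "mylist", 200, ["zed"], previous_version=150)
# names between "bar" and "zed" written to between versions 150 and 200:
print(query(cf, "mylist", 200, 150, "bar", "zed"))  # ['bar', 'zed']
```

In Cassandra terms the query is a single column slice on one row, with the 
version filtering done client side.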

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 27/01/2012, at 5:31 AM, Bryce Allen wrote:

> Thanks, comments inline:
> 
> On Mon, 23 Jan 2012 20:59:34 +1300
> aaron morton <aa...@thelastpickle.com> wrote:
>> It depends a bit on the data and the query patterns. 
>> 
>> * How many versions do you have ? 
> We may have 10k versions in some cases, with up to a million names
> total in any given version but more often <10K. To manage this we are
> currently using two CFs, one for storing compacted complete lists and
> one for storing deltas on the compacted list. Based on usage, we will
> create a new compacted list and start writing deltas against that. We
> should be able to limit the number of deltas in a single row to below
> 100; I'd like to be able to keep it lower but I'm not sure we can
> maintain that under all load scenarios. The compacted lists are
> straightforward, but there are many ways to structure the deltas and
> they all have trade offs. A CF with composite columns that supported
> two dimensional slicing would be perfect.
> 
>> * How many names in each version ?
> We plan on limiting to a total of 1 million names, and around 10,000 per
> version (by limiting the batch size), but many deltas will have <10
> names.
> 
>> * When querying do you know the versions numbers you want to query
>> from ? How many are there normally?
> Currently we don't know the version numbers in advance - they are
> timestamps, and we are querying for versions less than or equal to the
> desired timestamp. We have talked about using vector clock versions and
> maintaining an index mapping time to version numbers, in which case we
> would know the exact versions after the index lookup, at the expense of
> another RTT on every operation.
> 
>> * How frequent are the updates and the reads ?
> We expect reads to be more frequent than writes. Unfortunately we don't
> have solid numbers on what to expect, but I would guess 20x. Update
> operations will involve several reads to determine where to write.
> 
> 
>> I would lean towards using two standard CF's, one to list all the
>> version numbers (in a single row probably) and one to hold the names
>> in a particular version. 
>> 
>> To do your query slice the first CF and then run multi gets to the
>> second. 
>> 
>> Thats probably not the best solution, if you can add some more info
>> it may get better.
> I'm actually leaning back toward BOP, as I run into more issues
> and complexity with the RP models. I'd really like to implement both
> and compare them, but at this point I need to focus on one to get
> things working, so I'm trying to make a best initial guess.
> 
> 
>> 
>> On 21/01/2012, at 6:20 AM, Bryce Allen wrote:
>> 
>>> I'm storing very large versioned lists of names, and I'd like to
>>> query a range of names within a given range of versions, which is a
>>> two dimensional slice, in a single query. This is easy to do using
>>> ByteOrderedPartitioner, but seems to require multiple (non parallel)
>>> queries and extra CFs when using RandomPartitioner.
>>> 
>>> I see two approaches when using RP:
>>> 
>>> 1) Data is stored in a super column family, with one dimension being
>>> the super column names and the other the sub column names. Since
>>> slicing on sub columns requires a list of super column names, a
>>> second standard CF is needed to get a range of names before doing a
>>> query on the main super CF. With CASSANDRA-2710, the same is
>>> possible using a standard CF with composite types instead of a
>>> super CF.
>>> 
>>> 2) If one of the dimensions is small, a two dimensional slice isn't
>>> required. The data can be stored in a standard CF with linear
>>> ordering on a composite type (large_dimension, small_dimension).
>>> Data is queried based on the large dimension, and the client throws
>>> out the extra data in the other dimension.
>>> 
>>> Neither of the above solutions are ideal. Does anyone else have a
>>> use case where two dimensional slicing is useful? Given the
>>> disadvantages of BOP, is it practical to make the composite column
>>> query model richer to support this sort of use case?
>>> 
>>> Thanks,
>>> Bryce
>> 
