On 06/04/2010 15:26, Eric Evans wrote:
> On Tue, 2010-04-06 at 12:00 +0100, Steve wrote:
>   
>> First, I apologise sending this to the 'dev' mailing list - I couldn't
>> find one for Cassandra users - and also for the basic nature of my
>> questions...
>>     
> user@cassandra.apache.org, (follow-ups there).
>   
Thanks...  I'm now subscribed.
>> I'm trying to get my head around the possibility of using Cassandra as
>> the back-end to a project... and while, in most respects, Cassandra
>> looks absolutely ideal... I'm finding it difficult to ascertain an
>> appropriate strategy to ensure consistency (which would come 'for free'
>> with a traditional, transactional, back end.)
>>
>> As a sufficient abstraction of my storage requirements, imagine two
>> (application oriented) universes of SHA-512 hashes - SRC and DST (each
>> will, in practice, be tagged with other application data).  I need to
>> support a remote API to manage a one-many mapping from SRC to DST, and a
>> consistent (efficiently addressable) one-one mapping from DST to SRC.  I
>> need to envisage millions of clients and tens of billions of mappings
>> with billions of updates and lookups daily...
>>
>> newAssoc(s:SRC,d:DST)
>> listAssoc(s:SRC) => List<d:DST>
>> delAssoc(d:DST)
>> lookupAssoc(d:DST) => s:SRC
>>
>> If I were using BDB, I'd have two maps - the first with s:SRC as key and
>> d:DST as value - the second with (d:DST,s:SRC) as key with no values....
>> and update these maps in a transaction.
>>     
> You could model it like this with Cassandra as well.
>
> It sounds like the real question though is, how can you structure this
> to work given Cassandra's eventual consistency and record-level
> atomicity?
>   
Yes, that's another way to ask the same question.

Based upon my understanding of Cassandra to date, it seems to me that
with two maps (i.e. the same data structure as I suggested for BDB) I'd
not get the same consistency... even eventually... as a consequence of
the record-level atomicity.  I have to assume that a fail-stop can occur
between writing the first and second keys... and under those
circumstances the stored data will be inconsistent with the logical
model's requirements.  Sure, I could write a program to verify that, for
each SRC value, every associated DST value is mapped back (assuming the
SRC->DST mapping is written first), but that would likely lead to
prohibitively slow start-up times and/or significant overhead on every
access.
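
To make that concrete, the layout and verification pass I have in mind
look roughly like the sketch below.  The 'store' object is just a
hypothetical stand-in for whatever client/Thrift calls would really be
used, and 'SrcToDst'/'DstToSrc' are illustrative column family names:

def new_assoc(store, src, dst):
    # Forward mapping first, inverted index second.  A fail-stop
    # between the two writes leaves a SrcToDst column with no matching
    # DstToSrc row - exactly the inconsistency described above.
    store.insert("SrcToDst", key=src, column=dst, value=b"")
    store.insert("DstToSrc", key=dst, column="src", value=src)

def verify_pass(store):
    # The verification program mentioned above: for every SRC->DST
    # pair, check that the reverse mapping exists and re-write it if
    # the second insert was lost.  Idempotent, but it has to walk the
    # whole SrcToDst map - hence the start-up/overhead concern.
    for src, columns in store.scan("SrcToDst"):
        for dst in columns:
            if store.get("DstToSrc", key=dst, column="src") is None:
                store.insert("DstToSrc", key=dst, column="src", value=src)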

> For example, QUORUM consistency for reads and writes are enough to
> ensure that your SRC->DST mappings remain unique and consistent from the
> perspective of the client. And, if you can't make your application
> resilient to inconsistencies in the inverted index (detect, repair,
> etc), you could always use a Zookeeper-based mutex.
>   
It looks impractical for me to make the client resilient to
inconsistent data in this context (though, I agree, that would be the
ideal solution if it were viable... and it will be the preferred
approach elsewhere in the project).

I've read all about QUORUM, and it is generally useful, but as far as I
can tell, it can't give me a transaction... if my replicated servers are
all supplied by a common mains electricity source, it's very likely that
all the replicated nodes will fail-stop at exactly the same time...
hence providing me with no protection against one map being updated and
not the other.  While I've only recently started looking at Zookeeper, I
suspect (at least partly because I understand mutexes) that it won't
address this issue either.  I guess it would be helpful if I needed to
be sure that two identical DST hashes are not mapped to different SRC
hashes at (almost exactly) the same time - but given that no collisions
are expected... I can probably back out from that error condition
safely enough without needing to introduce the bottleneck of running all
my updates through a single Zookeeper process.
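
For what it's worth, the kind of best-effort back-out I'm thinking of
(QUORUM read, write, QUORUM read-back - no Zookeeper) would look
something like the sketch below, again using the hypothetical 'store'
stand-in rather than a real driver API.  It clearly isn't a mutex - two
simultaneous claims can still race - but given that genuine hash
collisions aren't expected, that seems acceptable:

def claim_dst(store, src, dst):
    # Best-effort uniqueness for a DST hash, with no global mutex.
    owner = store.get("DstToSrc", key=dst, column="src")   # QUORUM read
    if owner is not None and owner != src:
        return False                  # DST already claimed by another SRC
    store.insert("DstToSrc", key=dst, column="src", value=src)
    # Read back (again at QUORUM) and confirm this write is still the
    # winner; if a concurrent claim overwrote it, back out by reporting
    # failure and never writing the forward SrcToDst entry.  There is
    # still a small race window here - this is detection, not locking.
    if store.get("DstToSrc", key=dst, column="src") != src:
        return False
    store.insert("SrcToDst", key=src, column=dst, value=b"")
    return True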

> If you haven't already I'd suggest reading the Amazon whitepaper on
> Dynamo[1] to understand eventual consistency, and Cassandra's API[2]
> docs for how to apply it here.
>
> [1]:
> http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
> [2]: http://wiki.apache.org/cassandra/API#ConsistencyLevel
>   
I'd already read the API docs - the Dynamo paper is interesting, too...
though I didn't see anything especially relevant in it.  Perhaps I'm
overlooking what you wanted to point out?

>> If I were in SQL land, I'd need a table a bit like this:
>>
>> create table Assoc( src binary(64) , dst binary(64) unique, primary key
>> (src,dst) )
>>
>> The implementations of the API calls would be trivial insert, delete and
>> select operations - and consistency between the primary key and the
>> implicit (unique constraint) index would arise as a consequence of
>> transactions.  I realise that, with Cassandra, I need a different
>> approach - since I don't have the same notion of transactions on which
>> to rely... and, in any case, given a desire for scalability, relying
>> upon such fine grained transactions would definitely prove a
>> bottleneck.  That said, the uniqueness of DST values is systemically
>> critical - so, even while I do not anticipate duplicate hashes in
>> practice, I need uniqueness to be verified - and for the second SRC
>> values asking to associate with an existing DST value to fail without
>> violating the one->one mapping from DST to SRC... and for this failure
>> to be notified ASAP.
>>
>> It strikes me that a plausible design might be one that writes a log of
>> 'insert/delete' with pairs of hashes which some background process
>> eventually indexes in a big batch... before clearing the transaction
>> log.  If this is "The Cassandra Way" - I'm surprised not to have found
>> any examples... am I looking in the wrong places for them?  Is my log of
>> 'insert' and 'delete' operations something I'd implement myself using
>> ad-hoc techniques, or is there explicit support for this in Cassandra? 
>> Do I need to develop my own process (from scratch) to merge updates with
>> on-disk data - or is there a neat way to get Cassandra to do that for me?
>>
>> Another issue I'm considering is if I should map from SRC to a list of
>> DST as my low-level representation with Cassandra... or should I map
>> individually.  A potential problem is that one SRC value can map to
>> arbitrarily many DST values.   At the level of the RPC API, I can
>> address this by returning an object resembling a scrollable cursor
>> instead of a static list - but, I guess, I'd need to be concerned about
>> resource limitations (memory, etc.) for the on-disk representation?  I
>> presume that there's a significant advantage to storing the one-to-many
>> map explicitly (locality of reference, for example) - as well as
>> minimising the size of the encoded data... I'm guessing that there is no
>> prefix-compression for keys?  Key compression would likely lead to the
>> opposite architectural decisions from a resource-use perspective... and
>> would eliminate concerns about maps from single SRC values to very large
>> numbers of DST values.
>>     
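On the one-to-many representation: for the cursor-style API I had
imagined paging through the wide SRC row a slice at a time, something
like the sketch below (the 'get_slice' call is part of the same
hypothetical 'store' stand-in used earlier, mirroring Cassandra's
slice-style reads rather than quoting its actual API):

def list_assoc(store, src, page_size=1000):
    # Cursor-style listing of an arbitrarily large SRC->DST mapping:
    # read the wide SrcToDst row one slice at a time instead of pulling
    # every column into memory at once.  get_slice returns up to
    # 'count' column names >= 'start', in comparator order.
    start = ""
    while True:
        columns = store.get_slice("SrcToDst", key=src,
                                  start=start, count=page_size)
        if start:
            columns = columns[1:]     # drop the overlapping first column
        if not columns:
            return
        for dst in columns:
            yield dst
        start = columns[-1]           # resume from the last column seen
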
I'm still interested in whether or not Cassandra does (or will) support
key prefix compression...  That could greatly influence architectural
choices.

I'm also interested to learn whether others effectively gain ACID
transactions by first writing application-oriented transactional data to
a log (which can be applied on-the-fly to data read from the main
keymap) while a background process applies those operations in an
idempotent style - thus securing the appearance of traditional
transactions.  Is this approach not recommended because, where Cassandra
has typically been deployed, it simply hasn't been necessary - or is
there a deeper reason?
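
Roughly, the pattern I'm describing is the one sketched below (same
hypothetical 'store' interface as before; 'TxLog' is just an
illustrative name, and the on-the-fly merge of pending log entries into
reads is omitted):

import uuid

def log_assoc(store, op, src, dst):
    # Step 1: append an intent record; the two real maps are only a
    # materialised view of this log.
    entry = str(uuid.uuid1())         # time-ordered id for the log row
    store.insert("TxLog", key=entry, column="op", value=op)
    store.insert("TxLog", key=entry, column="src", value=src)
    store.insert("TxLog", key=entry, column="dst", value=dst)
    return entry

def apply_log(store):
    # Step 2 (background process): replay every pending entry.  Each
    # replay is idempotent - re-inserting or re-deleting the same
    # mapping is a no-op - so a crash part-way through is harmless;
    # an entry is cleared only after both maps have been written.
    for entry, columns in store.scan("TxLog"):
        op, src, dst = columns["op"], columns["src"], columns["dst"]
        if op == "insert":
            store.insert("SrcToDst", key=src, column=dst, value=b"")
            store.insert("DstToSrc", key=dst, column="src", value=src)
        elif op == "delete":
            store.remove("DstToSrc", key=dst, column="src")
            store.remove("SrcToDst", key=src, column=dst)
        store.remove("TxLog", key=entry)   # row-level remove of the entry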
