I'm actually using Curator as a ZooKeeper client myself. I haven't used it in production yet, but so far it seems well written, and Jordan Zimmerman at Netflix has been great on the support end as well.
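For anyone who hasn't tried it yet, a lock acquire/release with Curator's InterProcessMutex recipe looks roughly like the sketch below. The connection string and lock path are placeholders, and the package names are from the Netflix-era releases (they may differ in later versions):

    import java.util.concurrent.TimeUnit;

    import com.netflix.curator.framework.CuratorFramework;
    import com.netflix.curator.framework.CuratorFrameworkFactory;
    import com.netflix.curator.framework.recipes.locks.InterProcessMutex;
    import com.netflix.curator.retry.ExponentialBackoffRetry;

    public class CuratorLockSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection string; retry with exponential backoff.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
            if (lock.acquire(10, TimeUnit.SECONDS)) {   // bounded wait for the lock
                try {
                    // ... do the guarded Cassandra work here ...
                } finally {
                    lock.release();
                }
            }
            client.close();
        }
    }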
I haven't tried Cages so I can't really compare, but I think one of the main deciding factors between the two is which ZooKeeper recipes you need.

John

On Thu, Dec 15, 2011 at 12:07 AM, Boris Yen <yulin...@gmail.com> wrote:

> I am not sure if this is the right thread to ask about this.
>
> I read that some people are using Cages + ZooKeeper. I was wondering if anyone has evaluated https://github.com/Netflix/curator? It seems to be a versatile package.
>
> On Tue, Dec 13, 2011 at 6:06 AM, John Laban <j...@pagerduty.com> wrote:
>
>> Ok, great. I'll be sure to look into the virtualization-specific NTP guides.
>>
>> Another benefit of using Cassandra over ZooKeeper for locking is that you don't have to worry about losing your connection to ZooKeeper (and with it your locks) while hammering away at data in Cassandra. If you use Cassandra for locks, then losing your locks means you've lost your connection to the datastore too. (We're using long-ish session timeouts plus connection listeners in ZK to mitigate that now.)
>>
>> John
>>
>> On Mon, Dec 12, 2011 at 12:55 PM, Dominic Williams <dwilli...@fightmymonster.com> wrote:
>>
>>> Hi John,
>>>
>>> On 12 December 2011 19:35, John Laban <j...@pagerduty.com> wrote:
>>>>
>>>> So I responded to your algorithm in another part of this thread (very interesting), but this part of the paper caught my attention:
>>>>
>>>> > When client application code releases a lock, that lock must not actually be released for a period equal to one millisecond plus twice the maximum possible drift of the clocks in the client computers accessing the Cassandra databases
>>>>
>>>> I've been worried about this, and added some arbitrary delay to the releasing of my locks. But I don't like it, as it's (A) an arbitrary value and (B) it will - perhaps greatly - reduce the throughput of the more high-contention areas of my system.
>>>>
>>>> To fix (B) I'll probably just have to try to get rid of locks altogether in these high-contention areas.
>>>>
>>>> To fix (A), I'd need to know what the maximum possible drift of my clocks will be. How did you determine this? What value do you use, out of curiosity? What does the network layout of your client machines look like? (Are any of your hosts geographically separated, or are they all running in the same DC? What's the maximum latency between hosts? etc.) Do you monitor the clock skew on an ongoing basis? Am I worrying too much?
>>>>
>>> If you set up NTP carefully, no machine should drift by more than, say, 4 ms. I forget where, but you'll find the best documentation on building a bullet-proof NTP setup on the vendor sites for virtualization software (because virtualization can cause drift, the NTP setup has to be just so).
>>>
>>> What this means is that, for example, to be really safe when a thread releases a lock it should first wait, say, 9 ms (one millisecond plus twice the 4 ms maximum drift).
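Here is a minimal sketch of that delay-before-release rule. The 4 ms drift bound is an assumption taken from the NTP discussion above, and DistributedLock is a hypothetical stand-in for whatever lock recipe is actually in use:

    public class DelayedLockRelease {
        // Assumed worst-case clock drift with a careful NTP setup.
        static final long MAX_CLOCK_DRIFT_MS = 4;
        // One millisecond plus twice the maximum drift, per the rule quoted above.
        static final long RELEASE_DELAY_MS = 1 + 2 * MAX_CLOCK_DRIFT_MS;   // = 9 ms

        // Hypothetical stand-in for the distributed lock being released.
        interface DistributedLock {
            void release();
        }

        static void releaseSafely(DistributedLock lock) throws InterruptedException {
            // Sleep *before* releasing: an uncontended caller pays nothing extra,
            // and only a thread already waiting for (or immediately re-requesting)
            // the lock sees the delay.
            Thread.sleep(RELEASE_DELAY_MS);
            lock.release();
        }
    }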
>>> Some points:
>>>
>>> -- Since the sleep is performed before release, an isolated operation should not be delayed at all.
>>>
>>> -- Only a waiting thread, or a thread requesting the lock immediately after it is released, will be delayed, and no extra CPU or memory load is involved.
>>>
>>> -- In practice, for the vast majority of "application layer" data operations this restriction will have no effect on overall performance as experienced by a user, because such operations nearly always read and write data with limited scope, for example the data of the two users involved in some transaction.
>>>
>>> -- The clocks issue does mean that you can't really serialize access to more broadly shared data where more than 5 or 10 such requests are made per second, say. But in reality, even if the extra 9 ms sleep on release weren't necessary, variability in database operation execution time (say under load, or when something goes wrong) means trouble might occur when serializing at that level of contention anyway.
>>>
>>> So in summary, although this drift thing seems bad at first, partly because it is a new consideration, in practice it's no big deal so long as you look after your clocks. The main issue to watch out for is application nodes running on virtualization software, hypervisors et al. whose setup issues make their clocks drift under load, and it is a good idea to be wary of that.
>>>
>>> Best, Dominic
>>>
>>>> Sorry for all the questions, but I'm very concerned about this particular problem :)
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>> On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams <dwilli...@fightmymonster.com> wrote:
>>>>
>>>>> Hi guys, just thought I'd chip in...
>>>>>
>>>>> Fight My Monster is still using Cages, which is working fine, but...
>>>>>
>>>>> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are 2 main reasons:
>>>>>
>>>>> 1. Although a fast ZooKeeper cluster can handle a lot of load (we aren't getting anywhere near capacity, and we do a *lot* of serialisation), at some point it will be necessary to start hashing lock paths onto separate ZooKeeper clusters, and I tend to believe that these days you should choose platforms that handle sharding themselves (e.g. choose Cassandra rather than MySQL).
>>>>>
>>>>> 2. Why have more components in your system when you can have fewer!!! KISS.
>>>>>
>>>>> Recently I therefore tried to devise an algorithm which can be used to add a distributed locking layer to clients such as Pelops, Hector, Pycassa etc.
>>>>>
>>>>> There is a doc describing the algorithm, to which may be added an appendix describing a protocol so that locking can be interoperable between the clients. That could be extended to describe a protocol for transactions. Word of warning: this is a *beta* algorithm that has only been seen by a select group so far, so I'm not even 100% sure it works. But there is a useful general discussion regarding serialization of reads/writes, so I include it anyway (and since this algorithm is going to be out there now, if there's anyone out there who fancies doing a Z proof or disproof, that would be fantastic).
>>>>> http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf
>>>>>
>>>>> Final word on this re transactions: if/when transactions are added to the locking system in Pelops/Hector/Pycassa, Cassandra will provide better performance than ZooKeeper for storing snapshots, especially as transaction size increases.
>>>>>
>>>>> Best, Dominic
>>>>>
>>>>> On 11 December 2011 01:53, Guy Incognito <dnd1...@gmail.com> wrote:
>>>>>
>>>>>> you could try writing with the clock of the initial replay entry?
>>>>>>
>>>>>> On 06/12/2011 20:26, John Laban wrote:
>>>>>>
>>>>>> Ah, neat. It is similar to what was proposed in (4) above with adding transactions to Cages, but instead of snapshotting the data to be rolled back (the "before" data), you snapshot the data to be replayed (the "after" data). And then later, if you find that the transaction didn't complete, you just keep replaying the transaction until it takes.
>>>>>>
>>>>>> The part I don't understand with this approach, though: how do you ensure that someone else didn't change the data between your initial failed transaction and the later replaying of the transaction? You could get lost writes in that situation.
>>>>>>
>>>>>> Dominic (in the Cages blog post) explained a workaround for that in his rollback proposal: all subsequent readers or writers of that data would have to check for abandoned transactions and roll them back themselves before they could read the data. I don't think this is possible with the XACT_LOG "replay" approach in these slides, though, based on how the data is indexed (cassandra node token + timeUUID).
>>>>>>
>>>>>> PS: How are you liking Cages?
>>>>>>
>>>>>> 2011/12/6 Jérémy SEVELLEC <jsevel...@gmail.com>
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> I had exactly the same reflections.
>>>>>>>
>>>>>>> I'm using ZooKeeper and Cages to lock and isolate.
>>>>>>>
>>>>>>> But how do you roll back? It's impossible, so try replay!
>>>>>>>
>>>>>>> The idea is explained in this presentation: http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting from slide 24)
>>>>>>>
>>>>>>> - insert your whole data into one column
>>>>>>> - do the job
>>>>>>> - remove (or expire) your column
>>>>>>>
>>>>>>> If there is a problem while "doing the job", you keep the possibility to replay, and replay, and replay (synchronously or in a batch).
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Jérémy
>>>>>>>
>>>>>>> 2011/12/5 John Laban <j...@pagerduty.com>
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm building a system using Cassandra as a datastore, and I have a few places where I am in need of transactions.
>>>>>>>>
>>>>>>>> I'm using ZooKeeper to provide locking when I'm in need of some concurrency control or isolation, so that solves that half of the puzzle.
>>>>>>>>
>>>>>>>> What I need now is to sometimes be able to get atomicity across multiple writes by simulating the "begin/rollback/commit" abilities of a relational DB. In other words, there are places where I need to perform multiple updates/inserts, and if I fail partway through, I would ideally be able to roll back the partially-applied updates.
>>>>>>>>
>>>>>>>> Now, I *know* this isn't possible with Cassandra.
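As an aside, here is a rough sketch of the log-then-replay pattern Jérémy describes above. The XactLog column family, the ColumnStore interface and the sweeper are hypothetical stand-ins rather than any particular client's API:

    public class ReplayLogSketch {
        // Hypothetical thin client interface; in practice these would be
        // Hector/Pelops/Pycassa mutations.
        interface ColumnStore {
            void insert(String cf, String rowKey, String colName, byte[] value, int ttlSeconds);
            void delete(String cf, String rowKey, String colName);
        }

        private final ColumnStore store;

        ReplayLogSketch(ColumnStore store) {
            this.store = store;
        }

        void runWithReplayLog(String xactId, byte[] serializedMutations) {
            // 1. Log the whole batch in a single column first. A single-column
            //    write is atomic, so the log entry is either fully there or not.
            store.insert("XactLog", xactId, "payload", serializedMutations, 0 /* or a TTL */);

            // 2. Do the job: apply the real mutations to their target rows.
            applyMutations(serializedMutations);

            // 3. On success, remove (or let expire) the log column. If step 2
            //    fails, a sweeper scanning XactLog can keep replaying the payload
            //    until it takes - which is why the mutations must be idempotent.
            store.delete("XactLog", xactId, "payload");
        }

        private void applyMutations(byte[] serializedMutations) {
            // Deserialize and apply each mutation via the client; omitted here.
        }
    }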
>>>>>>>> What I'm looking for are all the best practices, or at least tips and tricks, so that I can get around this limitation in Cassandra and still maintain a consistent datastore. (I am using quorum reads/writes so that eventual consistency doesn't kick my ass here as well.)
>>>>>>>>
>>>>>>>> Below are some ideas I've been able to dig up. Please let me know if any of them don't make sense, or if there are better approaches:
>>>>>>>>
>>>>>>>> 1) Updates to a row in a column family are atomic, so try to model your data so that you would only ever need to update a single row in a single CF at once. Essentially, you model your data around transactions. This is tricky, but can certainly be done in some situations.
>>>>>>>>
>>>>>>>> 2) If you are only dealing with multiple row *inserts* (and not updates), have one of the rows act as a 'commit' by essentially validating the presence of the other rows. For example, say you were performing an operation where you wanted to create an Account row and 5 User rows all at once (this is an unlikely example, but bear with me). You could insert the 5 rows into the Users CF, and then the 1 row into the Accounts CF, which acts as the commit. If something went wrong before the Account could be created, any Users that had been created so far would be orphaned and unusable, as your business logic can ensure that they can't exist without an Account. You could also have an offline cleanup process that swept away orphans.
>>>>>>>>
>>>>>>>> 3) Try to model your updates as idempotent column inserts instead. How do you model updates as inserts? Instead of munging the value directly, you could insert a column containing the operation you want to perform (like "+5"). It would work kind of like the Consistent Vote Counting implementation ( https://gist.github.com/416666 ). How do you make the inserts idempotent? Make sure the column names correspond to a request ID or some other identifier that would be identical across re-drives of a given (perhaps originally failed) request. This could leave your datastore in a temporarily inconsistent state, but it would eventually become consistent after a successful re-drive of the original request. (A sketch of this pattern is included at the bottom of this post.)
>>>>>>>>
>>>>>>>> 4) You could take an approach like Dominic Williams proposed with Cages: http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
>>>>>>>> The gist is that you snapshot all the original values that you're about to munge somewhere else (in his case, ZooKeeper), make your updates, and then delete the snapshot (and that delete needs to be atomic). If the snapshot data was never deleted, then subsequent accessors (even readers) of the data rows need to do the rollback of the previous transaction themselves before they can read/write this data. They do the rollback by just overwriting the current values with what is in the snapshot. It offloads the work of the rollback to the next worker that accesses the data.
>>>>>>>> This approach probably needs a generic/high-level programming layer to handle all of the details and complexity, and it doesn't seem like it was ever added to Cages.
>>>>>>>>
>>>>>>>> Are there other approaches or best practices that I missed? I would be very interested in hearing any opinions from those who have tackled these problems before.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> John
>>>>>>>
>>>>>>> --
>>>>>>> Jérémy
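As referenced in idea (3) above, here is a minimal sketch of modelling an update as an idempotent, operation-carrying column insert. The Counters column family and the ColumnStore interface are hypothetical stand-ins for a real client such as Hector or Pelops:

    import java.util.Map;

    public class IdempotentDeltaSketch {
        // Hypothetical thin client interface standing in for real client calls.
        interface ColumnStore {
            void insert(String cf, String rowKey, String colName, String value);
            Map<String, String> readRow(String cf, String rowKey);
        }

        private final ColumnStore store;

        IdempotentDeltaSketch(ColumnStore store) {
            this.store = store;
        }

        // Record the operation (e.g. a "+5" delta) rather than overwriting the value.
        // The column name is the request ID, so re-driving the same request rewrites
        // the same column instead of applying the delta twice.
        void addDelta(String rowKey, String requestId, long delta) {
            store.insert("Counters", rowKey, requestId, Long.toString(delta));
        }

        // Readers fold the logged operations back into the current value.
        long currentValue(String rowKey) {
            long total = 0;
            for (String op : store.readRow("Counters", rowKey).values()) {
                total += Long.parseLong(op);
            }
            return total;
        }
    }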