I'm actually using Curator as a ZooKeeper client myself. I haven't used it in production yet, but so far it seems well written, and Jordan Zimmerman at Netflix has been great on the support end as well.
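For anyone who hasn't tried it yet, a lock acquire/release with Curator's InterProcessMutex recipe looks roughly like the sketch below. The connection string and lock path are placeholders, and the package names are from the Netflix-era releases (they may differ in later versions):

    import java.util.concurrent.TimeUnit;

    import com.netflix.curator.framework.CuratorFramework;
    import com.netflix.curator.framework.CuratorFrameworkFactory;
    import com.netflix.curator.framework.recipes.locks.InterProcessMutex;
    import com.netflix.curator.retry.ExponentialBackoffRetry;

    public class CuratorLockSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection string; retry with exponential backoff.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
            if (lock.acquire(10, TimeUnit.SECONDS)) {   // bounded wait for the lock
                try {
                    // ... do the guarded Cassandra work here ...
                } finally {
                    lock.release();
                }
            }
            client.close();
        }
    }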
I haven't tried Cages so I can't really compare, but I think one of the main deciding factors between the two is which ZooKeeper recipes you need.

John

On Thu, Dec 15, 2011 at 12:07 AM, Boris Yen <yulin...@gmail.com> wrote:

> I am not sure if this is the right thread to ask about this.
>
> I read that some people are using Cages + ZooKeeper. I was wondering if anyone has evaluated https://github.com/Netflix/curator? It seems to be a versatile package.
>
> On Tue, Dec 13, 2011 at 6:06 AM, John Laban <j...@pagerduty.com> wrote:
>
>> Ok, great. I'll be sure to look into the virtualization-specific NTP guides.
>>
>> Another benefit of using Cassandra over ZooKeeper for locking is that you don't have to worry about losing your connection to ZooKeeper (and with it your locks) while hammering away at data in Cassandra. If you use Cassandra for locks, then losing your locks means you've lost your connection to the datastore too. (We're using long-ish session timeouts plus connection listeners in ZK to mitigate that now.)
>>
>> John
>>
>> On Mon, Dec 12, 2011 at 12:55 PM, Dominic Williams <dwilli...@fightmymonster.com> wrote:
>>
>>> Hi John,
>>>
>>> On 12 December 2011 19:35, John Laban <j...@pagerduty.com> wrote:
>>>>
>>>> So I responded to your algorithm in another part of this thread (very interesting), but this part of the paper caught my attention:
>>>>
>>>> > When client application code releases a lock, that lock must not actually be released for a period equal to one millisecond plus twice the maximum possible drift of the clocks in the client computers accessing the Cassandra databases
>>>>
>>>> I've been worried about this, and added some arbitrary delay to the releasing of my locks. But I don't like it, as it's (A) an arbitrary value and (B) it will - perhaps greatly - reduce the throughput of the more high-contention areas of my system.
>>>>
>>>> To fix (B) I'll probably just have to try to get rid of locks altogether in these high-contention areas.
>>>>
>>>> To fix (A), I'd need to know what the maximum possible drift of my clocks will be. How did you determine this? What value do you use, out of curiosity? What does the network layout of your client machines look like? (Are any of your hosts geographically separated, or are they all running in the same DC? What's the maximum latency between hosts? etc.) Do you monitor the clock skew on an ongoing basis? Am I worrying too much?
>>>>
>>> If you set up NTP carefully, no machine should drift by more than, say, 4 ms. I forget where, but you'll find the best documentation on building a bullet-proof NTP setup on the vendor sites for virtualization software (because virtualization can cause drift, the NTP setup has to be just so).
>>>
>>> What this means is that, for example, to be really safe when a thread releases a lock it should first wait, say, 9 ms (one millisecond plus twice the 4 ms maximum drift).
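Here is a minimal sketch of that delay-before-release rule. The 4 ms drift bound is an assumption taken from the NTP discussion above, and DistributedLock is a hypothetical stand-in for whatever lock recipe is actually in use:

    public class DelayedLockRelease {
        // Assumed worst-case clock drift with a careful NTP setup.
        static final long MAX_CLOCK_DRIFT_MS = 4;
        // One millisecond plus twice the maximum drift, per the rule quoted above.
        static final long RELEASE_DELAY_MS = 1 + 2 * MAX_CLOCK_DRIFT_MS;   // = 9 ms

        // Hypothetical stand-in for the distributed lock being released.
        interface DistributedLock {
            void release();
        }

        static void releaseSafely(DistributedLock lock) throws InterruptedException {
            // Sleep *before* releasing: an uncontended caller pays nothing extra,
            // and only a thread already waiting for (or immediately re-requesting)
            // the lock sees the delay.
            Thread.sleep(RELEASE_DELAY_MS);
            lock.release();
        }
    }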
>>> Some points:
>>>
>>> -- Since the sleep is performed before release, an isolated operation should not be delayed at all.
>>>
>>> -- Only a waiting thread, or a thread requesting the lock immediately after it is released, will be delayed, and no extra CPU or memory load is involved.
>>>
>>> -- In practice, for the vast majority of "application layer" data operations this restriction will have no effect on overall performance as experienced by a user, because such operations nearly always read and write data with limited scope, for example the data of the two users involved in some transaction.
>>>
>>> -- The clocks issue does mean that you can't really serialize access to more broadly shared data where more than 5 or 10 such requests are made per second, say. But in reality, even if the extra 9 ms sleep on release weren't necessary, variability in database operation execution time (say under load, or when something goes wrong) means trouble might occur when serializing at that level of contention anyway.
>>>
>>> So in summary, although this drift thing seems bad at first, partly because it is a new consideration, in practice it's no big deal so long as you look after your clocks. The main issue to watch out for is application nodes running on virtualization software, hypervisors et al. whose setup issues make their clocks drift under load, and it is a good idea to be wary of that.
>>>
>>> Best, Dominic
>>>
>>>> Sorry for all the questions, but I'm very concerned about this particular problem :)
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>> On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams <dwilli...@fightmymonster.com> wrote:
>>>>
>>>>> Hi guys, just thought I'd chip in...
>>>>>
>>>>> Fight My Monster is still using Cages, which is working fine, but...
>>>>>
>>>>> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are 2 main reasons:
>>>>>
>>>>> 1. Although a fast ZooKeeper cluster can handle a lot of load (we aren't getting anywhere near capacity, and we do a *lot* of serialisation), at some point it will be necessary to start hashing lock paths onto separate ZooKeeper clusters, and I tend to believe that these days you should choose platforms that handle sharding themselves (e.g. choose Cassandra rather than MySQL).
>>>>>
>>>>> 2. Why have more components in your system when you can have fewer!!! KISS.
>>>>>
>>>>> Recently I therefore tried to devise an algorithm which can be used to add a distributed locking layer to clients such as Pelops, Hector, Pycassa etc.
>>>>>
>>>>> There is a doc describing the algorithm, to which may be added an appendix describing a protocol so that locking can be interoperable between the clients. That could be extended to describe a protocol for transactions. Word of warning: this is a *beta* algorithm that has only been seen by a select group so far, so I'm not even 100% sure it works. But there is a useful general discussion regarding serialization of reads/writes, so I include it anyway (and since this algorithm is going to be out there now, if there's anyone out there who fancies doing a Z proof or disproof, that would be fantastic).
>>>>> http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf
>>>>>
>>>>> Final word on this re transactions: if/when transactions are added to the locking system in Pelops/Hector/Pycassa, Cassandra will provide better performance than ZooKeeper for storing snapshots, especially as transaction size increases.
>>>>>
>>>>> Best, Dominic
>>>>>
>>>>> On 11 December 2011 01:53, Guy Incognito <dnd1...@gmail.com> wrote:
>>>>>
>>>>>> you could try writing with the clock of the initial replay entry?
>>>>>>
>>>>>> On 06/12/2011 20:26, John Laban wrote:
>>>>>>
>>>>>> Ah, neat. It is similar to what was proposed in (4) above with adding transactions to Cages, but instead of snapshotting the data to be rolled back (the "before" data), you snapshot the data to be replayed (the "after" data). And then later, if you find that the transaction didn't complete, you just keep replaying the transaction until it takes.
>>>>>>
>>>>>> The part I don't understand with this approach, though: how do you ensure that someone else didn't change the data between your initial failed transaction and the later replaying of the transaction? You could get lost writes in that situation.
>>>>>>
>>>>>> Dominic (in the Cages blog post) explained a workaround for that in his rollback proposal: all subsequent readers or writers of that data would have to check for abandoned transactions and roll them back themselves before they could read the data. I don't think this is possible with the XACT_LOG "replay" approach in these slides, though, based on how the data is indexed (cassandra node token + timeUUID).
>>>>>>
>>>>>> PS: How are you liking Cages?
>>>>>>
>>>>>> 2011/12/6 Jérémy SEVELLEC <jsevel...@gmail.com>
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> I had exactly the same reflections.
>>>>>>>
>>>>>>> I'm using ZooKeeper and Cages to lock and isolate.
>>>>>>>
>>>>>>> But how do you roll back? It's impossible, so try replay!
>>>>>>>
>>>>>>> The idea is explained in this presentation: http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting from slide 24)
>>>>>>>
>>>>>>> - insert your whole data into one column
>>>>>>> - do the job
>>>>>>> - remove (or expire) your column
>>>>>>>
>>>>>>> If there is a problem while "doing the job", you keep the possibility to replay, and replay, and replay (synchronously or in a batch).
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Jérémy
>>>>>>>
>>>>>>> 2011/12/5 John Laban <j...@pagerduty.com>
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm building a system using Cassandra as a datastore, and I have a few places where I am in need of transactions.
>>>>>>>>
>>>>>>>> I'm using ZooKeeper to provide locking when I'm in need of some concurrency control or isolation, so that solves that half of the puzzle.
>>>>>>>>
>>>>>>>> What I need now is to sometimes be able to get atomicity across multiple writes by simulating the "begin/rollback/commit" abilities of a relational DB. In other words, there are places where I need to perform multiple updates/inserts, and if I fail partway through, I would ideally be able to roll back the partially-applied updates.
>>>>>>>>
>>>>>>>> Now, I *know* this isn't possible with Cassandra.
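As an aside, here is a rough sketch of the log-then-replay pattern Jérémy describes above. The XactLog column family, the ColumnStore interface and the sweeper are hypothetical stand-ins rather than any particular client's API:

    public class ReplayLogSketch {
        // Hypothetical thin client interface; in practice these would be
        // Hector/Pelops/Pycassa mutations.
        interface ColumnStore {
            void insert(String cf, String rowKey, String colName, byte[] value, int ttlSeconds);
            void delete(String cf, String rowKey, String colName);
        }

        private final ColumnStore store;

        ReplayLogSketch(ColumnStore store) {
            this.store = store;
        }

        void runWithReplayLog(String xactId, byte[] serializedMutations) {
            // 1. Log the whole batch in a single column first. A single-column
            //    write is atomic, so the log entry is either fully there or not.
            store.insert("XactLog", xactId, "payload", serializedMutations, 0 /* or a TTL */);

            // 2. Do the job: apply the real mutations to their target rows.
            applyMutations(serializedMutations);

            // 3. On success, remove (or let expire) the log column. If step 2
            //    fails, a sweeper scanning XactLog can keep replaying the payload
            //    until it takes - which is why the mutations must be idempotent.
            store.delete("XactLog", xactId, "payload");
        }

        private void applyMutations(byte[] serializedMutations) {
            // Deserialize and apply each mutation via the client; omitted here.
        }
    }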
>>>>>>>> What I'm looking for are all the best practices, or at least tips and tricks, so that I can get around this limitation in Cassandra and still maintain a consistent datastore. (I am using quorum reads/writes so that eventual consistency doesn't kick my ass here as well.)
>>>>>>>>
>>>>>>>> Below are some ideas I've been able to dig up. Please let me know if any of them don't make sense, or if there are better approaches:
>>>>>>>>
>>>>>>>> 1) Updates to a row in a column family are atomic, so try to model your data so that you would only ever need to update a single row in a single CF at once. Essentially, you model your data around transactions. This is tricky, but can certainly be done in some situations.
>>>>>>>>
>>>>>>>> 2) If you are only dealing with multiple row *inserts* (and not updates), have one of the rows act as a 'commit' by essentially validating the presence of the other rows. For example, say you were performing an operation where you wanted to create an Account row and 5 User rows all at once (this is an unlikely example, but bear with me). You could insert the 5 rows into the Users CF, and then the 1 row into the Accounts CF, which acts as the commit. If something went wrong before the Account could be created, any Users that had been created so far would be orphaned and unusable, as your business logic can ensure that they can't exist without an Account. You could also have an offline cleanup process that swept away orphans.
>>>>>>>>
>>>>>>>> 3) Try to model your updates as idempotent column inserts instead. How do you model updates as inserts? Instead of munging the value directly, you could insert a column containing the operation you want to perform (like "+5"). It would work kind of like the Consistent Vote Counting implementation ( https://gist.github.com/416666 ). How do you make the inserts idempotent? Make sure the column names correspond to a request ID or some other identifier that would be identical across re-drives of a given (perhaps originally failed) request. This could leave your datastore in a temporarily inconsistent state, but it would eventually become consistent after a successful re-drive of the original request. (A sketch of this pattern is included at the bottom of this post.)
>>>>>>>>
>>>>>>>> 4) You could take an approach like Dominic Williams proposed with Cages: http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
>>>>>>>> The gist is that you snapshot all the original values that you're about to munge somewhere else (in his case, ZooKeeper), make your updates, and then delete the snapshot (and that delete needs to be atomic). If the snapshot data was never deleted, then subsequent accessors (even readers) of the data rows need to do the rollback of the previous transaction themselves before they can read/write this data. They do the rollback by just overwriting the current values with what is in the snapshot. It offloads the work of the rollback to the next worker that accesses the data.
>>>>>>>> This approach probably needs a generic/high-level programming layer to handle all of the details and complexity, and it doesn't seem like it was ever added to Cages.
>>>>>>>>
>>>>>>>> Are there other approaches or best practices that I missed? I would be very interested in hearing any opinions from those who have tackled these problems before.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> John
>>>>>>>
>>>>>>> --
>>>>>>> Jérémy
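As referenced in idea (3) above, here is a minimal sketch of modelling an update as an idempotent, operation-carrying column insert. The Counters column family and the ColumnStore interface are hypothetical stand-ins for a real client such as Hector or Pelops:

    import java.util.Map;

    public class IdempotentDeltaSketch {
        // Hypothetical thin client interface standing in for real client calls.
        interface ColumnStore {
            void insert(String cf, String rowKey, String colName, String value);
            Map<String, String> readRow(String cf, String rowKey);
        }

        private final ColumnStore store;

        IdempotentDeltaSketch(ColumnStore store) {
            this.store = store;
        }

        // Record the operation (e.g. a "+5" delta) rather than overwriting the value.
        // The column name is the request ID, so re-driving the same request rewrites
        // the same column instead of applying the delta twice.
        void addDelta(String rowKey, String requestId, long delta) {
            store.insert("Counters", rowKey, requestId, Long.toString(delta));
        }

        // Readers fold the logged operations back into the current value.
        long currentValue(String rowKey) {
            long total = 0;
            for (String op : store.readRow("Counters", rowKey).values()) {
                total += Long.parseLong(op);
            }
            return total;
        }
    }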