Thanks a lot for the help! I have read the post and think 0.8 might be good enough for me, especially 0.8.5. Changing gc_grace_seconds is also an acceptable solution in the meantime.
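To make sure I understood the workaround correctly, this is roughly what I plan to try (just a sketch; "MyKeyspace" / "MyCF" are placeholders, and I am not 100% sure the cassandra-cli attribute is called gc_grace in every version, so please correct me if the syntax is off). In cassandra-cli, raise the tombstone grace period to about 30 days:

    use MyKeyspace;
    update column family MyCF with gc_grace = 2592000;

Then, from the shell, run repair at least once per grace period, one node and one column family at a time:

    nodetool -h <node_host> repair MyKeyspace MyCF

As I understand it, the one hard rule is that every node must be repaired at least once within gc_grace_seconds, otherwise deleted data can reappear.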
On Wed, Sep 14, 2011 at 4:03 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> On Wed, Sep 14, 2011 at 9:27 AM, Yan Chunlu <springri...@gmail.com> wrote:
> > is 0.8 ready for production use?
>
> some related discussion here:
> http://www.mail-archive.com/user@cassandra.apache.org/msg17055.html
> but my personal answer is yes.
>
> > as I know currently many companies including reddit.com are using 0.7,
> > how do they get rid of the repair problem?
>
> Repair problems in 0.7 don't hit everyone equally. For some people, it
> works relatively well even if not in the most efficient way. Also, for
> some workloads (if you don't do many deletes, for instance), you can set
> a big gc_grace_seconds value (say a month) and only run repair that
> often, which can make repair inefficiencies more bearable.
> That being said, I can't speak for "many companies", but I do advise
> evaluating an upgrade to 0.8.
>
> --
> Sylvain
>
> > On Wed, Sep 14, 2011 at 2:47 PM, Sylvain Lebresne <sylv...@datastax.com>
> > wrote:
> >>
> >> On Wed, Sep 14, 2011 at 2:38 AM, Yan Chunlu <springri...@gmail.com> wrote:
> >> > neither do I want to repair one CF at a time.
> >> > the "node repair" took a week and was still running; compactionstats
> >> > and netstats showed nothing running on any node, and there was also
> >> > no error message, no exception, really no idea what it was doing.
> >>
> >> To add to the list of things repair does wrong in 0.7, we'll have to
> >> add that if one of the nodes participating in the repair (so any node
> >> that shares a range with the node on which repair was started) goes
> >> down (even for a short time), then the repair will simply hang forever
> >> doing nothing. And no specific error message will be logged. That
> >> could be what happened. Again, recent releases of 0.8 fix that too.
> >>
> >> --
> >> Sylvain
> >>
> >> > I stopped it yesterday. maybe I should run repair again while
> >> > disabling compaction on all nodes?
> >> > thanks!
> >> >
> >> > On Wed, Sep 14, 2011 at 6:57 AM, Peter Schuller
> >> > <peter.schul...@infidyne.com> wrote:
> >> >>
> >> >> > I think it is a serious problem since I can not "repair"..... I am
> >> >> > using cassandra on production servers. is there some way to fix it
> >> >> > without upgrading? I heard that 0.8.x is still not quite ready for
> >> >> > production environments.
> >> >>
> >> >> It is a serious issue if you really need to repair one CF at a time.
> >> >> However, looking at your original post it seems this is not
> >> >> necessarily your issue. Do you need to, or was your concern rather
> >> >> the overall time repair took?
> >> >>
> >> >> There are other things that are improved in 0.8 that affect 0.7. In
> >> >> particular, (1) in 0.7 compaction, including the validating
> >> >> compactions that are part of repair, is non-concurrent, so if your
> >> >> repair starts while there is a long-running compaction going it will
> >> >> have to wait, and (2) semi-related is that the merkle tree
> >> >> calculation that is part of repair/anti-entropy may happen "out of
> >> >> sync" if one of the nodes participating happens to be busy with
> >> >> compaction. This in turn causes additional data to be sent as part
> >> >> of repair.
> >> >>
> >> >> That might be why your immediately following repair took a long
> >> >> time, but it's difficult to tell.
> >> >>
> >> >> If you're having issues with repair and large data sets, I would
> >> >> generally say that upgrading to 0.8 is recommended. However, if
> >> >> you're on 0.7.4, beware of
> >> >> https://issues.apache.org/jira/browse/CASSANDRA-3166
> >> >>
> >> >> --
> >> >> / Peter Schuller (@scode on twitter)
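P.S. for anyone else who hits the same "repair seems to hang" symptom, these are the commands I have been using to watch progress (the host name and log path are just what my install uses):

    nodetool -h <node_host> compactionstats
    nodetool -h <node_host> netstats
    grep -i AntiEntropy /var/log/cassandra/system.log

compactionstats should show the validation compactions triggered by repair, netstats the ranges being streamed between nodes, and the log grep any repair session messages (assuming INFO logging). If all three stay empty for a long time, it is probably the hang Sylvain described rather than just a slow repair.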