Is 0.8 ready for production use? As far as I know, many companies,
including reddit.com, are currently running 0.7 — how do they get
around the repair problem?
On Wed, Sep 14, 2011 at 2:47 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> On Wed, Sep 14, 2011 at 2:38 AM, Yan Chunlu <springri...@gmail.com> wrote:
> > I don't want to repair one CF at a time either.
> > The "nodetool repair" took a week and was still running; compactionstats
> > and netstats showed nothing running on any node, and there were no error
> > messages and no exceptions, so I really have no idea what it was doing.
>
> To add to the list of things repair does wrong in 0.7, we'll have to add that
> if one of the nodes participating in the repair (so any node that shares a
> range with the node on which repair was started) goes down (even for a short
> time), then the repair will simply hang forever doing nothing. And no specific
> error message will be logged. That could be what happened. Again, recent
> releases of 0.8 fix that too.
>
> --
> Sylvain
>
> > I stopped it yesterday. Maybe I should run repair again while disabling
> > compaction on all nodes?
> > Thanks!
> >
> > On Wed, Sep 14, 2011 at 6:57 AM, Peter Schuller
> > <peter.schul...@infidyne.com> wrote:
> >>
> >> > I think it is a serious problem since I cannot "repair"..... I am
> >> > using Cassandra on production servers. Is there some way to fix it
> >> > without upgrading? I have heard that 0.8.x is still not quite ready
> >> > for production environments.
> >>
> >> It is a serious issue if you really need to repair one CF at a time.
> >> However, looking at your original post it seems this is not
> >> necessarily your issue. Do you need to, or was your concern rather the
> >> overall time repair took?
> >>
> >> There are other things that are improved in 0.8 that affect 0.7. In
> >> particular, (1) in 0.7 compaction, including the validating compactions
> >> that are part of repair, is non-concurrent, so if your repair starts
> >> while there is a long-running compaction going it will have to wait,
> >> and (2) semi-related is that the Merkle tree calculation that is part
> >> of repair/anti-entropy may happen "out of sync" if one of the nodes
> >> participating happens to be busy with compaction. This in turn causes
> >> additional data to be sent as part of repair.
> >>
> >> That might be why your immediately following repair took a long time,
> >> but it's difficult to tell.
> >>
> >> If you're having issues with repair and large data sets, I would
> >> generally say that upgrading to 0.8 is recommended. However, if you're
> >> on 0.7.4, beware of
> >> https://issues.apache.org/jira/browse/CASSANDRA-3166
> >>
> >> --
> >> / Peter Schuller (@scode on twitter)
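
For anyone following along, the commands under discussion look roughly
like this — a sketch only: the host, keyspace, and column family names
are placeholders, and exact output differs between 0.7 and 0.8:

    # Repair a single column family instead of the whole keyspace
    # (the per-CF repair that is problematic on 0.7):
    nodetool -h 127.0.0.1 repair MyKeyspace MyColumnFamily

    # Check whether repair is actually making progress: validating
    # compactions show up here...
    nodetool -h 127.0.0.1 compactionstats

    # ...and streaming between nodes shows up here:
    nodetool -h 127.0.0.1 netstats

    # Look for repair / anti-entropy activity in the logs
    # (the log location varies by install):
    grep -i -e repair -e AntiEntropy /var/log/cassandra/system.log

If compactionstats and netstats both stay empty on every node that
shares a range with the repairing node, that matches the hang Sylvain
describes above.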