OK, thanks. So I see we had the same problem (I too had multiple keyspaces, not that I know why that matters to the problem at hand), and I see that upgrading to 0.6.7 solved it for you (I didn't try it; I had a different workaround). But frankly, I don't understand how https://issues.apache.org/jira/browse/CASSANDRA-1676 would relate to the "stuck bootstrap" problem (I'm not saying that it doesn't, I'd just like to understand why...)
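For reference, the two-phase workaround I mean (described further down in the quoted thread) looks roughly like this in storage-conf.xml on the new node. This is only a sketch: the addresses are made up, I'm assuming AutoBootstrap was set to true for the first start, and the real file has many more settings.

```xml
<!-- Phase 1: first start of the new node (hypothetical address 10.0.0.5).
     The node is NOT listed in its own Seeds. Wait until the data has
     finished streaming to it. AutoBootstrap=true is an assumption here. -->
<AutoBootstrap>true</AutoBootstrap>
<Seeds>
    <Seed>10.0.0.1</Seed>
    <Seed>10.0.0.2</Seed>
</Seeds>

<!-- Phase 2: after the transfer is done, add the node to its own Seeds
     and restart it. Only after this restart did it show up in the ring. -->
<Seeds>
    <Seed>10.0.0.1</Seed>
    <Seed>10.0.0.2</Seed>
    <Seed>10.0.0.5</Seed>
</Seeds>
```

I still don't know why the second phase is needed, so treat this as what happened to work on 0.6.6 rather than as a recommended procedure.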
On Wed, Jan 5, 2011 at 5:42 PM, Thibaut Britz <thibaut.br...@trendiction.com> wrote:

> Had the same Problem a while ago. Upgrading solved the problem (Don't know
> if you have to redeploy your cluster though)
>
> http://www.mail-archive.com/user@cassandra.apache.org/msg07106.html
>
> On Wed, Jan 5, 2011 at 4:29 PM, Ran Tavory <ran...@gmail.com> wrote:
>
>> @Thibaut wrong email? Or how's "Avoid dropping messages off the client
>> request path" (CASSANDRA-1676) related to the bootstrap questions I had?
>>
>> On Wed, Jan 5, 2011 at 5:23 PM, Thibaut Britz <thibaut.br...@trendiction.com> wrote:
>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-1676
>>>
>>> you have to use at least 0.6.7
>>>
>>> On Wed, Jan 5, 2011 at 4:19 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>> On Wed, Jan 5, 2011 at 10:05 AM, Ran Tavory <ran...@gmail.com> wrote:
>>>> > In storage-conf I see this comment [1] from which I understand that the
>>>> > recommended way to bootstrap a new node is to set AutoBootstrap=true and
>>>> > remove itself from the seeds list.
>>>> > Moreover, I did try to set AutoBootstrap=true and have the node in its own
>>>> > seeds list, but it would not bootstrap. I don't recall the exact message but
>>>> > it was something like "I found myself in the seeds list therefore I'm not
>>>> > going to bootstrap even though AutoBootstrap is true".
>>>> >
>>>> > [1]
>>>> > <!--
>>>> > ~ Turn on to make new [non-seed] nodes automatically migrate the right data
>>>> > ~ to themselves. (If no InitialToken is specified, they will pick one
>>>> > ~ such that they will get half the range of the most-loaded node.)
>>>> > ~ If a node starts up without bootstrapping, it will mark itself bootstrapped
>>>> > ~ so that you can't subsequently accidently bootstrap a node with
>>>> > ~ data on it. (You can reset this by wiping your data and commitlog
>>>> > ~ directories.)
>>>> > ~
>>>> > ~ Off by default so that new clusters and upgraders from 0.4 don't
>>>> > ~ bootstrap immediately. You should turn this on when you start adding
>>>> > ~ new nodes to a cluster that already has data on it. (If you are upgrading
>>>> > ~ from 0.4, start your cluster with it off once before changing it to true.
>>>> > ~ Otherwise, no data will be lost but you will incur a lot of unnecessary
>>>> > ~ I/O before your cluster starts up.)
>>>> > -->
>>>> > <AutoBootstrap>false</AutoBootstrap>
>>>> > On Wed, Jan 5, 2011 at 4:58 PM, David Boxenhorn <da...@lookin2.com> wrote:
>>>> >>
>>>> >> If "seed list should be the same across the cluster" that means that nodes
>>>> >> *should* have themselves as a seed. If that doesn't work for Ran, then that
>>>> >> is the first problem, no?
>>>> >>
>>>> >> On Wed, Jan 5, 2011 at 3:56 PM, Jake Luciani <jak...@gmail.com> wrote:
>>>> >>>
>>>> >>> Well your ring issues don't make sense to me, seed list should be the
>>>> >>> same across the cluster.
>>>> >>> I'm just thinking of other things to try, non-boostrapped nodes should
>>>> >>> join the ring instantly but reads will fail if you aren't using quorum.
>>>> >>>
>>>> >>> On Wed, Jan 5, 2011 at 8:51 AM, Ran Tavory <ran...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> I haven't tried repair. Should I?
>>>> >>>>
>>>> >>>> On Jan 5, 2011 3:48 PM, "Jake Luciani" <jak...@gmail.com> wrote:
>>>> >>>> > Have you tried not bootstrapping but setting the token and manually
>>>> >>>> > calling repair?
>>>> >>>> >
>>>> >>>> > On Wed, Jan 5, 2011 at 7:07 AM, Ran Tavory <ran...@gmail.com> wrote:
>>>> >>>> >
>>>> >>>> >> My conclusion is lame: I tried this on several hosts and saw the same
>>>> >>>> >> behavior, the only way I was able to join new nodes was to first start them
>>>> >>>> >> when they are *not in* their own seeds list and after they
>>>> >>>> >> finish transferring the data, then restart them with themselves *in* their
>>>> >>>> >> own seeds list. After doing that the node would join the ring.
>>>> >>>> >> This is either my misunderstanding or a bug, but the only place I found it
>>>> >>>> >> documented stated that the new node should not be in its own seeds list.
>>>> >>>> >> Version 0.6.6.
>>>> >>>> >>
>>>> >>>> >> On Wed, Jan 5, 2011 at 10:35 AM, David Boxenhorn <da...@lookin2.com> wrote:
>>>> >>>> >>
>>>> >>>> >>> My nodes all have themselves in their list of seeds - always did - and
>>>> >>>> >>> everything works. (You may ask why I did this. I don't know, I must have
>>>> >>>> >>> copied it from an example somewhere.)
>>>> >>>> >>>
>>>> >>>> >>> On Wed, Jan 5, 2011 at 9:42 AM, Ran Tavory <ran...@gmail.com> wrote:
>>>> >>>> >>>
>>>> >>>> >>>> I was able to make the node join the ring but I'm confused.
>>>> >>>> >>>> What I did is, first when adding the node, this node was not in the seeds
>>>> >>>> >>>> list of itself. AFAIK this is how it's supposed to be. So it was able to
>>>> >>>> >>>> transfer all data to itself from other nodes but then it stayed in the
>>>> >>>> >>>> bootstrapping state.
>>>> >>>> >>>> So what I did (and I don't know why it works), is add this node to the
>>>> >>>> >>>> seeds list in its own storage-conf.xml file. Then restart the server and
>>>> >>>> >>>> then I finally see it in the ring...
>>>> >>>> >>>> If I had added the node to the seeds list of itself when first joining
>>>> >>>> >>>> it, it would not join the ring but if I do it in two phases it did work.
>>>> >>>> >>>> So it's either my misunderstanding or a bug...
>>>> >>>> >>>>
>>>> >>>> >>>> On Wed, Jan 5, 2011 at 7:14 AM, Ran Tavory <ran...@gmail.com> wrote:
>>>> >>>> >>>>
>>>> >>>> >>>>> The new node does not see itself as part of the ring, it sees all others
>>>> >>>> >>>>> but itself, so from that perspective the view is consistent.
>>>> >>>> >>>>> The only problem is that the node never finishes to bootstrap. It stays
>>>> >>>> >>>>> in this state for hours (It's been 20 hours now...)
>>>> >>>> >>>>>
>>>> >>>> >>>>> $ bin/nodetool -p 9004 -h localhost streams
>>>> >>>> >>>>>> Mode: Bootstrapping
>>>> >>>> >>>>>> Not sending any streams.
>>>> >>>> >>>>>> Not receiving any streams.
>>>> >>>> >>>>>
>>>> >>>> >>>>> On Wed, Jan 5, 2011 at 1:20 AM, Nate McCall <n...@riptano.com> wrote:
>>>> >>>> >>>>>
>>>> >>>> >>>>>> Does the new node have itself in the list of seeds per chance? This
>>>> >>>> >>>>>> could cause some issues if so.
>>>> >>>> >>>>>>
>>>> >>>> >>>>>> On Tue, Jan 4, 2011 at 4:10 PM, Ran Tavory <ran...@gmail.com> wrote:
>>>> >>>> >>>>>> > I'm still at lost. I haven't been able to resolve this. I tried
>>>> >>>> >>>>>> > adding another node at a different location on the ring but this node
>>>> >>>> >>>>>> > too remains stuck in the bootstrapping state for many hours without
>>>> >>>> >>>>>> > any of the other nodes being busy with anti compaction or anything
>>>> >>>> >>>>>> > else.
>>>> >>>> >>>>>> > I don't know what's keeping it from finishing the bootstrap, no
>>>> >>>> >>>>>> > CPU, no io, files were already streamed so what is it waiting for?
>>>> >>>> >>>>>> > I read the release notes of 0.6.7 and 0.6.8 and there didn't seem to
>>>> >>>> >>>>>> > be anything addressing a similar issue so I figured there was no point
>>>> >>>> >>>>>> > in upgrading. But let me know if you think there is.
>>>> >>>> >>>>>> > Or any other advice...
>>>> >>>> >>>>>> >
>>>> >>>> >>>>>> > On Tuesday, January 4, 2011, Ran Tavory <ran...@gmail.com> wrote:
>>>> >>>> >>>>>> >> Thanks Jake, but unfortunately the streams directory is empty so I
>>>> >>>> >>>>>> >> don't think that any of the nodes is anti-compacting data right now or had
>>>> >>>> >>>>>> >> been in the past 5 hours. It seems that all the data was already transferred
>>>> >>>> >>>>>> >> to the joining host but the joining node, after having received the data
>>>> >>>> >>>>>> >> would still remain in bootstrapping mode and not join the cluster. I'm not
>>>> >>>> >>>>>> >> sure that *all* data was transferred (perhaps other nodes need to transfer
>>>> >>>> >>>>>> >> more data) but nothing is actually happening so I assume all has been moved.
>>>> >>>> >>>>>> >> Perhaps it's a configuration error from my part. Should I use
>>>> >>>> >>>>>> >> AutoBootstrap=true ? Anything else I should look out for in the
>>>> >>>> >>>>>> >> configuration file or something else?
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> On Tue, Jan 4, 2011 at 4:08 PM, Jake Luciani <jak...@gmail.com> wrote:
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> In 0.6, locate the node doing anti-compaction and look in the
>>>> >>>> >>>>>> >> "streams" subdirectory in the keyspace data dir to monitor the
>>>> >>>> >>>>>> >> anti-compaction progress (it puts new SSTables for bootstrapping node in
>>>> >>>> >>>>>> >> there)
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> On Tue, Jan 4, 2011 at 8:01 AM, Ran Tavory <ran...@gmail.com> wrote:
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> Running nodetool decommission didn't help. Actually the node refused
>>>> >>>> >>>>>> >> to decommission itself (b/c it wasn't part of the ring). So I simply stopped
>>>> >>>> >>>>>> >> the process, deleted all the data directories and started it again. It
>>>> >>>> >>>>>> >> worked in the sense of the node bootstrapped again but as before, after it
>>>> >>>> >>>>>> >> had finished moving the data nothing happened for a long time (I'm still
>>>> >>>> >>>>>> >> waiting, but nothing seems to be happening).
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> Any hints how to analyze a "stuck" bootstrapping node?? thanks
>>>> >>>> >>>>>> >> On Tue, Jan 4, 2011 at 1:51 PM, Ran Tavory <ran...@gmail.com> wrote:
>>>> >>>> >>>>>> >> Thanks Shimi, so indeed anticompaction was run on one of the other
>>>> >>>> >>>>>> >> nodes from the same DC but to my understanding it has already ended. A few
>>>> >>>> >>>>>> >> hour ago...
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> I plenty of log messages such as [1] which ended a couple of hours
>>>> >>>> >>>>>> >> ago, and I've seen the new node streaming and accepting the data from the
>>>> >>>> >>>>>> >> node which performed the anticompaction and so far it was normal so it
>>>> >>>> >>>>>> >> seemed that data is at its right place. But now the new node seems sort of
>>>> >>>> >>>>>> >> stuck. None of the other nodes is anticompacting right now or had been
>>>> >>>> >>>>>> >> anticompacting since then.
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> The new node's CPU is close to zero, it's iostats are almost zero so
>>>> >>>> >>>>>> >> I can't find another bottleneck that would keep it hanging.
>>>> >>>> >>>>>> >> On the IRC someone suggested I'd maybe retry to join this node,
>>>> >>>> >>>>>> >> e.g. decommission and rejoin it again. I'll try it now...
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> [1] INFO [COMPACTION-POOL:1] 2011-01-04 04:04:09,721 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')]
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> INFO [COMPACTION-POOL:1] 2011-01-04 04:34:18,683 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3874-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3873-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3876-Data.db')]
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> INFO [COMPACTION-POOL:1] 2011-01-04 04:34:19,132 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-951-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-976-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-978-Data.db')]
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> INFO [COMPACTION-POOL:1] 2011-01-04 04:34:26,486 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')]
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> On Tue, Jan 4, 2011 at 12:45 PM, shimi <shim...@gmail.com> wrote:
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> In my experience most of the time it takes for a node to join the
>>>> >>>> >>>>>> >> cluster is the anticompaction on the other nodes. The streaming part is very
>>>> >>>> >>>>>> >> fast.
>>>> >>>> >>>>>> >> Check the other nodes logs to see if there is any node doing
>>>> >>>> >>>>>> >> anticompaction. I don't remember how much data I had in the cluster when I
>>>> >>>> >>>>>> >> needed to add/remove nodes. I do remember that it took a few hours.
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >> The node will join the ring only when it will finish the bootstrap.
>>>> >>>> >>>>>> >> --
>>>> >>>> >>>>>> >> /Ran
>>>> >>>> >>>>>> >>
>>>> >>>> >>>>>> >
>>>> >>>> >>>>>> > --
>>>> >>>> >>>>>> > /Ran
>>>> >>>> >>>>>> >
>>>> >>>> >>>>>>
>>>> >>>> >>>>>
>>>> >>>> >>>>> --
>>>> >>>> >>>>> /Ran
>>>> >>>> >>>>>
>>>> >>>> >>>>
>>>> >>>> >>>> --
>>>> >>>> >>>> /Ran
>>>> >>>> >>>>
>>>> >>>> >>>
>>>> >>>> >>
>>>> >>>> >> --
>>>> >>>> >> /Ran
>>>> >>>> >>
>>>> >>>
>>>> >>
>>>> >
>>>> > --
>>>> > /Ran
>>>> >
>>>>
>>>> If non-auto-bootstrap nodes to not join they check to make sure good
>>>> old iptables is not on.
>>>>
>>>> Edward
>>>>
>>>
>>
>> --
>> /Ran
>>
>
--
/Ran