No, that's the only error. It may be that there were older errors that rolled out of the logs, but I would expect a rolling restart to have produced at least some new errors, and it hasn't. It's been 30 hours or so since the restart and riak-admin transfers still shows the same output.

I guess I might be able to stop all the nodes, but I hate to do that as it degrades my service, though I can only assume it's already degraded? Are there functions I could run while attached to get more information?
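Something along these lines is what I had in mind from riak attach. This is only a guess at the right calls based on the riak_core module names (riak_core_ring_manager, riak_core_ring, riak_core_vnode_sup), so correct me if 0.14.2 does it differently:

    %% Run in the console from riak attach; the calls below are
    %% assumptions based on riak_core and may not match 0.14.2 exactly.
    {ok, Ring} = riak_core_ring_manager:get_my_ring().

    %% Partitions this node owns as primaries according to the ring.
    Owned = [Idx || {Idx, Owner} <- riak_core_ring:all_owners(Ring),
                    Owner =:= node()].

    %% Vnode processes currently running under the vnode supervisor
    %% (this count includes any fallback vnodes, not just primaries).
    Running = supervisor:which_children(riak_core_vnode_sup).

    {length(Owned), length(Running)}.

If those counts don't line up, allowing for the one fallback that's waiting to hand off, that would at least confirm a primary vnode really is missing on that node.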
-Anthony

On Feb 16, 2012, at 2:21 PM, Joseph Blomstedt <j...@basho.com> wrote:

> Anthony,
>
> The primary partition warning suggests that a vnode that should be
> running on your node is not running. After a restart, all primary
> vnodes are started, which is why the primary warning goes away.
> However, the fact that the warning re-appears is unexpected. The
> handoff receiver errors also suggest that a vnode that was running
> (and receiving handoff data) suddenly shut down. This is likely the
> same vnode.
>
> Are there any other details in your logs that would indicate why a
> vnode is shutting down and/or crashing? Any messages about issues
> reading in the Bitcask data for any vnodes?
>
> -Joe
>
> On Wed, Feb 15, 2012 at 12:30 PM, Anthony Molinaro
> <antho...@alumni.caltech.edu> wrote:
>> Hi,
>>
>> I have a 12 node riak cluster running riak 0.14.2. I had several nodes
>> crash with OOM errors, and after restarting them I see the following when
>> running riak-admin transfers:
>>
>> Attempting to restart script through sudo -u riak
>> 'riak@10.5.11.39' waiting to handoff 1 partitions
>> 'riak@10.5.11.39' does not have 1 primary partitions running
>> 'riak@10.5.11.37' waiting to handoff 1 partitions
>> 'riak@10.5.11.37' does not have 1 primary partitions running
>> 'riak@10.5.10.30' waiting to handoff 1 partitions
>> 'riak@10.5.10.30' does not have 1 primary partitions running
>>
>> The only errors in the whole cluster are 2 errors on 10.5.10.30, both of
>> the form
>>
>> =ERROR REPORT==== 15-Feb-2012::17:49:38 ===
>> Handoff receiver for partition
>> 1044745311060762632934665329637030439832170528768
>> exiting abnormally after processing 7 objects:
>> {timeout,
>>   {gen_fsm,
>>     sync_send_all_state_event,
>>     [<0.1299.1>,
>>      {handoff_data,
>>       <<141,144,203,78,2,49,20,134,207,12,76,4,77,12,49,38,38,174,221,54,25,102,64,134,189,151,141,168,33,6,217,145,127,218,142,157,138,29,3,229,9,120,3,119,190,133,91,151,238,92,243,68,182,74,162,184,162,77,78,206,245,59,253,187,27,204,15,78,178,110,154,138,62,151,44,73,58,49,139,139,172,199,210,12,9,227,167,73,12,153,246,179,162,43,143,95,107,75,181,35,168,49,155,84,185,150,220,62,17,81,48,247,118,171,249,169,111,87,53,65,205,217,132,87,198,74,99,85,83,80,93,148,220,34,68,203,221,6,110,17,171,150,254,119,84,240,55,247,205,241,230,200,175,222,27,179,97,137,71,54,186,195,3,122,24,59,66,31,6,23,232,224,25,9,46,113,239,162,129,243,135,46,51,69,14,141,54,82,156,109,144,2,79,58,92,147,174,48,183,108,80,137,178,40,165,80,181,156,40,106,231,20,209,75,78,225,232,195,141,249,230,214,242,71,78,136,243,252,230,154,175,214,48,21,250,98,157,150,246,221,149,2,19,209,74,191,125,238,235,95,161,193,246,66,55,159,40,40,226,83,9,227,64,118,182,144,90,187,143,92,24,33,139,210,72,241,5>>},60000]}}
>>
>> =ERROR REPORT==== 15-Feb-2012::17:49:41 ===
>> Handoff receiver for partition
>> 1044745311060762632934665329637030439832170528768
>> exiting abnormally after processing 7 objects:
>> {timeout,
>>   {gen_fsm,
>>     sync_send_all_state_event,
>>     [<0.1299.1>,
>>      {handoff_data,
>>       <<141,144,75,78,195,48,16,134,39,105,2,41,72,168,2,36,36,214,108,88,88,74,232,35,112,128,34,22,45,32,132,16,130,69,245,59,118,112,210,226,208,36,101,193,182,27,14,193,33,184,0,123,142,133,13,149,160,172,234,209,140,172,121,124,227,223,27,78,181,125,16,71,224,113,24,167,44,13,123,156,197,221,78,155,161,23,113,150,182,79,142,142,35,36,157,94,39,222,127,107,204,213,186,160,160,28,21,60,151,73,253,72,68,78,101,227,74,243,19,219,174,26,130,154,229,40,41,116,45,117,173,154,130,60,145,37,53,92,180,140,5,184,68,168,90,249,191,163,156,191,185,111,142,13,123,118,245,230,45,187,202,48,102,55,215,120,64,23,5,198,198,43,115,215,40,241,132,33,6,120,49,126,10,133,115,72,244,49,53,89,99,75,36,199,146,118,23,164,1,170,154,13,11,145,165,153,20,170,193,137,252,136,147,79,103,156,214,158,61,51,102,155,119,230,63,114,92,83,110,222,241,139,254,97,176,224,41,215,214,61,239,254,35,84,46,28,237,211,107,254,254,185,149,255,106,117,86,215,186,252,74,65,126,50,145,208,6,84,151,51,153,231,230,47,103,90,200,52,211,82,124,1>>},60000]}}
>>
>> I tried strobing through restarting all nodes, which seemed to temporarily
>> fix this particular node, but then I think this error cropped up.
>>
>> If there's anything I can try or more information I can give, let me know.
>> The boxes are 16 core, 24 GB memory, with data in bitcask on an SSD drive,
>> and there are 1024 partitions spread across 12 machines. Each machine does
>> roughly 55-120K vnode gets per second, 20-40K node gets per second, 1-2K
>> vnode puts, and 1-2K node puts.
>>
>> Thanks for the help,
>>
>> -Anthony
>>
>> --
>> ------------------------------------------------------------------------
>> Anthony Molinaro <antho...@alumni.caltech.edu>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users@lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
> --
> Joseph Blomstedt <j...@basho.com>
> Software Engineer
> Basho Technologies, Inc.
> http://www.basho.com/

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
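For reference on the error shape itself: the {timeout, {gen_fsm, sync_send_all_state_event, [...]}} term in the quoted reports is what an OTP gen_fsm synchronous call produces when the process being called does not answer within the given timeout (60000 ms here). A rough sketch of that pattern, not the actual riak_core handoff receiver code, with made-up module and function names:

    -module(handoff_sketch).
    -export([forward_to_vnode/2]).

    %% Hypothetical sketch: hand one decoded handoff object to the local
    %% vnode process and wait up to 60 seconds for an acknowledgement.
    %% If that vnode has died or is blocked, the sync call exits with
    %% {timeout, {gen_fsm, sync_send_all_state_event, [Pid, Event, 60000]}},
    %% which is the term shown in the error reports above.
    forward_to_vnode(VnodePid, Bin) ->
        try
            gen_fsm:sync_send_all_state_event(VnodePid, {handoff_data, Bin}, 60000)
        catch
            exit:{timeout, Details} ->
                {error, {timeout, Details}}
        end.

In other words, the receiver itself is probably fine; the timeout points at the vnode on the receiving side going away mid-handoff, which would match the primary-partition warning.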