No, that's the only error.  It may be that there were older errors which 
rolled off, but I would expect a rolling restart to produce at least some new 
errors, and it hasn't.  It's been roughly 30 hours since the restart and 
riak-admin transfers still shows the same output.

I suppose I could stop all the nodes, but I hate to do that since it degrades 
my service, though I can only assume it's already degraded anyway?

Are there functions I could run while attached to get more information?
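
For reference, this is the sort of thing I had in mind from the attached 
console (just a sketch; I'm assuming the riak_core calls below, get_my_ring/0, 
my_indices/1, and the riak_core_vnode_sup supervisor name, are actually 
present in 0.14.2):

    %% From `riak attach`: compare what the ring says this node should own
    %% against the vnode processes that are actually alive.
    {ok, Ring} = riak_core_ring_manager:get_my_ring().
    Primaries = riak_core_ring:my_indices(Ring).   %% partitions this node should be running
    length(Primaries).
    %% Rough count of vnode processes running under the vnode supervisor:
    length(supervisor:which_children(riak_core_vnode_sup)).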

-Anthony

On Feb 16, 2012, at 2:21 PM, Joseph Blomstedt <j...@basho.com> wrote:

> Anthony,
> 
> The primary partition warning suggests that a vnode that should be
> running on your node is not running. After a restart, all primary
> vnodes are started, which is why the primary warning goes away.
> However, the fact that the warning re-appears is unexpected. The
> handoff receiver errors also suggest that a vnode that was running
> (and receiving handoff data) suddenly shut down. This is likely the
> same vnode.
> 
> Are there any other details in your logs that would indicate why a
> vnode is shutting down and/or crashing? Any messages about issues
> reading in the Bitcask data for any vnodes?
> 
> -Joe
> 
> On Wed, Feb 15, 2012 at 12:30 PM, Anthony Molinaro
> <antho...@alumni.caltech.edu> wrote:
>> Hi,
>> 
>> I have a 12-node Riak cluster running Riak 0.14.2.  I had several nodes
>> crash with OOM errors, and after restarting them I see the following when
>> running riak-admin transfers
>> 
>> Attempting to restart script through sudo -u riak
>> 'riak@10.5.11.39' waiting to handoff 1 partitions
>> 'riak@10.5.11.39' does not have 1 primary partitions running
>> 'riak@10.5.11.37' waiting to handoff 1 partitions
>> 'riak@10.5.11.37' does not have 1 primary partitions running
>> 'riak@10.5.10.30' waiting to handoff 1 partitions
>> 'riak@10.5.10.30' does not have 1 primary partitions running
>> 
>> The only errors in the whole cluster are 2 errors on 10.5.10.30, both of
>> the form
>> 
>> 
>> =ERROR REPORT==== 15-Feb-2012::17:49:38 ===
>> Handoff receiver for partition
>> 1044745311060762632934665329637030439832170528768
>> exiting abnormally after processing 7 objects:
>> {timeout,
>> {gen_fsm,
>> sync_send_all_state_event,
>> [<0.1299.1>,
>>  {handoff_data,
>>   
>> <<141,144,203,78,2,49,20,134,207,12,76,4,77,12,49,38,38,174,221,54,25,102,64,134,189,151,141,168,33,6,217,145,127,218,142,157,138,29,3,229,9,120,3,119,190,133,91,151,238,92,243,68,182,74,162,184,162,77,78,206,245,59,253,187,27,204,15,78,178,110,154,138,62,151,44,73,58,49,139,139,172,199,210,12,9,227,167,73,12,153,246,179,162,43,143,95,107,75,181,35,168,49,155,84,185,150,220,62,17,81,48,247,118,171,249,169,111,87,53,65,205,217,132,87,198,74,99,85,83,80,93,148,220,34,68,203,221,6,110,17,171,150,254,119,84,240,55,247,205,241,230,200,175,222,27,179,97,137,71,54,186,195,3,122,24,59,66,31,6,23,232,224,25,9,46,113,239,162,129,243,135,46,51,69,14,141,54,82,156,109,144,2,79,58,92,147,174,48,183,108,80,137,178,40,165,80,181,156,40,106,231,20,209,75,78,225,232,195,141,249,230,214,242,71,78,136,243,252,230,154,175,214,48,21,250,98,157,150,246,221,149,2,19,209,74,191,125,238,235,95,161,193,246,66,55,159,40,40,226,83,9,227,64,118,182,144,90,187,143,92,24,33,139,210,72,241,5>>},60000]}}
>> 
>> =ERROR REPORT==== 15-Feb-2012::17:49:41 ===
>> Handoff receiver for partition
>> 1044745311060762632934665329637030439832170528768
>> exiting abnormally after processing 7 objects:
>> {timeout,
>> {gen_fsm,
>> sync_send_all_state_event,
>> [<0.1299.1>,
>>  {handoff_data,
>>   
>> <<141,144,75,78,195,48,16,134,39,105,2,41,72,168,2,36,36,214,108,88,88,74,232,35,112,128,34,22,45,32,132,16,130,69,245,59,118,112,210,226,208,36,101,193,182,27,14,193,33,184,0,123,142,133,13,149,160,172,234,209,140,172,121,124,227,223,27,78,181,125,16,71,224,113,24,167,44,13,123,156,197,221,78,155,161,23,113,150,182,79,142,142,35,36,157,94,39,222,127,107,204,213,186,160,160,28,21,60,151,73,253,72,68,78,101,227,74,243,19,219,174,26,130,154,229,40,41,116,45,117,173,154,130,60,145,37,53,92,180,140,5,184,68,168,90,249,191,163,156,191,185,111,142,13,123,118,245,230,45,187,202,48,102,55,215,120,64,23,5,198,198,43,115,215,40,241,132,33,6,120,49,126,10,133,115,72,244,49,53,89,99,75,36,199,146,118,23,164,1,170,154,13,11,145,165,153,20,170,193,137,252,136,147,79,103,156,214,158,61,51,102,155,119,230,63,114,92,83,110,222,241,139,254,97,176,224,41,215,214,61,239,254,35,84,46,28,237,211,107,254,254,185,149,255,106,117,86,215,186,252,74,65,126,50,145,208,6,84,151,51,153,231,230,47,103,90,200,52,211,82,124,1>>},60000]}}
>> 
>> I tried rolling restarts through all the nodes, which seemed to temporarily
>> fix this particular node, but then I think this error cropped up.
>> 
>> If there's anything I can try or more information I can give let me know.
>> If there's anything I can try or more information I can give, let me know.
>> The boxes have 16 cores and 24 GB of memory, with data in Bitcask on an SSD
>> drive; there are 1024 partitions spread across the 12 machines.  Each machine
>> does roughly 55-120K vnode gets per second, 20-40K node gets per second,
>> 1-2K vnode puts, and 1-2K node puts.
>> 
>> Thanks for the help,
>> 
>> -Anthony
>> 
>> --
>> ------------------------------------------------------------------------
>> Anthony Molinaro                           <antho...@alumni.caltech.edu>
>> 
> 
> 
> 
> -- 
> Joseph Blomstedt <j...@basho.com>
> Software Engineer
> Basho Technologies, Inc.
> http://www.basho.com/

