Hi,

  I have a 12-node Riak cluster running Riak 0.14.2.  Several nodes
crashed with OOM errors, and after restarting them I see the following
when running riak-admin transfers:

Attempting to restart script through sudo -u riak
'riak@10.5.11.39' waiting to handoff 1 partitions
'riak@10.5.11.39' does not have 1 primary partitions running
'riak@10.5.11.37' waiting to handoff 1 partitions
'riak@10.5.11.37' does not have 1 primary partitions running
'riak@10.5.10.30' waiting to handoff 1 partitions
'riak@10.5.10.30' does not have 1 primary partitions running
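
To track whether these counts change between restarts, the output above can be tallied per node. This is only a minimal sketch: the two regular expressions are assumptions based on the exact line wording pasted above, and `summarize` is a hypothetical helper name, not part of riak-admin.

```python
import re
from collections import defaultdict

# Sample copied from the riak-admin transfers output above.
SAMPLE = """\
Attempting to restart script through sudo -u riak
'riak@10.5.11.39' waiting to handoff 1 partitions
'riak@10.5.11.39' does not have 1 primary partitions running
'riak@10.5.11.37' waiting to handoff 1 partitions
'riak@10.5.11.37' does not have 1 primary partitions running
'riak@10.5.10.30' waiting to handoff 1 partitions
'riak@10.5.10.30' does not have 1 primary partitions running
"""

# Assumed line formats, inferred only from the output shown above.
HANDOFF = re.compile(r"^'([^']+)' waiting to handoff (\d+) partitions$")
PRIMARY = re.compile(r"^'([^']+)' does not have (\d+) primary partitions running$")


def summarize(text):
    """Return {node: {"handoff": n, "missing_primary": m}} for matching lines."""
    summary = defaultdict(lambda: {"handoff": 0, "missing_primary": 0})
    for raw in text.splitlines():
        line = raw.strip()
        match = HANDOFF.match(line)
        if match:
            summary[match.group(1)]["handoff"] += int(match.group(2))
            continue
        match = PRIMARY.match(line)
        if match:
            summary[match.group(1)]["missing_primary"] += int(match.group(2))
        # Any other line (such as the sudo notice) is ignored.
    return dict(summary)


if __name__ == "__main__":
    for node, counts in sorted(summarize(SAMPLE).items()):
        print(node, counts)
```

Running it against saved output from before and after a restart makes it easy to see which nodes still have pending handoffs.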

The only errors in the whole cluster are 2 errors on 10.5.10.30, both of
the form:


=ERROR REPORT==== 15-Feb-2012::17:49:38 ===
Handoff receiver for partition
1044745311060762632934665329637030439832170528768
exiting abnormally after processing 7 objects:
{timeout,
 {gen_fsm,
  sync_send_all_state_event,
  [<0.1299.1>,
   {handoff_data,
    
<<141,144,203,78,2,49,20,134,207,12,76,4,77,12,49,38,38,174,221,54,25,102,64,134,189,151,141,168,33,6,217,145,127,218,142,157,138,29,3,229,9,120,3,119,190,133,91,151,238,92,243,68,182,74,162,184,162,77,78,206,245,59,253,187,27,204,15,78,178,110,154,138,62,151,44,73,58,49,139,139,172,199,210,12,9,227,167,73,12,153,246,179,162,43,143,95,107,75,181,35,168,49,155,84,185,150,220,62,17,81,48,247,118,171,249,169,111,87,53,65,205,217,132,87,198,74,99,85,83,80,93,148,220,34,68,203,221,6,110,17,171,150,254,119,84,240,55,247,205,241,230,200,175,222,27,179,97,137,71,54,186,195,3,122,24,59,66,31,6,23,232,224,25,9,46,113,239,162,129,243,135,46,51,69,14,141,54,82,156,109,144,2,79,58,92,147,174,48,183,108,80,137,178,40,165,80,181,156,40,106,231,20,209,75,78,225,232,195,141,249,230,214,242,71,78,136,243,252,230,154,175,214,48,21,250,98,157,150,246,221,149,2,19,209,74,191,125,238,235,95,161,193,246,66,55,159,40,40,226,83,9,227,64,118,182,144,90,187,143,92,24,33,139,210,72,241,5>>},60000]}}

=ERROR REPORT==== 15-Feb-2012::17:49:41 ===
Handoff receiver for partition
1044745311060762632934665329637030439832170528768
exiting abnormally after processing 7 objects: 
{timeout,
 {gen_fsm,
  sync_send_all_state_event,
  [<0.1299.1>,
   {handoff_data,
    
<<141,144,75,78,195,48,16,134,39,105,2,41,72,168,2,36,36,214,108,88,88,74,232,35,112,128,34,22,45,32,132,16,130,69,245,59,118,112,210,226,208,36,101,193,182,27,14,193,33,184,0,123,142,133,13,149,160,172,234,209,140,172,121,124,227,223,27,78,181,125,16,71,224,113,24,167,44,13,123,156,197,221,78,155,161,23,113,150,182,79,142,142,35,36,157,94,39,222,127,107,204,213,186,160,160,28,21,60,151,73,253,72,68,78,101,227,74,243,19,219,174,26,130,154,229,40,41,116,45,117,173,154,130,60,145,37,53,92,180,140,5,184,68,168,90,249,191,163,156,191,185,111,142,13,123,118,245,230,45,187,202,48,102,55,215,120,64,23,5,198,198,43,115,215,40,241,132,33,6,120,49,126,10,133,115,72,244,49,53,89,99,75,36,199,146,118,23,164,1,170,154,13,11,145,165,153,20,170,193,137,252,136,147,79,103,156,214,158,61,51,102,155,119,230,63,114,92,83,110,222,241,139,254,97,176,224,41,215,214,61,239,254,35,84,46,28,237,211,107,254,254,185,149,255,106,117,86,215,186,252,74,65,126,50,145,208,6,84,151,51,153,231,230,47,103,90,200,52,211,82,124,1>>},60000]}}
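
Both reports name the same partition. For reference, the long partition ID can be turned into a position in the ring. This is a rough sketch that assumes the ring splits the 160-bit (SHA-1) key space evenly across this cluster's 1024 partitions, so that each partition ID is a multiple of 2^160 / 1024; `partition_index` is a hypothetical helper name.

```python
RING_SIZE = 1024      # number of partitions in this cluster
KEYSPACE = 2 ** 160   # assumed size of the ring's hash space (SHA-1)

# Partition ID copied from the two error reports above.
PARTITION_ID = 1044745311060762632934665329637030439832170528768


def partition_index(partition_id, ring_size=RING_SIZE):
    """Map a partition ID to its zero-based position in the ring."""
    width = KEYSPACE // ring_size  # key range covered by one partition
    index, remainder = divmod(partition_id, width)
    if remainder != 0:
        # Under the even-split assumption every ID is an exact multiple.
        raise ValueError("ID is not on a partition boundary for this ring size")
    return index


if __name__ == "__main__":
    print(partition_index(PARTITION_ID))
```

Under that assumption the ID is an exact multiple of the partition width, which is at least consistent with a ring size of 1024.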

I tried restarting all nodes one at a time, which seemed to temporarily
fix this particular node, but then I think this error cropped up.

If there's anything I can try, or more information I can give, let me know.
The boxes are 16-core with 24 GB of memory, with data in Bitcask on an SSD.
There are 1024 partitions spread across 12 machines.  Each machine does
roughly 55-120K vnode gets per second, 20-40K node gets per second, 1-2K
vnode puts, and 1-2K node puts.

Thanks for the help,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <antho...@alumni.caltech.edu>
