Hi, I have a 12-node Riak cluster running Riak 0.14.2. Several nodes crashed with OOM errors, and after restarting them I see the following when running riak-admin transfers:

Attempting to restart script through sudo -u riak
'riak@10.5.11.39' waiting to handoff 1 partitions
'riak@10.5.11.39' does not have 1 primary partitions running
'riak@10.5.11.37' waiting to handoff 1 partitions
'riak@10.5.11.37' does not have 1 primary partitions running
'riak@10.5.10.30' waiting to handoff 1 partitions
'riak@10.5.10.30' does not have 1 primary partitions running

The only errors in the whole cluster are 2 errors on 10.5.10.30, both of the form:

=ERROR REPORT==== 15-Feb-2012::17:49:38 ===
Handoff receiver for partition 1044745311060762632934665329637030439832170528768 exiting abnormally after processing 7 objects: {timeout, {gen_fsm, sync_send_all_state_event, [<0.1299.1>, {handoff_data, <<141,144,203,78,2,49,20,134,207,12,76,4,77,12,49,38,38,174,221,54,25,102,64,134,189,151,141,168,33,6,217,145,127,218,142,157,138,29,3,229,9,120,3,119,190,133,91,151,238,92,243,68,182,74,162,184,162,77,78,206,245,59,253,187,27,204,15,78,178,110,154,138,62,151,44,73,58,49,139,139,172,199,210,12,9,227,167,73,12,153,246,179,162,43,143,95,107,75,181,35,168,49,155,84,185,150,220,62,17,81,48,247,118,171,249,169,111,87,53,65,205,217,132,87,198,74,99,85,83,80,93,148,220,34,68,203,221,6,110,17,171,150,254,119,84,240,55,247,205,241,230,200,175,222,27,179,97,137,71,54,186,195,3,122,24,59,66,31,6,23,232,224,25,9,46,113,239,162,129,243,135,46,51,69,14,141,54,82,156,109,144,2,79,58,92,147,174,48,183,108,80,137,178,40,165,80,181,156,40,106,231,20,209,75,78,225,232,195,141,249,230,214,242,71,78,136,243,252,230,154,175,214,48,21,250,98,157,150,246,221,149,2,19,209,74,191,125,238,235,95,161,193,246,66,55,159,40,40,226,83,9,227,64,118,182,144,90,187,143,92,24,33,139,210,72,241,5>>},60000]}}

=ERROR REPORT==== 15-Feb-2012::17:49:41 ===
Handoff receiver for partition 1044745311060762632934665329637030439832170528768 exiting abnormally after processing 7 objects: {timeout, {gen_fsm, sync_send_all_state_event, [<0.1299.1>, {handoff_data, <<141,144,75,78,195,48,16,134,39,105,2,41,72,168,2,36,36,214,108,88,88,74,232,35,112,128,34,22,45,32,132,16,130,69,245,59,118,112,210,226,208,36,101,193,182,27,14,193,33,184,0,123,142,133,13,149,160,172,234,209,140,172,121,124,227,223,27,78,181,125,16,71,224,113,24,167,44,13,123,156,197,221,78,155,161,23,113,150,182,79,142,142,35,36,157,94,39,222,127,107,204,213,186,160,160,28,21,60,151,73,253,72,68,78,101,227,74,243,19,219,174,26,130,154,229,40,41,116,45,117,173,154,130,60,145,37,53,92,180,140,5,184,68,168,90,249,191,163,156,191,185,111,142,13,123,118,245,230,45,187,202,48,102,55,215,120,64,23,5,198,198,43,115,215,40,241,132,33,6,120,49,126,10,133,115,72,244,49,53,89,99,75,36,199,146,118,23,164,1,170,154,13,11,145,165,153,20,170,193,137,252,136,147,79,103,156,214,158,61,51,102,155,119,230,63,114,92,83,110,222,241,139,254,97,176,224,41,215,214,61,239,254,35,84,46,28,237,211,107,254,254,185,149,255,106,117,86,215,186,252,74,65,126,50,145,208,6,84,151,51,153,231,230,47,103,90,200,52,211,82,124,1>>},60000]}}

I tried restarting all the nodes one after another, which seemed to temporarily fix this particular node, but then I think this error cropped up. If there's anything I can try, or more information I can give, let me know.

The boxes have 16 cores and 24 GB of memory, with data in bitcask on an SSD drive. There are 1024 partitions spread across 12 machines. Each machine does roughly 55-120K vnode gets per second, 20-40K node gets per second, 1-2K vnode puts, and 1-2K node puts.

Thanks for the help,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro <antho...@alumni.caltech.edu>
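
For reference, the partition ID in the two error reports can be mapped back to a position in the ring, which may make it easier to correlate with the nodes listed by riak-admin transfers. This is only a minimal sketch: it assumes the ring splits the 160-bit SHA-1 keyspace into equal ranges and that a partition ID is the start of its range. The names partition_index, RING_TOP, and PARTITION_ID are made up for this example and are not part of any Riak tool.

```python
# Assumption: partitions are equal slices of the 2**160 SHA-1 keyspace,
# and each partition ID is the first value of its slice.
RING_TOP = 2 ** 160


def partition_index(partition_id: int, ring_size: int) -> int:
    """Return the zero-based position of a partition ID within the ring."""
    step = RING_TOP // ring_size  # width of one partition's key range
    index, remainder = divmod(partition_id, step)
    if remainder != 0 or not 0 <= index < ring_size:
        raise ValueError("ID is not a partition boundary for this ring size")
    return index


# Partition ID copied from the handoff receiver errors above.
PARTITION_ID = 1044745311060762632934665329637030439832170528768

if __name__ == "__main__":
    # Position of the failing partition in a 1024-partition ring.
    print(partition_index(PARTITION_ID, 1024))
    # Even spread of 1024 partitions over 12 nodes: (per node, left over).
    print(divmod(1024, 12))
```

Under those assumptions the failing partition is index 732 of 1024, and an even spread puts 85 partitions on each node with 4 left over, so four nodes would hold 86.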