Four days have passed and the node is still joining.
I haven't tried restarting it (which would probably fix the issue) because
I would like to find out the real reason for the stall and what to do to
avoid it in the future.
Any suggestions?
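
A minimal Python sketch for spotting this state automatically, assuming the member-status column layout shown in the output quoted below. It is not part of Riak; the names parse_members and stuck_joining are made up for illustration.

```python
import re

# Sample copied from the member-status output quoted in this thread.
SAMPLE = """\
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
joining    33.2%      --      'riak@10.173.240.3'
valid      33.6%      --      'riak@10.173.240.2'
valid      33.2%      --      'riak@10.173.240.21'
-------------------------------------------------------------------------------
Valid:2 / Leaving:0 / Exiting:0 / Joining:1 / Down:0
"""

# One member row: status, ring share, pending share (or "--"), quoted node name.
ROW = re.compile(r"^(\w+)\s+([\d.]+)%\s+(--|[\d.]+%)\s+'([^']+)'\s*$")


def parse_members(text):
    """Return a list of (status, ring_pct, pending, node) tuples."""
    members = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            status, ring, pending, node = m.groups()
            members.append((status, float(ring), pending, node))
    return members


def stuck_joining(members):
    """Nodes still 'joining' while no ownership change is pending."""
    return [node for status, _, pending, node in members
            if status == "joining" and pending == "--"]


if __name__ == "__main__":
    print(stuck_joining(parse_members(SAMPLE)))
```

Run periodically against fresh member-status output, a check like this would flag a node that stays in 'joining' without any pending transfer, which is the situation described here.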

Daniel


On 27 June 2013 00:19, Daniel Iwan <iwan.dan...@gmail.com> wrote:

> Hi all,
>
> I see a node that has been stalled at 'joining' for a good 8 hours now:
> 3-node cluster, v1.3.1, 512 vnodes (way too high, but that's another
> matter), leveldb backend.
> The cluster was originally 2 nodes only; after upgrading to 1.3.1 we
> attached another node.
> There are no active transfers on the nodes at the moment, but from the
> error log it looks like vnode transfer stopped.
>
> riak-admin member-status
> Attempting to restart script through sudo -H -u riak
> ================================= Membership ==================================
> Status     Ring    Pending    Node
> -------------------------------------------------------------------------------
> joining    33.2%      --      'riak@10.173.240.3'
> valid      33.6%      --      'riak@10.173.240.2'
> valid      33.2%      --      'riak@10.173.240.21'
> -------------------------------------------------------------------------------
> Valid:2 / Leaving:0 / Exiting:0 / Joining:1 / Down:0
>
>
>
> riak-admin ring-status
> Attempting to restart script through sudo -H -u riak
> ================================== Claimant ===================================
> Claimant:  'riak@10.173.240.21'
> Status:     up
> Ring Ready: true
>
> ============================== Ownership Handoff ==============================
> No pending changes.
>
> ============================== Unreachable Nodes ==============================
> All nodes are up and reachable
>
>
>
> The node that is joining shows the following.
> *error.log*
> 2013-06-26 16:35:50.351 [error] <0.7222.0> gen_fsm <0.7222.0> in state
> ready terminated with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.7223.0>,stop]}}
> 2013-06-26 16:35:51.549 [error] <0.7222.0> CRASH REPORT Process <0.7222.0>
> with 0 neighbours exited with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.7223.0>,stop]}} in
> gen_fsm:handle_msg/7 line 511
> 2013-06-26 16:35:52.460 [error] <0.5316.0> gen_fsm <0.5316.0> in state
> ready terminated with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.5317.0>,stop]}}
> 2013-06-26 16:35:52.696 [error] <0.5316.0> CRASH REPORT Process <0.5316.0>
> with 0 neighbours exited with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.5317.0>,stop]}} in
> gen_fsm:handle_msg/7 line 511
> 2013-06-26 16:35:52.867 [error] <0.9088.0> gen_fsm <0.9088.0> in state
> ready terminated with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.9089.0>,stop]}}
> 2013-06-26 16:35:53.007 [error] <0.9088.0> CRASH REPORT Process <0.9088.0>
> with 0 neighbours exited with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.9089.0>,stop]}} in
> gen_fsm:handle_msg/7 line 511
> 2013-06-26 16:35:53.127 [error] <0.5794.0> gen_fsm <0.5794.0> in state
> ready terminated with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.5795.0>,stop]}}
> 2013-06-26 16:35:53.248 [error] <0.5794.0> CRASH REPORT Process <0.5794.0>
> with 0 neighbours exited with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.5795.0>,stop]}} in
> gen_fsm:handle_msg/7 line 511
> 2013-06-26 16:35:53.368 [error] <0.3990.0> gen_fsm <0.3990.0> in state
> ready terminated with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.3991.0>,stop]}}
> 2013-06-26 16:35:53.498 [error] <0.3990.0> CRASH REPORT Process <0.3990.0>
> with 0 neighbours exited with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.3991.0>,stop]}} in
> gen_fsm:handle_msg/7 line 511
> 2013-06-26 16:35:53.619 [error] <0.7017.0> gen_fsm <0.7017.0> in state
> ready terminated with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.7018.0>,stop]}}
> 2013-06-26 16:35:53.739 [error] <0.7017.0> CRASH REPORT Process <0.7017.0>
> with 0 neighbours exited with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.7018.0>,stop]}} in
> gen_fsm:handle_msg/7 line 511
> 2013-06-26 16:35:53.859 [error] <0.5120.0> gen_fsm <0.5120.0> in state
> ready terminated with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.5121.0>,stop]}}
> 2013-06-26 16:35:53.980 [error] <0.5120.0> CRASH REPORT Process <0.5120.0>
> with 0 neighbours exited with reason:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.5121.0>,stop]}} in
> gen_fsm:handle_msg/7 line 511
> 2013-06-26 16:41:27.819 [error]
> <0.10046.0>@riak_core_handoff_sender:start_fold:226 hinted_handoff transfer
> of riak_kv_vnode from 'riak@10.173.240.3'
> 961964944883816959833284743408959721921971224576 to 'riak@10.173.240.2'
> 961964944883816959833284743408959721921971224576 failed because of
> exit:{noproc,{riak_core_gen_server,call,[{riak_kv_handoff_listener,'riak@10.173.240.2'},handoff_port,infinity]}}
> [{riak_core_gen_server,call,3,[{file,"src/riak_core_gen_server.erl"},{line,214}]},{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,84}]}]
> 2013-06-26 16:41:27.819 [error]
> <0.10045.0>@riak_core_handoff_sender:start_fold:226 hinted_handoff transfer
> of riak_kv_vnode from 'riak@10.173.240.3'
> 790695221759101774106290427075020305556041629696 to 'riak@10.173.240.2'
> 790695221759101774106290427075020305556041629696 failed because of
> exit:{noproc,{riak_core_gen_server,call,[{riak_kv_handoff_listener,'riak@10.173.240.2'},handoff_port,infinity]}}
> [{riak_core_gen_server,call,3,[{file,"src/riak_core_gen_server.erl"},{line,214}]},{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,84}]}]
> 2013-06-26 16:41:31.153 [error]
> <0.10061.0>@riak_core_handoff_sender:start_fold:226 hinted_handoff transfer
> of riak_kv_vnode from 'riak@10.173.240.3'
> 379647886259785328361504067873565706277810601984 to 'riak@10.173.240.2'
> 379647886259785328361504067873565706277810601984 failed because of
> exit:{noproc,{riak_core_gen_server,call,[{riak_kv_handoff_listener,'riak@10.173.240.2'},handoff_port,infinity]}}
> [{riak_core_gen_server,call,3,[{file,"src/riak_core_gen_server.erl"},{line,214}]},{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,84}]}]
> 2013-06-26 16:45:23.242 [error] emulator Error in process <0.25919.0> on
> node 'riak@10.173.240.3' with exit value:
> {badarg,[{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[{file,"src/riak_core_stat.erl"},{line,168}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[{file,"src/riak_core_stat.erl"},{line,169}]},{riak_core_stat,vnodeq_stats...
>
>
>
> *console log:*
> 2013-06-26 16:52:14.834 [info]
> <0.30751.0>@riak_core_handoff_receiver:process_message:99 Receiving handoff
> data for partition
> riak_kv_vnode:822094670998632891489572718402909198556462055424
> 2013-06-26 16:52:14.875 [info]
> <0.30751.0>@riak_core_handoff_receiver:handle_info:69 Handoff receiver for
> partition 822094670998632891489572718402909198556462055424 exited after
> processing 0 objects
> 2013-06-26 16:52:16.693 [info]
> <0.30763.0>@riak_core_handoff_receiver:process_message:99 Receiving handoff
> data for partition
> riak_kv_vnode:125597796958124469533129165311555572001681702912
> 2013-06-26 16:52:16.731 [info]
> <0.30763.0>@riak_core_handoff_receiver:handle_info:69 Handoff receiver for
> partition 125597796958124469533129165311555572001681702912 exited after
> processing 0 objects
> 2013-06-26 18:52:20.507 [info]
> <0.18639.3>@riak_kv_exchange_fsm:key_exchange:204 Repaired 2010 keys during
> active anti-entropy exchange of
> {436737793968023723603835506651545511733120466944,3} between
> {439592289353435643365952078590444502005885960192,'riak@10.173.240.3'}
> and {442446784738847563128068650529343492278651453440,'riak@10.173.240.21'}
> 2013-06-26 18:52:49.282 [info]
> <0.22964.3>@riak_kv_exchange_fsm:key_exchange:204 Repaired 1809 keys during
> active anti-entropy exchange of
> {439592289353435643365952078590444502005885960192,3} between
> {439592289353435643365952078590444502005885960192,'riak@10.173.240.3'}
> and {442446784738847563128068650529343492278651453440,'riak@10.173.240.21'}
> 2013-06-26 21:43:21.418 [info]
> <0.17334.7>@riak_kv_exchange_fsm:key_exchange:204 Repaired 1809 keys during
> active anti-entropy exchange of
> {48526421552002635955981722961282834637013385216,3} between
> {51380916937414555718098294900181824909778878464,'riak@10.173.240.3'} and
> {54235412322826475480214866839080815182544371712,'riak@10.173.240.21'}
> 2013-06-26 21:43:51.259 [info]
> <0.21268.7>@riak_kv_exchange_fsm:key_exchange:204 Repaired 1965 keys during
> active anti-entropy exchange of
> {51380916937414555718098294900181824909778878464,3} between
> {51380916937414555718098294900181824909778878464,'riak@10.173.240.3'} and
> {54235412322826475480214866839080815182544371712,'riak@10.173.240.21'}
> 2013-06-26 22:22:22.352 [info]
> <0.20533.8>@riak_kv_exchange_fsm:key_exchange:204 Repaired 1961 keys during
> active anti-entropy exchange of
> {271177061614132377401074334195404075912721858560,3} between
> {274031556999544297163190906134303066185487351808,'riak@10.173.240.3'}
> and {276886052384956216925307478073202056458252845056,'riak@10.173.240.2'}
> 2013-06-26 22:22:51.652 [info]
> <0.24761.8>@riak_kv_exchange_fsm:key_exchange:204 Repaired 2010 keys during
> active anti-entropy exchange of
> {274031556999544297163190906134303066185487351808,3} between
> {274031556999544297163190906134303066185487351808,'riak@10.173.240.3'}
> and {276886052384956216925307478073202056458252845056,'riak@10.173.240.2'}
>
>
> *riak-admin status*
> Attempting to restart script through sudo -H -u riak
> 1-minute stats for 'riak@10.173.240.3'
> -------------------------------------------
> riak_kv_stat_ts : 1372288487
> vnode_gets : 254
> vnode_gets_total : 159164
> vnode_puts : 0
> vnode_puts_total : 42811
> vnode_index_reads : 0
> vnode_index_reads_total : 3706
> vnode_index_writes : 0
> vnode_index_writes_total : 42811
> vnode_index_writes_postings : 0
> vnode_index_writes_postings_total : 0
> vnode_index_deletes : 0
> vnode_index_deletes_total : 0
> vnode_index_deletes_postings : 0
> vnode_index_deletes_postings_total : 0
> node_gets : 95
> node_gets_total : 58191
> node_get_fsm_siblings_mean : 0
> node_get_fsm_siblings_median : 0
> node_get_fsm_siblings_95 : 0
> node_get_fsm_siblings_99 : 0
> node_get_fsm_siblings_100 : 0
> node_get_fsm_objsize_mean : 0
> node_get_fsm_objsize_median : 0
> node_get_fsm_objsize_95 : 0
> node_get_fsm_objsize_99 : 0
> node_get_fsm_objsize_100 : 0
> node_get_fsm_time_mean : 1266
> node_get_fsm_time_median : 1309
> node_get_fsm_time_95 : 1526
> node_get_fsm_time_99 : 2097
> node_get_fsm_time_100 : 3879
> node_puts : 0
> node_puts_total : 192
> node_put_fsm_time_mean : 0
> node_put_fsm_time_median : 0
> node_put_fsm_time_95 : 0
> node_put_fsm_time_99 : 0
> node_put_fsm_time_100 : 0
> read_repairs : 0
> read_repairs_total : 15325
> coord_redirs_total : 126
> executing_mappers : 0
> precommit_fail : 0
> postcommit_fail : 0
> pbc_active : 0
> pbc_connects : 8
> pbc_connects_total : 2834
> read_repairs_primary_notfound_one : 0
> read_repairs_primary_notfound_count : 23123
> read_repairs_fallback_notfound_one : 0
> read_repairs_fallback_notfound_count : 1869
> leveldb_read_block_error : 0
> riak_pipe_stat_ts : 1372288486
> pipeline_active : 0
> pipeline_create_count : 0
> pipeline_create_one : 0
> pipeline_create_error_count : 0
> pipeline_create_error_one : 0
> cpu_nprocs : 418
> cpu_avg1 : 5
> cpu_avg5 : 18
> cpu_avg15 : 26
> mem_total : 2082766848
> mem_allocated : 938852352
> disk : [{"/",974010980,1},
>         {"/dev",1008180,1},
>         {"/run",406792,1},
>         {"/run/lock",5120,0},
>         {"/run/shm",1016976,0},
>         {"/boot",186663,17}]
> nodename : 'riak@10.173.240.3'
> connected_nodes : ['riak@10.173.240.21','riak@10.173.240.2']
> sys_driver_version : <<"2.0">>
> sys_global_heaps_size : 0
> sys_heap_type : private
> sys_logical_processors : 4
> sys_otp_release : <<"R15B01">>
> sys_process_count : 4665
> sys_smp_support : true
> sys_system_version : <<"Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]">>
> sys_system_architecture : <<"x86_64-unknown-linux-gnu">>
> sys_threads_enabled : true
> sys_thread_pool_size : 64
> sys_wordsize : 8
> ring_members : ['riak@10.173.240.2','riak@10.173.240.21','riak@10.173.240.3']
> ring_num_partitions : 512
> ring_ownership : <<"[{'riak@10.173.240.21',170},\n {'riak@10.173.240.2',172},\n {'riak@10.173.240.3',170}]">>
> ring_creation_size : 512
> storage_backend : riak_kv_eleveldb_backend
> erlydtl_version : <<"0.7.0">>
> riak_control_version : <<"1.3.0">>
> cluster_info_version : <<"1.2.3">>
> riak_search_version : <<"1.3.0">>
> merge_index_version : <<"1.3.0">>
> riak_kv_version : <<"1.3.1">>
> riak_api_version : <<"1.3.1">>
> riak_pipe_version : <<"1.3.1">>
> riak_core_version : <<"1.3.1">>
> bitcask_version : <<"1.6.1">>
> basho_stats_version : <<"1.0.3">>
> webmachine_version : <<"1.9.3">>
> mochiweb_version : <<"1.5.1p3">>
> inets_version : <<"5.9">>
> erlang_js_version : <<"1.2.2">>
> runtime_tools_version : <<"1.8.8">>
> os_mon_version : <<"2.2.9">>
> riak_sysmon_version : <<"1.1.3">>
> ssl_version : <<"5.0.1">>
> public_key_version : <<"0.15">>
> crypto_version : <<"2.1">>
> sasl_version : <<"2.2.1">>
> lager_version : <<"1.2.2">>
> syntax_tools_version : <<"1.6.8">>
> compiler_version : <<"4.8.1">>
> stdlib_version : <<"1.18.1">>
> kernel_version : <<"2.15.1">>
> memory_total : 75985000
> memory_processes : 44614494
> memory_processes_used : 44614480
> memory_system : 31370506
> memory_atom : 569961
> memory_atom_used : 539061
> memory_binary : 3642608
> memory_code : 10232207
> memory_ets : 4826952
>
>
> *Data size in Riak:*
>
> du -h --max-depth=1 /var/lib/riak/
> ***** 10.0.20.21 *****
> 0 /var/lib/riak/bitcask
> 132K /var/lib/riak/ring
> 8.0K /var/lib/riak/mr_queue
> 2.1M /var/lib/riak/kv_vnode
> 899M /var/lib/riak/leveldb
> 180M /var/lib/riak/anti_entropy
> 1.1G /var/lib/riak/
> ***** 10.0.20.22 *****
> 0 /var/lib/riak/bitcask
> 132K /var/lib/riak/ring
> 8.0K /var/lib/riak/mr_queue
> 2.1M /var/lib/riak/kv_vnode
> 292M /var/lib/riak/leveldb
> 220M /var/lib/riak/anti_entropy
> 513M /var/lib/riak/
> ***** 10.0.20.23 *****
> 0 /var/lib/riak/bitcask
> 64K /var/lib/riak/ring
> 2.1M /var/lib/riak/kv_vnode
> 53M /var/lib/riak/leveldb
> 23M /var/lib/riak/anti_entropy
> 77M /var/lib/riak/
>
>
>
>
>
>
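
The three handoff failures in the error.log quoted above all name the same target, 'riak@10.173.240.2', and the same exit reason, noproc. In Erlang that reason means the process being called (here riak_kv_handoff_listener on that node) did not exist at the time of the call. Below is a minimal Python sketch for tallying such failures by target node and reason. It is not a Riak tool: the name tally_failures is made up, the regular expression is an assumption based only on the log lines in this thread, and it expects each log entry on a single line (the entries above were wrapped by the mail client).

```python
import re
from collections import Counter

# The three handoff failures from the error.log in this thread, unwrapped to
# one line each and cut off after the exit tuple.
SAMPLE_LINES = [
    "2013-06-26 16:41:27.819 [error] "
    "<0.10046.0>@riak_core_handoff_sender:start_fold:226 "
    "hinted_handoff transfer of riak_kv_vnode from 'riak@10.173.240.3' "
    "961964944883816959833284743408959721921971224576 to 'riak@10.173.240.2' "
    "961964944883816959833284743408959721921971224576 failed because of "
    "exit:{noproc,{riak_core_gen_server,call,[{riak_kv_handoff_listener,"
    "'riak@10.173.240.2'},handoff_port,infinity]}}",
    "2013-06-26 16:41:27.819 [error] "
    "<0.10045.0>@riak_core_handoff_sender:start_fold:226 "
    "hinted_handoff transfer of riak_kv_vnode from 'riak@10.173.240.3' "
    "790695221759101774106290427075020305556041629696 to 'riak@10.173.240.2' "
    "790695221759101774106290427075020305556041629696 failed because of "
    "exit:{noproc,{riak_core_gen_server,call,[{riak_kv_handoff_listener,"
    "'riak@10.173.240.2'},handoff_port,infinity]}}",
    "2013-06-26 16:41:31.153 [error] "
    "<0.10061.0>@riak_core_handoff_sender:start_fold:226 "
    "hinted_handoff transfer of riak_kv_vnode from 'riak@10.173.240.3' "
    "379647886259785328361504067873565706277810601984 to 'riak@10.173.240.2' "
    "379647886259785328361504067873565706277810601984 failed because of "
    "exit:{noproc,{riak_core_gen_server,call,[{riak_kv_handoff_listener,"
    "'riak@10.173.240.2'},handoff_port,infinity]}}",
]

# Assumed shape of a handoff failure entry, inferred from the lines above.
FAILURE = re.compile(
    r"(?P<kind>\w+) transfer of (?P<module>\w+) "
    r"from '(?P<src>[^']+)' (?P<src_part>\d+) "
    r"to '(?P<dst>[^']+)' (?P<dst_part>\d+) "
    r"failed because of (?P<cls>\w+):\{(?P<reason>\w+),"
)


def tally_failures(lines):
    """Count failed transfers per (target node, exit reason)."""
    counts = Counter()
    for line in lines:
        m = FAILURE.search(line)
        if m:
            counts[(m.group("dst"), m.group("reason"))] += 1
    return counts


if __name__ == "__main__":
    for (node, reason), n in tally_failures(SAMPLE_LINES).items():
        print(node, reason, n)
```

Grouping by target and reason makes it easy to see whether failures cluster on one receiving node, as they do in the log above.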
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
