What Erlang version did you build with? How are you load balancing between the nodes? What kind of disks are you using?
On Thu, Aug 1, 2013 at 7:53 PM, Paul Ingalls <p...@fanzo.me> wrote:

> FYI, 2 more nodes died at the end of the last test. Storm, which I'm using
> to put data in, kills the topology a bit abruptly; perhaps the nodes don't
> like a client going away like that?
>
> log from one of the nodes:
>
> 2013-08-02 02:27:23 =ERROR REPORT====
> Error in process <0.4959.0> on node 'riak@riak004' with exit value:
> {badarg,[{riak_core_stat,vnodeq_len,1,[{file,"src/riak_core_stat.erl"},{line,181}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[{file,"src/riak_core_stat.erl"},{line,172}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[...
>
> 2013-08-02 02:27:33 =ERROR REPORT====
> Error in process <0.5055.0> on node 'riak@riak004' with exit value:
> {badarg,[{riak_core_stat,vnodeq_len,1,[{file,"src/riak_core_stat.erl"},{line,181}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[{file,"src/riak_core_stat.erl"},{line,172}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[...
>
> 2013-08-02 02:27:51 =ERROR REPORT====
> Error in process <0.5228.0> on node 'riak@riak004' with exit value:
> {badarg,[{riak_core_stat,vnodeq_len,1,[{file,"src/riak_core_stat.erl"},{line,181}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[{file,"src/riak_core_stat.erl"},{line,172}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[...
>
> and the log from the other node:
>
> 2013-08-02 00:09:39 =ERROR REPORT====
> Error in process <0.4952.0> on node 'riak@riak007' with exit value:
> {badarg,[{riak_core_stat,vnodeq_len,1,[{file,"src/riak_core_stat.erl"},{line,181}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[{file,"src/riak_core_stat.erl"},{line,172}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[...
>
> 2013-08-02 00:09:44 =ERROR REPORT====
> ** State machine <0.2368.0> terminating
> ** Last event in was unregistered
> ** When State == active
> ** Data ==
> {state,114179815416476790484662877555959610910619729920,riak_kv_vnode,{deleted,{state,114179815416476790484662877555959610910619729920,riak_kv_eleveldb_backend,{state,<<>>,"/mnt/datadrive/riak/data/leveldb/114179815416476790484662877555959610910619729920",[{create_if_missing,true},{max_open_files,128},{use_bloomfilter,true},{write_buffer_size,58858594}],[{add_paths,[]},{allow_strfun,false},{anti_entropy,{on,[]}},{anti_entropy_build_limit,{1,3600000}},{anti_entropy_concurrency,2},{anti_entropy_data_dir,"/mnt/datadrive/riak/data/anti_entropy"},{anti_entropy_expire,604800000},{anti_entropy_leveldb_opts,[{write_buffer_size,4194304},{max_open_files,20}]},{anti_entropy_tick,15000},{create_if_missing,true},{data_root,"/mnt/datadrive/riak/data/leveldb"},{fsm_limit,50000},{hook_js_vm_count,2},{http_url_encoding,on},{included_applications,[]},{js_max_vm_mem,8},{js_thread_stack,16},{legacy_stats,true},{listkeys_backpressure,true},{map_js_vm_count,8},{mapred_2i_pipe,true},{mapred_name,"mapred"},{max_open_files,128},{object_format,v1},{reduce_js_vm_count,6},{stats_urlpath,"stats"},{storage_backend,riak_kv_eleveldb_backend},{use_bloomfilter,true},{vnode_vclocks,true},{write_buffer_size,58858594}],[],[],[{fill_cache,false}],true,false},{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},undefined,3000,1000,100,100,true,true,undefined}},riak@riak003,none,undefined,undefined,undefined,{pool,riak_kv_worker,10,[]},undefined,107615}
> ** Reason for termination =
> **
> {badarg,[{eleveldb,close,[<<>>],[]},{riak_kv_eleveldb_backend,stop,1,[{file,"src/riak_kv_eleveldb_backend.erl"},{line,149}]},{riak_kv_vnode,terminate,2,[{file,"src/riak_kv_vnode.erl"},{line,836}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,847}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,586}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> 2013-08-02 00:09:44 =CRASH REPORT====
> crasher:
> initial call: riak_core_vnode:init/1
> pid: <0.2368.0>
> registered_name: []
> exception exit:
> {{badarg,[{eleveldb,close,[<<>>],[]},{riak_kv_eleveldb_backend,stop,1,[{file,"src/riak_kv_eleveldb_backend.erl"},{line,149}]},{riak_kv_vnode,terminate,2,[{file,"src/riak_kv_vnode.erl"},{line,836}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,847}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,586}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},[{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,589}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> ancestors: [riak_core_vnode_sup,riak_core_sup,<0.139.0>]
> messages: []
> links: [<0.142.0>]
> dictionary: [{random_seed,{8115,23258,22987}}]
> trap_exit: true
> status: running
> heap_size: 196418
> stack_size: 24
> reductions: 12124
> neighbours:
> 2013-08-02 00:09:44 =SUPERVISOR REPORT====
> Supervisor: {local,riak_core_vnode_sup}
> Context: child_terminated
> Reason:
> {badarg,[{eleveldb,close,[<<>>],[]},{riak_kv_eleveldb_backend,stop,1,[{file,"src/riak_kv_eleveldb_backend.erl"},{line,149}]},{riak_kv_vnode,terminate,2,[{file,"src/riak_kv_vnode.erl"},{line,836}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,847}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,586}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> Offender:
> [{pid,<0.2368.0>},{name,undefined},{mfargs,{riak_core_vnode,start_link,undefined}},{restart_type,temporary},{shutdown,300000},{child_type,worker}]
>
>
> Paul Ingalls
> Founder & CEO Fanzo
> p...@fanzo.me
> @paulingalls
> http://www.linkedin.com/in/paulingalls
>
>
> On Aug 1, 2013, at 7:49 PM, Paul Ingalls <p...@fanzo.me> wrote:
>
> I should say that I built Riak from the master branch of the git repository.
> Perhaps that was a bad idea?
>
> Paul Ingalls
> Founder & CEO Fanzo
> p...@fanzo.me
> @paulingalls
> http://www.linkedin.com/in/paulingalls
>
>
> On Aug 1, 2013, at 7:47 PM, Paul Ingalls <p...@fanzo.me> wrote:
>
> Thanks for the quick response, Matthew!
>
> I gave that a shot, and if anything the performance was worse. When I
> picked 128 I ran through the calculations on this page:
>
> http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/#Parameter-Planning
>
> and thought that would work, but it sounds like I was quite a bit off from
> what you have below.
>
> Looking at Riak Control, the memory was staying pretty low, and watching top
> the CPU was well in hand. iostat showed very little of the CPU in iowait,
> although it was writing a lot. I imagine, however, that this is missing a
> lot of the details.
>
> Any other ideas? I can't imagine one get/update/put cycle per second is the
> best I can do…
>
> Thanks!
>
> Paul Ingalls
> Founder & CEO Fanzo
> p...@fanzo.me
> @paulingalls
> http://www.linkedin.com/in/paulingalls
>
>
> On Aug 1, 2013, at 7:12 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
>
> Try cutting your max open files in half. I am working from my iPad, not my
> workstation, so my numbers are rough.
> Will get better ones to you in the morning.
>
> The math goes like this:
>
> - vnode/partition heap usage is (4 MB * (max_open_files - 10)) + 8 MB
> - you have 18 vnodes per server (multiply the above by 18)
> - AAE (active anti-entropy) is "on", so that adds (4 MB * 10 + 8 MB) times
>   18 vnodes
>
> The three lines above give the total memory leveldb will attempt to use per
> server if your dataset is large enough to fill it.
>
> Matthew
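[Editor's note: for concreteness, here is Matthew's rule of thumb worked through with the values mentioned in this thread (max_open_files of 128, roughly 18 vnodes per server with 128 partitions on 7 nodes, AAE on). This is only a back-of-the-envelope sketch of his estimate, not an exact accounting of leveldb's memory use.]

```erlang
%% Rough per-server leveldb memory estimate, following the three rules above.
%% Inputs come from this thread: max_open_files = 128, ~18 vnodes per server
%% (128 partitions / 7 nodes), AAE enabled with its own small leveldb instance.
-module(leveldb_mem_estimate).
-export([per_server_mb/0]).

per_server_mb() ->
    MaxOpenFiles = 128,
    Vnodes       = 18,
    VnodeMB      = 4 * (MaxOpenFiles - 10) + 8,   %% per data vnode: 480 MB
    AaeMB        = 4 * 10 + 8,                    %% per-vnode AAE tree: 48 MB
    Vnodes * (VnodeMB + AaeMB).                   %% ~9504 MB per server
```

That works out to roughly 9.5 GB per server, which already exceeds the 7 GB of RAM on the Azure Large instances described in the first message of the thread.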
> On Aug 1, 2013, at 21:33, Paul Ingalls <p...@fanzo.me> wrote:
>
> I should add more details about the nodes that crashed. I ran this for the
> first time for all of 10 minutes.
>
> Here is the log from the first one:
>
> 2013-08-02 00:09:44 =ERROR REPORT====
> ** State machine <0.2368.0> terminating
> ** Last event in was unregistered
> ** When State == active
> ** Data ==
> {state,114179815416476790484662877555959610910619729920,riak_kv_vnode,{deleted,{state,114179815416476790484662877555959610910619729920,riak_kv_eleveldb_backend,{state,<<>>,"/mnt/datadrive/riak/data/leveldb/114179815416476790484662877555959610910619729920",[{create_if_missing,true},{max_open_files,128},{use_bloomfilter,true},{write_buffer_size,58858594}],[{add_paths,[]},{allow_strfun,false},{anti_entropy,{on,[]}},{anti_entropy_build_limit,{1,3600000}},{anti_entropy_concurrency,2},{anti_entropy_data_dir,"/mnt/datadrive/riak/data/anti_entropy"},{anti_entropy_expire,604800000},{anti_entropy_leveldb_opts,[{write_buffer_size,4194304},{max_open_files,20}]},{anti_entropy_tick,15000},{create_if_missing,true},{data_root,"/mnt/datadrive/riak/data/leveldb"},{fsm_limit,50000},{hook_js_vm_count,2},{http_url_encoding,on},{included_applications,[]},{js_max_vm_mem,8},{js_thread_stack,16},{legacy_stats,true},{listkeys_backpressure,true},{map_js_vm_count,8},{mapred_2i_pipe,true},{mapred_name,"mapred"},{max_open_files,128},{object_format,v1},{reduce_js_vm_count,6},{stats_urlpath,"stats"},{storage_backend,riak_kv_eleveldb_backend},{use_bloomfilter,true},{vnode_vclocks,true},{write_buffer_size,58858594}],[],[],[{fill_cache,false}],true,false},{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},undefined,3000,1000,100,100,true,true,undefined}},riak@riak003,none,undefined,undefined,undefined,{pool,riak_kv_worker,10,[]},undefined,107615}
> ** Reason for termination =
> **
> {badarg,[{eleveldb,close,[<<>>],[]},{riak_kv_eleveldb_backend,stop,1,[{file,"src/riak_kv_eleveldb_backend.erl"},{line,149}]},{riak_kv_vnode,terminate,2,[{file,"src/riak_kv_vnode.erl"},{line,836}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,847}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,586}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> 2013-08-02 00:09:44 =CRASH REPORT====
> crasher:
> initial call: riak_core_vnode:init/1
> pid: <0.2368.0>
> registered_name: []
> exception exit:
> {{badarg,[{eleveldb,close,[<<>>],[]},{riak_kv_eleveldb_backend,stop,1,[{file,"src/riak_kv_eleveldb_backend.erl"},{line,149}]},{riak_kv_vnode,terminate,2,[{file,"src/riak_kv_vnode.erl"},{line,836}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,847}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,586}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},[{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,589}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> ancestors: [riak_core_vnode_sup,riak_core_sup,<0.139.0>]
> messages: []
> links: [<0.142.0>]
> dictionary: [{random_seed,{8115,23258,22987}}]
> trap_exit: true
> status: running
> heap_size: 196418
> stack_size: 24
> reductions: 12124
> neighbours:
> 2013-08-02 00:09:44 =SUPERVISOR REPORT====
> Supervisor: {local,riak_core_vnode_sup}
> Context: child_terminated
> Reason:
> {badarg,[{eleveldb,close,[<<>>],[]},{riak_kv_eleveldb_backend,stop,1,[{file,"src/riak_kv_eleveldb_backend.erl"},{line,149}]},{riak_kv_vnode,terminate,2,[{file,"src/riak_kv_vnode.erl"},{line,836}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,847}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,586}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> Offender:
> [{pid,<0.2368.0>},{name,undefined},{mfargs,{riak_core_vnode,start_link,undefined}},{restart_type,temporary},{shutdown,300000},{child_type,worker}]
>
> The second one looks like it ran out of heap; I assume I have something
> misconfigured here...
>
> ===== Fri Aug 2 00:51:28 UTC 2013
> Erlang has closed
> /home/fanzo/riak/rel/riak/bin/../lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
>
> Crash dump was written to: ./log/erl_crash.dump
> eheap_alloc: Cannot allocate 5568010120 bytes of memory (of type "heap").
>
>
> Paul Ingalls
> Founder & CEO Fanzo
> p...@fanzo.me
> @paulingalls
> http://www.linkedin.com/in/paulingalls
>
>
> On Aug 1, 2013, at 6:28 PM, Paul Ingalls <p...@fanzo.me> wrote:
>
> Couple of questions.
>
> I have migrated my system to use Riak on the back end. I have set up a 1.4
> cluster with 128 partitions on 7 nodes with LevelDB as the store. Each node
> looks like:
>
> Azure Large instance (4 CPU, 7 GB RAM)
> data directory is on a RAID 0
> max open files is set to 128
> async threads on the VM is set to 16
> everything else is defaults
>
> I'm using the 1.4.1 Java client, connecting via the protocol buffers
> cluster client.
>
> With this setup, I'm seeing poor throughput on my service load. I ran a
> test for a bit and was seeing only a few gets/puts per second. And then
> when I stopped the client, two of the nodes crashed.
>
> I'm very new to Riak, so I figure I'm doing something wrong. I saw a note
> on the list earlier of someone getting well over 1000 puts per second, so I
> know it can move pretty fast.
>
> What is a good strategy for troubleshooting?
>
> How many fetch/update/store loops per second should I expect to see on a
> cluster of this size?
>
> Thanks!
>
> Paul
>
> Paul Ingalls
> Founder & CEO Fanzo
> p...@fanzo.me
> @paulingalls
> http://www.linkedin.com/in/paulingalls
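[Editor's note: for reference, the eleveldb settings being discussed (data_root, max_open_files, write_buffer_size, use_bloomfilter) live in the eleveldb section of app.config. The values below are the ones visible in the vnode state dump quoted above, with max_open_files halved to 64 per Matthew's suggestion; this is a sketch of that one section, not a complete or recommended configuration.]

```erlang
%% eleveldb section of app.config (sketch only; just the settings discussed
%% in this thread are shown, everything else left at its defaults).
{eleveldb, [
    {data_root, "/mnt/datadrive/riak/data/leveldb"},
    %% Halved from 128, per Matthew's advice, to cut per-vnode heap usage
    %% roughly in half: 4 MB * (64 - 10) + 8 MB = 224 MB per data vnode.
    {max_open_files, 64},
    {write_buffer_size, 58858594},
    {use_bloomfilter, true}
]}
```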