We just had it again (around this time of the day we have our highest user activity).

I will set +P to 131072 tomorrow. Is there anything else I should check or change?

What about this memory-high-watermark which I get sporadically?

Ingo

On 03.04.2013 17:57, Evan Vigil-McClanahan wrote:
As for +P, the default was raised in R16 (which is what the current man
page describes); on R15 it's only 32k.

The behavior that you're describing does sound like a very large
object getting put into the cluster (which may cause backups and push
you up against the process limit, could have caused scheduler collapse
on 1.2, etc.).

On Wed, Apr 3, 2013 at 8:39 AM, Ingo Rockel
<ingo.roc...@bluelionmobile.com> wrote:
Evan,

sys_process_count is somewhere between 5k and 11k on the nodes right now.
Concerning your suggested +P config: according to the Erlang docs, the
default for this param is already 262144, so setting it to 65535 would in
fact lower it?
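
To put these numbers side by side, here is a small Python sketch that
compares the observed peak process count with the +P values discussed in
this thread. The 32768 figure is the R15 default ("32k") and 262144 is the
value quoted from the Erlang docs; the function name and the dictionary
labels are made up for illustration.

```python
# Illustrative only: compare the observed Erlang process count with +P limits.
LIMITS = {
    "R15 default": 32768,        # "32k", per the thread
    "suggested (65535)": 65535,  # value suggested in the thread
    "planned (131072)": 131072,  # value Ingo plans to set
    "R16 default": 262144,       # value quoted from the Erlang docs
}


def utilization(process_count, limit):
    """Return the fraction of the process limit that is in use."""
    return process_count / limit


peak = 11_000  # upper end of the observed sys_process_count range

for name, limit in LIMITS.items():
    print(f"{name}: {utilization(peak, limit):.1%} of the limit used at peak")
```

With a sampled peak of about 11k, even the 32k limit is only about a third
used, so if the limit is being reached it would have to happen in short
bursts that the sampled counter does not show.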

We chose the ring size to be able to handle growth, which was the main reason
to switch from MySQL to NoSQL/Riak. We have 12 nodes, so about 86 vnodes per
node.
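
The "about 86" figure follows from the ring size of 1024 mentioned in this
thread. A minimal Python sketch of the arithmetic, assuming the partitions
are spread as evenly as possible (the actual claim on a given cluster may
differ slightly):

```python
import math

ring_size = 1024  # partitions in the ring, as stated in the thread
nodes = 12        # cluster size, as stated in the thread

average = ring_size / nodes
base, extra = divmod(ring_size, nodes)

print(f"average: {average:.1f} vnodes per node")
print(f"even split: {extra} nodes with {math.ceil(average)}, "
      f"{nodes - extra} nodes with {base}")
```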

No, we don't monitor object sizes. The majority of objects are very small
(below 200 bytes), but we have objects storing references to these small
objects which might grow to a few megabytes in size. Most of these are paged,
though, and should not exceed one megabyte. Only one type is not paged
(implementation reasons).

The outgoing/incoming traffic is constantly around 100 Mbit; when the
performance drops happen, we suddenly see spikes up to 1 Gbit. These
spikes keep occurring on three nodes for as long as the performance drop
lasts.

Ingo

On 03.04.2013 17:12, Evan Vigil-McClanahan wrote:

Ingo,

riak-admin status | grep sys_process_count

will tell you how many processes are running. The default process
limit on Erlang is a little low, and we'd suggest raising it
(especially with your extra-large ring_size). Erlang processes are
cheap, so 65535 or even double that will be fine.

Busy dist ports are still worrying.  Are you monitoring object sizes?
Are there any spikes there associated with performance drops?

On Wed, Apr 3, 2013 at 8:03 AM, Ingo Rockel
<ingo.roc...@bluelionmobile.com> wrote:

Hi Evan,

I set +swt very_low and +zdbbl to 64 MB. Setting these params greatly
reduced the number of busy_dist_port and "Monitor got {suppressed,..."
messages, but when the performance of the cluster suddenly drops we still
see these messages.

The cluster was updated to 1.3 in the meantime.

The eleveldb section:

   %% eLevelDB Config
   {eleveldb, [
               {data_root, "/var/lib/riak/leveldb"},
               {cache_size, 33554432}, %% 32 MB in bytes
               {write_buffer_size_min, 67108864}, %% 64 MB in bytes
               {write_buffer_size_max, 134217728}, %% 128 MB in bytes
               {max_open_files, 4000}
              ]},

The ring size is 1024 and the machines have 48 GB of memory. Concerning the
params from vm.args:

-env ERL_MAX_PORTS 4096
-env ERL_MAX_ETS_TABLES 8192

+P isn't set
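
As a rough cross-check of these settings against the 48 GB per machine, the
Python sketch below estimates how much memory the block cache and write
buffers alone could take. It assumes that cache_size and the write buffer
settings apply to each vnode separately and that a node hosts about 86
vnodes (1024 partitions over the 12 nodes mentioned in this thread). It
ignores the memory used for open files (max_open_files) and everything else
in the VM, so treat it as a lower bound rather than a sizing formula.

```python
MB = 1024 * 1024
GB = 1024 * MB

cache_size = 33554432              # bytes, from the eleveldb section
write_buffer_size_max = 134217728  # bytes, from the eleveldb section
vnodes_per_node = 86               # assumption: 1024 partitions over 12 nodes


def cache_and_buffer_bytes(vnodes):
    """Worst case if every vnode fills its cache and one maximum-size write buffer."""
    return vnodes * (cache_size + write_buffer_size_max)


total = cache_and_buffer_bytes(vnodes_per_node)
print(f"cache + write buffers: {total / GB:.1f} GiB per node")
```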

Ingo

On 03.04.2013 16:53, Evan Vigil-McClanahan wrote:

For your prior mail, I thought that someone had answered.  Our initial
suggestion was to add +swt very_low to your vm.args, as well as
setting the +zdbbl setting that Jon recommended in the list post you
pointed to.  If those help, moving to 1.3 should help more.

Other limits in vm.args that can cause problems are +P, ERL_MAX_PORTS,
and ERL_MAX_ETS_TABLES.  Are any of these set?  If so, to what?

Can you also paste the eleveldb section of your app.config?

On Wed, Apr 3, 2013 at 7:41 AM, Ingo Rockel
<ingo.roc...@bluelionmobile.com> wrote:


Hi Evan,

I'm not sure, I find a lot of these:

2013-03-30 23:27:52.992 [error]
<0.8036.323>@riak_api_pb_server:handle_info:141 Unrecognized message
{22243034,{error,timeout}}

and, at about the same time, some of the kind below get logged
(although this one has a slightly different timestamp):

2013-03-30 23:27:53.056 [error] <0.9457.323>@riak_kv_console:status:178
Status failed error:terminated

Ingo

On 03.04.2013 16:24, Evan Vigil-McClanahan wrote:

Resending to the list:

Ingo,

That is an indication that the protocol buffers server can't spawn a
put fsm, which means that a put cannot be done for some reason or
another.  Are there any other messages that appear around this time
that might indicate why?

On Wed, Apr 3, 2013 at 12:09 AM, Ingo Rockel
<ingo.roc...@bluelionmobile.com> wrote:



Hi,

we have some performance issues with our Riak cluster: from time to time we
have a sudden drop in performance (I already asked the list about this, but
no one had an idea). On the problematic nodes, although not always at the
same time as the drops, we get a lot of these messages:

2013-04-02 21:41:11.173 [warning] <0.25646.475> ** Can not start
proc_lib:init_p
,[<0.14556.474>,[<0.9519.474>,riak_api_pb_sup,riak_api_sup,<0.1291.0>],riak_kv_p
ut_fsm,start_link,[{raw,65032165,<0.9519.474>},{r_object,<<109>>,<<77,115,124,49
,53,55,57,56,57,56,50,124,49,51,54,52,57,51,49,54,49,49,53,49,50,52,53,54>>,[{r_
content,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},<<>>}],[],{dict,2,16,16,8,8
0,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[]
,[],[],[[<<99,111,110,116,101,110,116,45,116,121,112,101>>,97,112,112,108,105,99
,97,116,105,111,110,47,106,115,111,110]],[],[],[],[],[[<<99,104,97,114,115,101,1
16>>,85,84,70,45,56]]}}},<<123,34,115,116,34,58,50,44,34,116,34,58,49,44,34,99,3
4,58,34,66,117,116,32,115,104,101,32,105,115,32,103,111,110,101,44,32,110,32,101
,118,101,110,32,116,104,111,117,103,104,32,105,109,32,110,111,116,32,105,110,32,
117,114,32,99,105,116,121,32,105,32,108,111,118,101,32,117,32,110,100,32,105,32,
109,101,97,110,32,105,116,32,58,39,40,34,44,34,114,34,58,49,52,51,52,54,52,51,57
,44,34,115,34,58,49,53,55,57,56,57,56,50,44,34,99,116,34,58,49,51,54,52,57,51,49
,54,49,49,53,49,50,44,34,97,110,34,58,102,97,108,115,101,44,34,115,107,34,58,49,
51,54,52,57,51,49,54,49,49,53,49,50,52,53,54,44,34,115,117,34,58,48,125>>},[{tim
eout,60000}]]] on 'riak@172.22.3.12' **

Can anyone explain to me what these messages mean and if I need to do
something about it? Could these messages be in any way related to the
performance issues?
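
The long runs of numbers in the message are byte values: Erlang prints
binaries and strings as lists of integers when it does not format them as
text. As a small aid for reading such dumps, the Python sketch below decodes
one metadata pair copied from the message above. The helper name is made up
for illustration.

```python
def decode_bytes(values):
    """Turn a list of byte values, as printed by Erlang, into text."""
    return bytes(values).decode("utf-8")


# Copied from the log message: <<99,111,110,...>> and the list that follows it.
key = [99, 111, 110, 116, 101, 110, 116, 45, 116, 121, 112, 101]
value = [97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 47,
         106, 115, 111, 110]

print(decode_bytes(key), "=", decode_bytes(value))
# prints: content-type = application/json
```

The same approach works for the other byte lists in the message, such as the
object key and the JSON body passed to riak_kv_put_fsm.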

Ingo

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com





--
Software Architect

Blue Lion mobile GmbH
Tel. +49 (0) 221 788 797 14
Fax. +49 (0) 221 788 797 19
Mob. +49 (0) 176 24 87 30 89

ingo.roc...@bluelionmobile.com


qeep: Hefferwolf



www.bluelionmobile.com
www.qeep.net




