Hi - I've been testing a 3-node Riak cluster with an ejabberd messaging
cluster in front of it that writes data to the Riak nodes. Whilst load
testing the platform (by creating 0.5 million ejabberd users via Tsung),
I found that the Riak nodes suddenly crashed. My question is: how do we
recover from such a situation if it were to occur in production?

To provide further context: the leveldb log files storing the data
suddenly grew very large, to the point that the AWS Riak instances could
no longer load them into memory. As a result, we get a core dump whenever
'riak start' is run on those instances. I had n_val = 2, and all 3 nodes
went down almost simultaneously, so in this scenario we cannot even rely
on a second copy of the data. One way to prevent this in the first place
would of course be auto-scaling, but I'm wondering whether there is any
after-the-fact recovery that can be performed in such a scenario. Is it
possible to simply copy the leveldb data to an instance with more memory,
or to trim the data so that it can be loaded on the same instance?

I'd appreciate any input - I'm a tad concerned about how we could recover
from such a situation if it were to happen in production (apart from
leveraging auto-scaling as a preventive measure).

Thanks!
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
