Hello!

We have 8 nodes running riak + riak-cs, with about 7 TB of data in
the cluster. Some time ago the riak process was killed by the OOM
killer on several nodes while a large (~100 GB) file was being written
via s3cmd, because riak-cs consumed all the memory.
Since recovering, we have been experiencing the following problems:

1) Very slow storage statistics calculation (about two hours). Before
the OOM it completed in 40 minutes.

2) The bucket touristerru holds 4000000+ files, and no storage
statistics are calculated for it:
riak-cs/error.log:
2014-08-11 00:12:28.096 [error]
<0.5987.49>@riak_cs_storage:maybe_sum_bucket:74 failed to calculate
usage of bucket 'touristerru' of user 'OF6DQ0FRBTEVLKGY-X0
P'. Reason: {error,<<"{\"phase\":0,\"error\":\"[{vnode_proxy_timeout,

After requesting statistics for this user, we see something like this:

{u'Access': u'not_requested',
 u'Storage': {u'Errors': [],
  u'Samples': [{u'EndTime': u'20140811T091836Z',
    u'StartTime': u'20140811T091113Z',
    u'touristerru':
u'{error,<<"{\\"phase\\":0,\\"error\\":\\"[{vnode_proxy_timeout,{228359630832953580969325755111919221821239459840,\'riak@192.168.0.8\'}}]\\",\\"input\\":\\"{<<48,111,58,185,253,24,64,48,197,1,20,36,130,111,222,189,75,202,107>>,<<\\\\\\"files/3/8/6/5/3/5/2/clones/870_527_fixedwidth.jpg\\\\\\">>}\\",\\"type\\":\\"result\\",\\"stack\\":\\"[{gen,do_call,4,[{file,\\\\\\"gen.erl\\\\\\"},{line,234}]},{riak_core_vnode_proxy,call,2,[{file,\\\\\\"src/riak_core_vnode_proxy.erl\\\\\\"},{line,109}]},{riak_pipe_vnode,queue_work_send,4,[{file,\\\\\\"src/riak_pipe_vnode.erl\\\\\\"},{line,333}]},{riak_pipe_vnode,queue_work_erracc,6,[{file,\\\\\\"src/riak_pipe_vnode.erl\\\\\\"},{line,281}]},{riak_kv_pipe_get,process,3,[{file,\\\\\\"src/riak_kv_pipe_get.erl\\\\\\"},{line,92}]},{riak_pipe_vnode_worker,process_input,3,[{file,\\\\\\"src/riak_pipe_vnode_worker.erl\\\\\\"},{line,445}]},{riak_pipe_vnode_worker,wait_for_input,...},...]\\"}">>}'}]}}
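For anyone checking a response like the one above, here is a minimal
sketch of how the failed buckets could be picked out. It assumes the
response has already been parsed into a Python dict shaped like the
dump above, and that a healthy bucket entry is a dict of counters while
a failed one is the stringified Erlang error tuple. The function name
failed_buckets is hypothetical and is not part of riak-cs or any client
library.

```python
def failed_buckets(usage):
    """Return {bucket: error_text} for sample entries that hold an error.

    Assumes `usage` has the shape shown above: 'Storage' -> 'Samples' is
    a list of dicts whose keys are bucket names plus 'StartTime' and
    'EndTime'.
    """
    failures = {}
    storage = usage.get("Storage")
    if not isinstance(storage, dict):
        # e.g. the string 'not_requested'
        return failures
    for sample in storage.get("Samples", []):
        for key, value in sample.items():
            if key in ("StartTime", "EndTime"):
                continue
            # Assumption: a failed bucket is reported as a string that
            # starts with the Erlang tuple '{error,'.
            if isinstance(value, str) and value.startswith("{error,"):
                failures[key] = value
    return failures


if __name__ == "__main__":
    response = {
        "Access": "not_requested",
        "Storage": {
            "Errors": [],
            "Samples": [{
                "StartTime": "20140811T091113Z",
                "EndTime": "20140811T091836Z",
                "touristerru": '{error,<<"{\\"phase\\":0, ...',
            }],
        },
    }
    for bucket, error_text in failed_buckets(response).items():
        print(bucket, "->", error_text[:40])
```
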

3) The calculation process crashes:
riak-cs/console.log
2014-08-11 09:22:51.580 [warning]
<0.24095.1>@riak_cs_storage_d:read_storage_schedule1:300 No storage
schedule defined. Calculation must be triggered manually.
2014-08-11 09:22:51.580 [error] <0.438.0> gen_fsm riak_cs_storage_d in
state calculating terminated with reason: no match of right hand value
false in riak_cs_storage:sum_bucket/1 line 104
2014-08-11 09:22:51.580 [error] <0.438.0> CRASH REPORT Process
riak_cs_storage_d with 1 neighbours exited with reason: no match of
right hand value false in riak_cs_storage:sum_bucket/1 line 104 in
gen_fsm:terminate/7 line 611
2014-08-11 09:22:51.581 [error] <0.153.0> Supervisor riak_cs_sup had
child riak_cs_storage_d started with riak_cs_storage_d:start_link() at
<0.438.0> exit with reason no match of right hand value false in
riak_cs_storage:sum_bucket/1 line 104 in context child_terminated
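
Because the log lines above are wrapped by the mail client, here is a
small sketch that joins the wrapped lines back into whole entries and
keeps only the [error] ones. It assumes each entry starts with a
"YYYY-MM-DD HH:MM:SS.mmm [level]" prefix as in the excerpts above. The
names parse_entries and errors_only are hypothetical helpers, not part
of any Riak tooling.

```python
import re

# Entry header as seen in the excerpts: "2014-08-11 09:22:51.580 [error] ..."
ENTRY_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[(\w+)\]\s*(.*)$"
)


def parse_entries(lines):
    """Join wrapped lines into (timestamp, level, message) tuples."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_START.match(line)
        if match:
            entries.append([match.group(1), match.group(2), match.group(3)])
        elif entries and line.strip():
            # Continuation of the previous entry (wrapped text).
            entries[-1][2] = (entries[-1][2] + " " + line.strip()).strip()
    return [tuple(entry) for entry in entries]


def errors_only(entries):
    """Keep only the entries logged at the 'error' level."""
    return [entry for entry in entries if entry[1] == "error"]


if __name__ == "__main__":
    excerpt = [
        "2014-08-11 09:22:51.580 [warning]",
        "<0.24095.1>@riak_cs_storage_d:read_storage_schedule1:300 No storage",
        "schedule defined. Calculation must be triggered manually.",
        "2014-08-11 09:22:51.580 [error] <0.438.0> gen_fsm riak_cs_storage_d in",
        "state calculating terminated with reason: no match of right hand value",
        "false in riak_cs_storage:sum_bucket/1 line 104",
    ]
    for timestamp, level, message in errors_only(parse_entries(excerpt)):
        print(timestamp, message)
```
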

What has been done:

System:
1) RAM upgraded from 30 to 61 GB on every node.
2) Added some swap on an additional SSD (only to avoid OOM; sysctl
vm.swappiness=0 is set).

Riak configs:
1) increased cache_size in the backend config
2) set {mapred_reduce_phase_batch_size, 5000}
3) set {mapred_always_prereduce, true}

Riak-CS configs:
1) set {storage_archive_period, 14400}
2) upgraded from 1.4.8 to 1.5.0

Configs: http://ovh.to/iwTiMby
Last logs: http://ovh.to/AHaASw

I don't know what to do next.

-- 
Stanislav

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
