Hello!

We have 8 nodes running riak + riak-cs, with about 7 TB of data in the cluster. Some time ago the riak process was killed by the OOM killer on several nodes while a large (~100 GB) file was being written via s3cmd, because riak-cs consumed all available memory. Since recovering, we have been experiencing the following problems:

1) Storage statistics calculation is very slow (about two hours). Before the OOM it completed in 40 minutes.

2) The bucket touristerru contains 4,000,000+ files, and no storage statistics are calculated for it.

riak-cs/error.log:
2014-08-11 00:12:28.096 [error] <0.5987.49>@riak_cs_storage:maybe_sum_bucket:74 failed to calculate usage of bucket 'touristerru' of user 'OF6DQ0FRBTEVLKGY-X0 P'. Reason: {error,<<"{\"phase\":0,\"error\":\"[{vnode_proxy_timeout,

When we request the statistics for this user, we see something like this:

{u'Access': u'not_requested', u'Storage': {u'Errors': [], u'Samples': [{u'EndTime': u'20140811T091836Z', u'StartTime': u'20140811T091113Z', u'touristerru': u'{error,<<"{\\"phase\\":0,\\"error\\":\\"[{vnode_proxy_timeout,{228359630832953580969325755111919221821239459840,\'riak@192.168.0.8\'}}]\\",\\"input\\":\\"{<<48,111,58,185,253,24,64,48,197,1,20,36,130,111,222,189,75,202,107>>,<<\\\\\\"files/3/8/6/5/3/5/2/clones/870_527_fixedwidth.jpg\\\\\\">>}\\",\\"type\\":\\"result\\",\\"stack\\":\\"[{gen,do_call,4,[{file,\\\\\\"gen.erl\\\\\\"},{line,234}]},{riak_core_vnode_proxy,call,2,[{file,\\\\\\"src/riak_core_vnode_proxy.erl\\\\\\"},{line,109}]},{riak_pipe_vnode,queue_work_send,4,[{file,\\\\\\"src/riak_pipe_vnode.erl\\\\\\"},{line,333}]},{riak_pipe_vnode,queue_work_erracc,6,[{file,\\\\\\"src/riak_pipe_vnode.erl\\\\\\"},{line,281}]},{riak_kv_pipe_get,process,3,[{file,\\\\\\"src/riak_kv_pipe_get.erl\\\\\\"},{line,92}]},{riak_pipe_vnode_worker,process_input,3,[{file,\\\\\\"src/riak_pipe_vnode_worker.erl\\\\\\"},{line,445}]},{riak_pipe_vnode_worker,wait_for_input,...},...]\\"}">>}'}]}}

3) The storage calculation process crashes.

riak-cs/console.log:
2014-08-11 09:22:51.580 [warning] <0.24095.1>@riak_cs_storage_d:read_storage_schedule1:300 No storage schedule defined. Calculation must be triggered manually.
2014-08-11 09:22:51.580 [error] <0.438.0> gen_fsm riak_cs_storage_d in state calculating terminated with reason: no match of right hand value false in riak_cs_storage:sum_bucket/1 line 104
2014-08-11 09:22:51.580 [error] <0.438.0> CRASH REPORT Process riak_cs_storage_d with 1 neighbours exited with reason: no match of right hand value false in riak_cs_storage:sum_bucket/1 line 104 in gen_fsm:terminate/7 line 611
2014-08-11 09:22:51.581 [error] <0.153.0> Supervisor riak_cs_sup had child riak_cs_storage_d started with riak_cs_storage_d:start_link() at <0.438.0> exit with reason no match of right hand value false in riak_cs_storage:sum_bucket/1 line 104 in context child_terminated

What has been done so far:

System:
1) Upgraded RAM from 30 to 61 GB on every node.
2) Added some swap on an additional SSD (only to avoid OOM; sysctl vm.swappiness=0 is set).

Riak configs:
1) Increased cache_size in the backend config.
2) Set {mapred_reduce_phase_batch_size, 5000}.
3) Set {mapred_always_prereduce, true}.

Riak-CS configs:
1) Set {storage_archive_period, 14400}.
2) Upgraded from 1.4.8 to 1.5.0.

Configs: http://ovh.to/iwTiMby
Last logs: http://ovh.to/AHaASw

I don't know what to do next.

--
Stanislav