Undiagnosed High FSM Time
We have a 5-node Riak cluster running 2.1.1. This morning FSM Time (99th percentile) went way up. We couldn't find any clear signs of trouble with the cluster and ultimately chose to move the data files and restart the nodes. Once we started with an empty DB, the FSM Time normalized. But now it's headed back up again. We're stumped on how to troubleshoot this issue. Any suggestions?

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: Undiagnosed High FSM Time
Thanks for your reply. We are. We sort of expected an anomaly in the object size, but there was none.

We found the root cause. It was a large number of additions to a single set. It’s not clear to me which metric reveals that problem, but it appears as though object size doesn’t.

Alex

> On Jan 26, 2016, at 3:40 PM, Luke Bakken wrote:
>
> Hi Alex -
>
> Are you monitoring any of Riak's statistics? Specifically object size
> and sibling count, though all of the stats are useful.
>
> --
> Luke Bakken
> Engineer
> lbak...@basho.com
>
> On Tue, Jan 26, 2016 at 11:40 AM, Alex Wolfe wrote:
>> We have a 5 node Riak cluster running 2.1.1. This morning FSM Time (99th
>> percentile) went way up. We couldn't find any clear signs of trouble with
>> the cluster and ultimately chose to move the data files and restart the
>> nodes. Once we started with an empty DB, the FSM Time normalized. But now
>> it's headed back up again. We're stumped on how to trouble shoot this issue.
>> Any suggestions?
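On the open question of which metric would have revealed this, here is a minimal sketch, assuming the JSON from a node's HTTP /stats endpoint has already been fetched and parsed into a dict. The stat names and every threshold below are assumptions, not something confirmed in this thread. In particular, the set-specific names reflect the guess that data-type operations are reported separately from plain key/value object size, which would explain why object size looked normal. Check the names against your own node's /stats output before relying on them.

```python
from typing import Dict, List, Mapping

# Hypothetical thresholds; tune them for your own workload.
DEFAULT_LIMITS: Dict[str, float] = {
    "node_get_fsm_time_99": 100_000,       # microseconds
    "node_put_fsm_time_99": 100_000,       # microseconds
    "node_get_fsm_objsize_99": 1_000_000,  # bytes
    "node_get_fsm_siblings_99": 5,         # sibling count
    # Assumed names for set (data type) stats, reported apart from plain KV.
    "node_get_fsm_set_objsize_99": 1_000_000,
    "vnode_set_update_time_99": 100_000,
}


def flag_stats(stats: Mapping[str, object],
               limits: Mapping[str, float] = DEFAULT_LIMITS) -> List[str]:
    """Return the sorted names of stats that are numeric and above their limit."""
    flagged = []
    for name, limit in limits.items():
        value = stats.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value > limit:
            flagged.append(name)
    return sorted(flagged)
```

Running this against each node on a schedule and watching which names get flagged first is one way to see whether a set-specific stat moves before the FSM time does.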
Re: Write_lock error has occurred after inserting 12M data
$ lsof -p 16129 | awk '{print $9}'| uniq -c | grep lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/913438523331814323877303020447676887284957839360/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/959110449498405040071168171470060731649205731328/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/936274486415109681974235595958868809467081785344/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/411047335499316445744786359201454599278231027712/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/456719261665907161938651510223838443642478919680/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/433883298582611803841718934712646521460354973696/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/388211372416021087647853783690262677096107081728/bitcask.write.lock

On Jul 30, 2010, at 6:03 PM, David Smith wrote:

> Yup, that looks like the file handle leak. You can verify by using
> lsof on the server and looking for multiple handles to
> bitcask.write.lock. Something like:
>
> lsof -p pid | awk '{print $9}'| uniq -c
>
> D.
>
> On Friday, July 30, 2010, Alex Wolfe wrote:
>> Hey David.
>> Does the below log output look like it could be caused by the issue you
>> fixed?
>> Alex
>>
>> Fri Jul 30 14:22:34 CDT 2010
>> =ERROR REPORT 30-Jul-2010::14:22:34 ===
>> ** State machine <0.176.0> terminating
>> ** Last event in was {riak_vnode_req_v1,
>>     593735040165679310520246963290989976735222595584,
>>     {fsm,undefined,<0.12466.0>},
>>     {riak_kv_put_req_v1,
>>      {<<"test.groups">>,<<"EghzXywWrGGtp2fCcSLoatIdjML">>},
>>      {r_object,<<"test.groups">>,
>>       <<"EghzXywWrGGtp2fCcSLoatIdjML">>,
>>       [{r_content,
>>         {dict,5,16,16,8,80,48,
>>          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>>          {{[],[],[[<<"Links">>]],
>>            [],[],[],[],[],[],[],
>>            [[<<"content-type">>,97,112,112,108,105,99,97,
>>              116,105,111,110,47,106,115,111,110],
>>             [<<"X-Riak-VTag">>,89,69,78,55,55,111,66,121,73,
>>              69,78,53,122,101,85,105,117,68,89,80,52]],
>>            [],[],[[<<"X-Riak-Last-Modified">>|
>>                    {1280,517754,951062}]],[],
>>            [[<<"X-Riak-Meta">>]]}}},
>>         <<"{\"name\":\"foo\",\"created_at\":\"2010-07-30T19:22:34.947Z\",\"type\":\"group\",\"version\":1}">>}],
>>       [{<<0,55,119,231>>,{1,63447736954}}],
>>       {dict,1,16,16,8,80,48,
>>        {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>>        {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
>>          [[clean|true]], []}}},
>>       undefined}, 33218311,63447736954,
>>      [{returnbody,true}]}}
>> ** When State == active
>> ** Data == {state,593735040165679310520246963290989976735222595584,
>>     riak_kv_vnode,
>>     {state,593735040165679310520246963290989976735222595584,
>>      riak_kv_bitcask_backend,
>>      {#Ref<0.0.0.611>,
>>       "data/bitcask/593735040165679310520246963290989976735222595584"},
>>      [],false}, undefined,none}
>> ** Reason for termination =
>> ** {{badmatch,{error,emfile}},
>>     [{bitcask_fileops,create_file_loop,3},{bitcask,put,3},
>>      {riak_kv_bitcask_backend,put,3},{riak_kv_vnode,perform_put,3},
>>      {riak_kv_vnode,do_put,7},{riak_kv_vnode,handle_command,3},
>>      {riak_core_vnode,vnode_command,3},{gen_fsm,handle_msg,7}]}
>> =ERROR REPORT 30-Jul-2010::14:22:35 ===
>> Failed to open lock file
>> data/bitcask/5937
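One caveat on the lsof pipeline quoted above: `uniq -c` only collapses adjacent identical lines, so duplicate handles that are not next to each other in the listing would each show a count of 1. A minimal Python sketch of the same check that counts regardless of order follows. It assumes lsof's default column layout, where the file name is the ninth whitespace-separated field; the function name and the `suffix` parameter are made up for illustration.

```python
from collections import Counter
from typing import Dict


def duplicate_lock_handles(lsof_output: str,
                           suffix: str = "bitcask.write.lock") -> Dict[str, int]:
    """Map each lock-file path that is open more than once to its handle count."""
    counts: Counter = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Assumes lsof's default layout, with NAME in the ninth column.
        if len(fields) >= 9 and fields[8].endswith(suffix):
            counts[fields[8]] += 1
    return {path: n for path, n in counts.items() if n > 1}
```

To use it, pass in the text printed by `lsof -p <pid>` for the Riak process. An empty result means no lock file is held more than once, which matches the all-ones output pasted at the top of this message.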
Re: Write_lock error has occurred after inserting 12M data
IIRC, that was a full paste of all the bitcask.write.locks. Riak fails pretty much immediately while running my test suite, maybe before a lock is opened for each partition?

My ulimit was set to 256, which is obviously no good. After boosting it to 9000 and running my test suite, I have the locks shown below. Riak is still running. I guess that makes it an issue with max open files rather than a write lock issue?

$ lsof -p 53113 | awk '{print $9}'| uniq -c | grep lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/1438665674247607560106752257205091097473808596992/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/22835963083295358096932575511191922182123945984/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/0/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/776422744832042175295707567380525354192214163456/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/730750818665451459101842416358141509827966271488/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/753586781748746817198774991869333432010090217472/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/1164634117248063262943561351070788031288321245184/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/1187470080331358621040493926581979953470445191168/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/1210306043414653979137426502093171875652569137152/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/45671926166590716193865151022383844364247891968/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/91343852333181432387730302044767688728495783936/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/137015778499772148581595453067151533092743675904/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/114179815416476790484662877555959610910619729920/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/182687704666362864775460604089535377456991567872/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/228359630832953580969325755111919221821239459840/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/205523667749658222872393179600727299639115513856/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/593735040165679310520246963290989976735222595584/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/639406966332270026714112114313373821099470487552/bitcask.write.lock
   1 /usr/local/Cellar/riak/0.12.0/libexec/data/bitcask/616571003248974668617179538802181898917346541568/bitcask.write.lock

On Jul 30, 2010, at 8:34 PM, David Smith wrote:

> That's only a partial paste, correct? How many partitions
> ({ring_creation_size, 64} in your etc/app.config) do you have defined? There
> should be a write lock file open for each partition. Also, what is your
> ulimit -n set to?
>
> Thanks,
>
> D.
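The ulimit question in this exchange comes down to simple arithmetic: one write lock per partition, plus whatever else each partition and the node keep open. Here is a rough sketch for sanity-checking a limit against a ring size. The `handles_per_partition` and `reserve` defaults are invented for illustration and are not measured Bitcask behaviour. Only the one-lock-per-partition baseline comes from the thread.

```python
import resource


def estimated_fd_need(partitions: int,
                      handles_per_partition: int = 4,
                      reserve: int = 128) -> int:
    """Rough descriptor estimate; both default multipliers are assumptions."""
    return partitions * handles_per_partition + reserve


def fd_limit_ok(soft_limit: int, partitions: int, **kwargs: int) -> bool:
    """True if the soft open-files limit covers the estimated need."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured
    return soft_limit >= estimated_fd_need(partitions, **kwargs)


def current_soft_limit() -> int:
    """Soft RLIMIT_NOFILE of this process (what `ulimit -n` reports in a shell)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

Under these assumed defaults, a 64-partition ring needs about 384 descriptors, so a limit of 256 fails the check while 9000 passes comfortably. The real per-partition count grows with the number of data files, so treat the estimate as a floor rather than a target.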
Re: Pervasive replication
I've run into a problem with Riak on my development machine, and I can't quite sort out what's happening. I've tried stopping the Riak processes and starting them back up again, but Riak will not service any requests. Has anyone seen this before?

$ curl -v -X POST http://riak:8098/riak/test -d'{"foo":"bar"}' -H 'Content-Type:application/json'
* About to connect() to riak port 8098 (#0)
*   Trying ::1... Connection refused
*   Trying fe80::1... Connection refused
*   Trying 127.0.0.1... connected
* Connected to riak (127.0.0.1) port 8098 (#0)
> POST /riak/test HTTP/1.1
> User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3
> Host: riak:8098
> Accept: */*
> Content-Type:application/json
> Content-Length: 13
>
< HTTP/1.1 500 Internal Server Error
< Vary: Accept-Encoding
< Server: MochiWeb/1.1 WebMachine/1.7.1 (participate in the frantic)
< Location: /riak/test/RmfYZCM8LtBPRu4gqZivu8pfVoh
< Date: Tue, 05 Oct 2010 16:36:57 GMT
< Content-Type: text/html
< Content-Length: 713
<
500 Internal Server Error
Internal Server Error
The server encountered an error while processing this request:
{error,{error,{case_clause,{error,timeout}},
 [{riak_kv_wm_raw,accept_doc_body,2},
  {webmachine_resource,resource_call,3},
  {webmachine_resource,do,3},
  {webmachine_decision_core,resource_call,1},
  {webmachine_decision_core,accept_helper,0},
  {webmachine_decision_core,decision,1},
  {webmachine_decision_core,handle_request,2},
  {webmachine_mochiweb,loop,1}]}}
mochiweb+webmachine web server
* Connection #0 to host riak left intact
* Closing connection #0