Parnell,

I confirmed with the Basho team that "list_keys" is a read-only process. Yes, some read operations would initiate compactions in Riak 1.1, but you are running 1.2.1. I therefore suspect that there is a secondary issue.
Would you mind gathering the LOG files from one of the machines that you had to mark down, and telling me the date/time of the last problem on that machine? The following command (with the path changed as appropriate) should gather the LOGs just fine. Then tar/zip the output file before emailing it.

    sort /var/lib/riak/leveldb/*/LOG* >LOG_all.txt

I am going to guess that the 24 core machine is not having this problem. Would you send a zip file of its LOG also? I want to compare the throughput differences.

Matthew

On Jan 8, 2013, at 2:53 PM, Parnell Springmeyer wrote:

> Matthew,
>
> 1. 1.2.1
> 2.
>    {eleveldb, [
>        {data_root, "/var/riak/data/leveldb"}
>    ]}
> 3. I'm running 5 physical servers with one Riak node per server.
> 4. Unfortunately all the machines are a hodge podge of parts; we're soon
>    going to move to buying our own hardware and coloing it; here's the
>    server list (all machines are FreeBSD 9):
>
>    Cores  CPU Model    CPU Speed  RAM    HDD Model                                              HDD Size
>    24     Xeon X5650   2.67GHz    48GiB  2X INTEL SSDSA2CW30 on LSI MegaRaid SAS 2108 mirrored  280GB
>    4      Xeon E5560   2.13GHz    12GiB  Barracuda ST3500418AS                                  500GB
>    8      Xeon E31230  3.2GHz     8GiB   WDC WD1600JS                                           160GB
>    4      Core2 Q8400  2.66GHz    8GiB   Seagate ST500DM002-1BD142                              500GB
>    4      Xeon L5320   1.86GHz    12GiB                                                         500GB
>
> 5. There were no "waiting" entries, but quite a few compaction entries.
>    I haven't studied leveldb enough to know if that's "normal" or if it
>    indicates heavy compaction.
>
> The compaction event seemed to be triggered by someone issuing a
> list_keys operation; four servers pretty much became unresponsive while
> they were doing compaction. After about an hour only two were dealing
> with compaction, but it was still causing the entire cluster to respond
> with timeouts to index().run() queries and M/R jobs.
>
> I took down those two nodes and marked them as down (riak-admin down)
> and the timeouts disappeared and the cluster operated as it should. So I
> waited till 1AM last night to start the two machines up so they could
> finish compaction. I'm somewhat surprised there isn't a method for
> marking machines as "unavailable" in the event of heavy compaction -
> that way they can finish compacting and the cluster can treat the node
> as unavailable. I don't know how difficult that is though.
>
>> Parnell,
>>
>> Would appreciate some configuration info:
>>
>> - what version of Riak are you running?
>>
>> - would you copy/paste the eleveldb section of your app.config?
>>
>> - how many vnodes and physical servers are you running?
>>
>> - what is hardware? cpu, memory, disk arrays
>>
>> - are you seeing the word "waiting" in your LOG files?
>>
>>
>> Not sure that the above info will lead to a solution. But it is a start.

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com