Would you paste the data for one core from /proc/cpuinfo? And do you know the brand of controller running the SSD drives?
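
For example (assuming the usual blank-line-separated layout of /proc/cpuinfo), something like this should grab just the first core's block:

    # print only the first processor entry from /proc/cpuinfo
    awk 'BEGIN { RS = "" } NR == 1 { print; exit }' /proc/cpuinfo
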
Thank you,
Matthew

On Nov 6, 2012, at 7:04 PM, Dietrich Featherston wrote:

> Thanks for the feedback. We haven't noticed any drop in riak responsiveness--quite the opposite. We were just alarmed at some of the iostat information we were seeing, which may very well result from, as you pointed out, greater concurrency in layers above the disk subsystem. It's not of concern at the moment.
>
> As for the cluster itself--we are running 9 physical nodes with a ring size of 64. Each with a 2.93 GHz 8-core Xeon and ~2 TB of RAID0 SSD storage across 6 physical drives.
>
> Thanks again for looking into this.
>
> D
>
> On Tue, Nov 6, 2012 at 3:19 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
> Dietrich,
>
> I finally reviewed your LOG.all today. The basic analysis is:
>
> - you have a really fast disk subsystem, and
> - your machine is bored.
>
> I make the first comment based upon the fact that your Level-0 file creations take less than 200 ms on 40 MByte files with 10,000 keys (or more). I would like to know more specifics about your hardware and operating system (and disk parameters). That information is likely worthy of sharing with all.
>
> I make the second comment after running a shallow analysis of all compaction activity per minute in LOG.all. Your current ingest rate (the rate of adding data to leveldb) is not very high compared to other sites I have seen. So we should continue this discussion if you still believe there is an overall slowdown.
>
> I will note that the authors of Erlang have given us some suggested changes to our Erlang-to-leveldb interface this past weekend. That information could also give leveldb a throughput boost if proven valid. I'll keep you posted.
>
> But at this time I see nothing that yells "massive slowdown". I am of course open to being wrong.
>
> Matthew
>
> On Nov 2, 2012, at 10:32 AM, Dietrich Featherston wrote:
>
>> Here's the level output from one of our upgraded nodes. Included our app.config as well. Will continue looking for clues. I can put together some snapshots of our sst file sizes across the cluster if you think that would help as well.
>>
>> On Fri, Nov 2, 2012 at 6:19 AM, Matthew Von-Maszewski <matth...@basho.com> wrote:
>> Dietrich,
>>
>> I can make two guesses about the increased disk writes. But I am also willing to review your actual LOG files to isolate the root cause. If you could run the following and post the resulting file from one server, I will review it over the weekend or early next week:
>>
>> sort /var/lib/riak/leveldb/*/LOG >LOG.all
>>
>> The file will compress well. And no need to stop the server, just gather the LOG data live.
>>
>> Guess 1: your data is in a transition phase. 1.1 used 2 Megabyte files exclusively. 1.2 is resizing the files to much larger sizes during a compaction. You could be seeing a larger number of files than usual participating in each compaction as the file sizes change. While this is possible, I have doubts … hence this is a guess.
>>
>> Guess 2: I increased the various leveldb file sizes to reduce the number of opens and closes, both for writes and random reads. This helped latencies in both the compactions and random reads. Any compaction in 1.2 is likely to reread and write a larger total number of bytes. While this is possible, I again have doubts … the number of read operations should also go up if this guess is correct. Your read operations have not increased. This guess might still be valid if the read operations were satisfied by the Linux memory data cache. I do not know how those would be counted or not counted.
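>>
>> One way to watch the file-size transition described in Guess 1 is to bucket the .sst table file sizes under the data directory. This is just a sketch and assumes the default /var/lib/riak/leveldb path; adjust to your layout:
>>
>>     # count table files per size bucket (in MB) across all vnodes
>>     find /var/lib/riak/leveldb -name '*.sst' -printf '%s\n' \
>>       | awk '{ print int($1 / 1048576) " MB" }' | sort -n | uniq -c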
>> Matthew
>>
>> On Nov 1, 2012, at 10:01 PM, Dietrich Featherston wrote:
>>
>>> Will check on that.
>>>
>>> Can you think of anything that would explain the 5x increase in disk writes we are seeing with the same workload?
>>>
>>> On Thu, Nov 1, 2012 at 6:03 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
>>> Look for any activity in the LOG. Level-0 "creations" are fast and not typically relevant. You would be most interested in LOG lines containing "Compacting" (start) and "Compacted" (end). The time in between is when the throttle applies. The idea is that these compaction events can pile up, one after another, with multiple overlapping. It is during these heavy times that the throttle saves the user experience.
>>>
>>> Matthew
>>>
>>> On Nov 1, 2012, at 8:54 PM, Dietrich Featherston wrote:
>>>
>>>> Thanks. The amortized stalls may very well describe what we are seeing. If I combine leveldb logs from all partitions on one of the upgraded nodes, what should I look for in terms of compaction activity to verify this?
>>>>
>>>> On Thu, Nov 1, 2012 at 5:48 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
>>>> Dietrich,
>>>>
>>>> I can see your concern with the write IOs statistic. Let me comment on the easy question first: block_size.
>>>>
>>>> The block_size parameter in 1.1 was not getting passed to leveldb from the erlang layer. You were using a 4096 byte block parameter no matter what you typed in the app.config. The block_size is used by leveldb as a threshold. Once you accumulate enough data above that threshold, the current block is written to disk and a new one started. If you have 10k data values, you get one data item per block and its size is ~10k. If you have 1k data values, you get about four per block and the block is about 4k.
>>>>
>>>> We recommend 4k blocks to help read performance. The entire block has to run through decompression and potentially CRC calculation when it comes off the disk. That CPU time really kills any disk performance gains from having larger blocks. OK, that might change in 1.3 as we enable hardware CRC … but only if you have "verify_checksums, true" in app.config.
>>>>
>>>> Back to performance: I have not seen the change your graph details when testing with SAS drives under moderate load. I am only today starting qualification tests with SSD drives.
>>>>
>>>> But my 1.2 and 1.3 tests focus on drive / Riak saturation. 1.1 has the nasty tendency to stall (intentionally) when we saturate the write side of leveldb. The stall was measured in seconds or even minutes in 1.1. 1.2.1 has a write throttle that forecasts leveldb's stall state and incrementally slows the individual writes to prevent the stalls. Maybe that is what is being seen in the graph. The only way to know for sure is to get a dump of your leveldb LOG files, combine them, then compare compaction activity to your graph.
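>>>>
>>>> As a rough sketch (assuming the LOG lines carry leveldb's usual YYYY/MM/DD-HH:MM:SS timestamp prefix), something like this gives a per-minute count of compaction starts to line up against the graph:
>>>>
>>>>     # combine every vnode's LOG, then count "Compacting" events per minute
>>>>     sort /var/lib/riak/leveldb/*/LOG > LOG.all
>>>>     grep Compacting LOG.all | cut -c1-16 | uniq -c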
>>>> Write stalls are detailed here:
>>>> http://basho.com/blog/technical/2012/10/30/leveldb-in-riak-1p2/
>>>>
>>>> How can I better assist you at this point?
>>>>
>>>> Matthew
>>>>
>>>> On Nov 1, 2012, at 8:13 PM, Dietrich Featherston wrote:
>>>>
>>>>> We've just gone through the process of upgrading two riak clusters from 1.1 to 1.2.1. Both are on the leveldb backend, backed by RAID0'd SSDs. The process has gone smoothly and we see that latencies as measured at the gen_fsm level are largely unaffected.
>>>>>
>>>>> However, we are seeing some troubling disk statistics and I'm looking for an explanation before we upgrade the remainder of our nodes. The source of the worry seems to be a huge amplification in the number of writes serviced by the disk, which may be the cause of rising io wait times.
>>>>>
>>>>> My first thought was that this could be due to some leveldb tuning in 1.2.1 which increases file sizes per the release notes (https://github.com/basho/riak/blob/master/RELEASE-NOTES.md). But nodes that were upgraded yesterday are still showing this symptom. I would have expected any block re-writing to have subsided by now.
>>>>>
>>>>> My next hypothesis has to do with block size overriding in app.config. In 1.1, we had specified custom block sizes of 256k. We removed this prior to upgrading to 1.2.1 on the advice of #riak, since block size configuration was ignored prior to 1.2 ('"block_size" parameter within app.config for leveldb was ignored. This parameter is now properly passed to leveldb.' --> https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105). I'm wondering if the block size parameter really was being passed to leveldb, and having removed it, blocks are now being rewritten to a new size, perhaps different from what they were being written as before (https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155).
>>>>>
>>>>> Here is the output of the following script showing the increased writes to disk (https://gist.github.com/37319a8ed2679bb8b21d):
>>>>>
>>>>> --an upgraded 1.2.1 node--
>>>>> read ios: 238406742
>>>>> write ios: 4814320281
>>>>> read/write ratio: .04952033
>>>>> avg wait: .10712340
>>>>> read wait: .49174364
>>>>> write wait: .42695475
>>>>>
>>>>> --a node still running 1.1--
>>>>> read ios: 267770032
>>>>> write ios: 944170656
>>>>> read/write ratio: .28360342
>>>>> avg wait: .34237204
>>>>> read wait: .47222371
>>>>> write wait: 1.83283749
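>>>>>
>>>>> (The script itself is in the gist; as a rough approximation, numbers like these can be derived from /proc/diskstats along the following lines, with "sda" standing in for the actual device:)
>>>>>
>>>>>     # approximate read/write IO counts and waits from /proc/diskstats
>>>>>     # fields: $4 reads completed, $7 ms reading, $8 writes completed, $11 ms writing
>>>>>     awk '$3 == "sda" {
>>>>>         printf "read ios: %d\nwrite ios: %d\n", $4, $8
>>>>>         printf "read/write ratio: %.8f\n", $4 / $8
>>>>>         printf "avg wait: %.8f\n", ($7 + $11) / ($4 + $8)
>>>>>         printf "read wait: %.8f\nwrite wait: %.8f\n", $7 / $4, $11 / $8
>>>>>     }' /proc/diskstats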
>>>>>
>>>>> And here's what munin is showing us in terms of avg io wait times.
>>>>>
>>>>> <image.png>
>>>>>
>>>>> Any thoughts on what might explain this?
>>>>>
>>>>> Thanks,
>>>>> D
>>
>> <disk-latency-riak-1.2.1.tar.gz>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com