Would you paste the data for one core from /proc/cpuinfo? And do you know the brand of controller running the SSD drives?
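
For example (assuming the usual blank-line-separated layout of /proc/cpuinfo), something like this should grab just the first core's block:

    # print only the first processor entry from /proc/cpuinfo
    awk 'BEGIN { RS = "" } NR == 1 { print; exit }' /proc/cpuinfo
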
Thank you,
Matthew

On Nov 6, 2012, at 7:04 PM, Dietrich Featherston wrote:

> Thanks for the feedback. We haven't noticed any drop in riak responsiveness--quite the opposite. We were just alarmed at some of the iostat information we were seeing, which may very well result from, as you pointed out, greater concurrency in layers above the disk subsystem. It's not of concern at the moment.
>
> As for the cluster itself--we are running 9 physical nodes with a ring size of 64. Each with a 2.93 GHz 8-core Xeon and ~2 TB of RAID0 SSD storage across 6 physical drives.
>
> Thanks again for looking into this.
>
> D
>
> On Tue, Nov 6, 2012 at 3:19 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
> Dietrich,
>
> I finally reviewed your LOG.all today. The basic analysis is:
>
> - you have a really fast disk subsystem, and
> - your machine is bored.
>
> I make the first comment based upon the fact that your Level-0 file creations take less than 200 ms on 40 MByte files with 10,000 keys (or more). I would like to know more specifics about your hardware and operating system (and disk parameters). That information is likely worthy of sharing with all.
>
> I make the second comment after running a shallow analysis of all compaction activity per minute in LOG.all. Your current ingest rate (the rate of adding data to leveldb) is not very high compared to other sites I have seen. So we should continue this discussion if you still believe there is an overall slowdown.
>
> I will note that the authors of Erlang have given us some suggested changes to our Erlang-to-leveldb interface this past weekend. That information could also give leveldb a throughput boost if proven valid. I'll keep you posted.
>
> But at this time I see nothing that yells "massive slowdown". I am of course open to being wrong.
>
> Matthew
>
> On Nov 2, 2012, at 10:32 AM, Dietrich Featherston wrote:
>
>> Here's the level output from one of our upgraded nodes. Included our app.config as well. Will continue looking for clues. I can put together some snapshots of our sst file sizes across the cluster if you think that would help as well.
>>
>> On Fri, Nov 2, 2012 at 6:19 AM, Matthew Von-Maszewski <matth...@basho.com> wrote:
>> Dietrich,
>>
>> I can make two guesses about the increased disk writes. But I am also willing to review your actual LOG files to isolate the root cause. If you could run the following and post the resulting file from one server, I will review it over the weekend or early next week:
>>
>> sort /var/lib/riak/leveldb/*/LOG >LOG.all
>>
>> The file will compress well. And no need to stop the server, just gather the LOG data live.
>>
>> Guess 1: your data is in a transition phase. 1.1 used 2 Megabyte files exclusively. 1.2 is resizing the files to much larger sizes during a compaction. You could be seeing a larger number of files than usual participating in each compaction as the file sizes change. While this is possible, I have doubts … hence this is a guess.
>>
>> Guess 2: I increased the various leveldb file sizes to reduce the number of opens and closes, both for writes and random reads. This helped latencies in both the compactions and random reads. Any compaction in 1.2 is likely to reread and write a larger total number of bytes. While this is possible, I again have doubts … the number of read operations should also go up if this guess is correct. Your read operations have not increased. This guess might still be valid if the read operations were satisfied by the Linux memory data cache. I do not know how those would be counted or not counted.
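>>
>> One way to watch the file-size transition described in Guess 1 is to bucket the .sst table file sizes under the data directory. This is just a sketch and assumes the default /var/lib/riak/leveldb path; adjust to your layout:
>>
>>     # count table files per size bucket (in MB) across all vnodes
>>     find /var/lib/riak/leveldb -name '*.sst' -printf '%s\n' \
>>       | awk '{ print int($1 / 1048576) " MB" }' | sort -n | uniq -c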
>> Matthew
>>
>> On Nov 1, 2012, at 10:01 PM, Dietrich Featherston wrote:
>>
>>> Will check on that.
>>>
>>> Can you think of anything that would explain the 5x increase in disk writes we are seeing with the same workload?
>>>
>>> On Thu, Nov 1, 2012 at 6:03 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
>>> Look for any activity in the LOG. Level-0 "creations" are fast and not typically relevant. You would be most interested in LOG lines containing "Compacting" (start) and "Compacted" (end). The time in between is when the throttle applies. The idea is that these compaction events can pile up, one after another, with multiple overlapping. It is during these heavy times that the throttle saves the user experience.
>>>
>>> Matthew
>>>
>>> On Nov 1, 2012, at 8:54 PM, Dietrich Featherston wrote:
>>>
>>>> Thanks. The amortized stalls may very well describe what we are seeing. If I combine leveldb logs from all partitions on one of the upgraded nodes, what should I look for in terms of compaction activity to verify this?
>>>>
>>>> On Thu, Nov 1, 2012 at 5:48 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
>>>> Dietrich,
>>>>
>>>> I can see your concern with the write IOs statistic. Let me comment on the easy question first: block_size.
>>>>
>>>> The block_size parameter in 1.1 was not getting passed to leveldb from the erlang layer. You were using a 4096 byte block parameter no matter what you typed in the app.config. The block_size is used by leveldb as a threshold. Once you accumulate enough data above that threshold, the current block is written to disk and a new one started. If you have 10k data values, you get one data item per block and its size is ~10k. If you have 1k data values, you get about four per block and the block is about 4k.
>>>>
>>>> We recommend 4k blocks to help read performance. The entire block has to run through decompression and potentially CRC calculation when it comes off the disk. That CPU time really kills any disk performance gains from having larger blocks. OK, that might change in 1.3 as we enable hardware CRC … but only if you have "verify_checksums, true" in app.config.
>>>>
>>>> Back to performance: I have not seen the change your graph details when testing with SAS drives under moderate load. I am only today starting qualification tests with SSD drives.
>>>>
>>>> But my 1.2 and 1.3 tests focus on drive / Riak saturation. 1.1 has the nasty tendency to stall (intentionally) when we saturate the write side of leveldb. The stall was measured in seconds or even minutes in 1.1. 1.2.1 has a write throttle that forecasts leveldb's stall state and incrementally slows the individual writes to prevent the stalls. Maybe that is what is being seen in the graph. The only way to know for sure is to get a dump of your leveldb LOG files, combine them, then compare compaction activity to your graph.
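>>>>
>>>> As a rough sketch (assuming the LOG lines carry leveldb's usual YYYY/MM/DD-HH:MM:SS timestamp prefix), something like this gives a per-minute count of compaction starts to line up against the graph:
>>>>
>>>>     # combine every vnode's LOG, then count "Compacting" events per minute
>>>>     sort /var/lib/riak/leveldb/*/LOG > LOG.all
>>>>     grep Compacting LOG.all | cut -c1-16 | uniq -c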
>>>> Write stalls are detailed here:
>>>> http://basho.com/blog/technical/2012/10/30/leveldb-in-riak-1p2/
>>>>
>>>> How can I better assist you at this point?
>>>>
>>>> Matthew
>>>>
>>>> On Nov 1, 2012, at 8:13 PM, Dietrich Featherston wrote:
>>>>
>>>>> We've just gone through the process of upgrading two riak clusters from 1.1 to 1.2.1. Both are on the leveldb backend, backed by RAID0'd SSDs. The process has gone smoothly and we see that latencies as measured at the gen_fsm level are largely unaffected.
>>>>>
>>>>> However, we are seeing some troubling disk statistics and I'm looking for an explanation before we upgrade the remainder of our nodes. The source of the worry seems to be a huge amplification in the number of writes serviced by the disk, which may be the cause of rising io wait times.
>>>>>
>>>>> My first thought was that this could be due to some leveldb tuning in 1.2.1 which increases file sizes per the release notes (https://github.com/basho/riak/blob/master/RELEASE-NOTES.md). But nodes that were upgraded yesterday are still showing this symptom. I would have expected any block re-writing to have subsided by now.
>>>>>
>>>>> My next hypothesis has to do with block size overriding in app.config. In 1.1, we had specified custom block sizes of 256k. We removed this prior to upgrading to 1.2.1 on the advice of #riak, since block size configuration was ignored prior to 1.2 ('"block_size" parameter within app.config for leveldb was ignored. This parameter is now properly passed to leveldb.' --> https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105). I'm wondering if the block size parameter really was being passed to leveldb, and having removed it, blocks are now being rewritten to a new size, perhaps different from what they were being written as before (https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155).
>>>>>
>>>>> Here is the output of the following script showing the increased writes to disk (https://gist.github.com/37319a8ed2679bb8b21d):
>>>>>
>>>>> --an upgraded 1.2.1 node--
>>>>> read ios: 238406742
>>>>> write ios: 4814320281
>>>>> read/write ratio: .04952033
>>>>> avg wait: .10712340
>>>>> read wait: .49174364
>>>>> write wait: .42695475
>>>>>
>>>>> --a node still running 1.1--
>>>>> read ios: 267770032
>>>>> write ios: 944170656
>>>>> read/write ratio: .28360342
>>>>> avg wait: .34237204
>>>>> read wait: .47222371
>>>>> write wait: 1.83283749
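>>>>>
>>>>> (The script itself is in the gist; as a rough approximation, numbers like these can be derived from /proc/diskstats along the following lines, with "sda" standing in for the actual device:)
>>>>>
>>>>>     # approximate read/write IO counts and waits from /proc/diskstats
>>>>>     # fields: $4 reads completed, $7 ms reading, $8 writes completed, $11 ms writing
>>>>>     awk '$3 == "sda" {
>>>>>         printf "read ios: %d\nwrite ios: %d\n", $4, $8
>>>>>         printf "read/write ratio: %.8f\n", $4 / $8
>>>>>         printf "avg wait: %.8f\n", ($7 + $11) / ($4 + $8)
>>>>>         printf "read wait: %.8f\nwrite wait: %.8f\n", $7 / $4, $11 / $8
>>>>>     }' /proc/diskstats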
>>>>>
>>>>> And here's what munin is showing us in terms of avg io wait times.
>>>>>
>>>>> <image.png>
>>>>>
>>>>> Any thoughts on what might explain this?
>>>>>
>>>>> Thanks,
>>>>> D
>>
>> <disk-latency-riak-1.2.1.tar.gz>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com