Dietrich,

I can make two guesses about the increased disk writes, but I am also willing to review your actual LOG files to isolate the root cause. If you could run the following and post the resulting file from one server, I will review it over the weekend or early next week:

sort /var/lib/riak/leveldb/*/LOG >LOG.all

The file will compress well. And there is no need to stop the server; just gather the LOG data live.
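If it is easier to script, the same gather step plus compression looks roughly like this (a minimal sketch only; it assumes the default leveldb data root of /var/lib/riak/leveldb shown above and that gzip is available on the node):

    # combine every vnode's LOG; leveldb LOG lines begin with a timestamp,
    # so a plain sort interleaves them in rough chronological order
    sort /var/lib/riak/leveldb/*/LOG > LOG.all

    # compress for posting; the LOG text is repetitive and shrinks well
    gzip -9 LOG.all        # leaves LOG.all.gz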
Guess 1: your data is in a transition phase. 1.1 used 2-megabyte files exclusively. 1.2 resizes the files to much larger sizes during a compaction. You could be seeing a larger number of files than usual participating in each compaction as the file sizes change. While this is possible, I have doubts … hence this is a guess.

Guess 2: I increased the various leveldb file sizes to reduce the number of file opens and closes, both for writes and random reads. This helped latencies in both the compactions and the random reads. Any compaction in 1.2 is therefore likely to reread and write a larger total number of bytes. While this is possible, I again have doubts … the number of read operations should also go up if this guess is correct, and your read operations have not increased. The guess might still be valid if the read operations were satisfied by the Linux memory data cache; I do not know how those would be counted or not counted.

Matthew

On Nov 1, 2012, at 10:01 PM, Dietrich Featherston wrote:

> Will check on that.
>
> Can you think of anything that would explain the 5x increase in disk writes we are seeing with the same workload?
>
>
> On Thu, Nov 1, 2012 at 6:03 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
> Look for any activity in the LOG. Level-0 "creations" are fast and not typically relevant. You would be most interested in LOG lines containing "Compacting" (start) and "Compacted" (end). Writes will be throttled during the time in between. The idea is that these compaction events can pile up, one after another and multiple overlapping. It is during these heavy periods that the throttle saves the user experience.
>
> Matthew
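For a first pass over the combined file, those compaction markers can be pulled out with grep (a minimal sketch, assuming the LOG.all produced by the sort command above; it matches nothing more specific than the two words named in the quoted message):

    # how many compactions started vs. finished
    grep -c "Compacting" LOG.all
    grep -c "Compacted"  LOG.all

    # interleaved start/end events in time order, to eyeball pile-ups and overlap
    grep -E "Compacting|Compacted" LOG.all | less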
> On Nov 1, 2012, at 8:54 PM, Dietrich Featherston wrote:
>
>> Thanks. The amortized stalls may very well describe what we are seeing. If I combine leveldb logs from all partitions on one of the upgraded nodes, what should I look for in terms of compaction activity to verify this?
>>
>>
>> On Thu, Nov 1, 2012 at 5:48 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:
>> Dietrich,
>>
>> I can see your concern with the write IOs statistic. Let me comment on the easy question first: block_size.
>>
>> The block_size parameter in 1.1 was not getting passed to leveldb from the Erlang layer. You were using a 4096-byte block parameter no matter what you typed in the app.config. The block_size is used by leveldb as a threshold: once you accumulate enough data above that threshold, the current block is written to disk and a new one is started. If you have 10k data values, you get one data item per block and its size is ~10k. If you have 1k data values, you get about four per block and the block is about 4k.
>>
>> We recommend 4k blocks to help read performance. The entire block has to run through decompression and potentially a CRC calculation when it comes off the disk. That CPU time really kills any disk performance gains from having larger blocks. OK, that might change in 1.3 as we enable hardware CRC … but only if you have "verify_checksums, true" in app.config.
>>
>>
>> Back to performance: I have not seen the change your graph details when testing with SAS drives under moderate load. I am only today starting qualification tests with SSD drives.
>>
>> But my 1.2 and 1.3 tests focus on drive / Riak saturation. 1.1 has the nasty tendency to stall (intentionally) when we saturate the write side of leveldb. The stall was measured in seconds or even minutes in 1.1. 1.2.1 has a write throttle that forecasts leveldb's stall state and incrementally slows the individual writes to prevent the stalls. Maybe that is what is being seen in the graph. The only way to know for sure is to get a dump of your leveldb LOG files, combine them, and then compare the compaction activity to your graph.
>>
>> Write stalls are detailed here:
>> http://basho.com/blog/technical/2012/10/30/leveldb-in-riak-1p2/
>>
>> How can I better assist you at this point?
>>
>> Matthew
>>
>>
>> On Nov 1, 2012, at 8:13 PM, Dietrich Featherston wrote:
>>
>>> We've just gone through the process of upgrading two riak clusters from 1.1 to 1.2.1. Both are on the leveldb backend backed by RAID0'd SSDs. The process has gone smoothly and we see that latencies as measured at the gen_fsm level are largely unaffected.
>>>
>>> However, we are seeing some troubling disk statistics and I'm looking for an explanation before we upgrade the remainder of our nodes. The source of the worry seems to be a huge amplification in the number of writes serviced by the disk, which may be the cause of rising io wait times.
>>>
>>> My first thought was that this could be due to some leveldb tuning in 1.2.1 which increases file sizes per the release notes (https://github.com/basho/riak/blob/master/RELEASE-NOTES.md). But nodes that were upgraded yesterday are still showing this symptom. I would have expected any block re-writing to have subsided by now.
>>>
>>> The next hypothesis has to do with block size overriding in app.config. In 1.1, we had specified custom block sizes of 256k. We removed this prior to upgrading to 1.2.1 at the advice of #riak, since block size configuration was ignored prior to 1.2 ('"block_size" parameter within app.config for leveldb was ignored. This parameter is now properly passed to leveldb.' --> https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105). I'm wondering if the block size parameter really was being passed to leveldb, and having removed it, blocks are now being rewritten to a new size, perhaps different from what they were being written as before (https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155).
>>>
>>> Here is the output of the following script showing the increased writes to disk (https://gist.github.com/37319a8ed2679bb8b21d):
>>>
>>> --an upgraded 1.2.1 node--
>>> read ios: 238406742
>>> write ios: 4814320281
>>> read/write ratio: .04952033
>>> avg wait: .10712340
>>> read wait: .49174364
>>> write wait: .42695475
>>>
>>>
>>> --a node still running 1.1--
>>> read ios: 267770032
>>> write ios: 944170656
>>> read/write ratio: .28360342
>>> avg wait: .34237204
>>> read wait: .47222371
>>> write wait: 1.83283749
>>>
>>> And here's what munin is showing us in terms of avg io wait times.
>>>
>>> <image.png>
>>>
>>>
>>> Any thoughts on what might explain this?
>>>
>>> Thanks,
>>> D
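For reference, figures of the same general shape as the script output above (IO counts and per-IO waits) can be read straight out of /proc/diskstats. This is only a rough sketch and not the linked gist; the device name sda is a stand-in for whatever actually backs the RAID0 volume:

    # /proc/diskstats per-device fields: $4 reads completed, $7 ms spent reading,
    # $8 writes completed, $11 ms spent writing; waits below are ms per completed IO
    awk '$3 == "sda" {
        printf "read ios: %d\n", $4
        printf "write ios: %d\n", $8
        printf "read/write ratio: %.8f\n", ($8 ? $4 / $8 : 0)
        printf "read wait: %.8f\n", ($4 ? $7 / $4 : 0)
        printf "write wait: %.8f\n", ($8 ? $11 / $8 : 0)
    }' /proc/diskstats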
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com