(pasting offline email content for the mailing list's benefit)

On Nov 2, 2012, at 4:20 PM, Dietrich Featherston wrote:

> Thaaaaaaaaaaat's really interesting. I've also noticed that the sharp 
> increase in disk use is either sub-linear or is tapering off. Good stuff.
> 
> On Fri, Nov 2, 2012 at 1:00 PM, Matthew Von-Maszewski <matth...@basho.com> 
> wrote:
> Is there any chance that your input rate to leveldb / riak increased after 
> upgrading to 1.2.1?  I am not asking if you intentionally increased the input 
> (write) rate, but whether it happened to go up due to the changes in Riak and 
> leveldb.
> 
> I ask because a 5x increase just happens to match the fact that 1.1 had ONE 
> compaction thread and 1.2.1 has FIVE compaction threads (ok, more than five 
> actually … but only five active at any one time).  But this increased number 
> of threads would only produce a 5x increase for a short transition period unless 
> 1.2.1 also happened to allow more data to flow into the system from your 
> applications.
> 
> Matthew
> 
On Nov 2, 2012, at 10:32 AM, Dietrich Featherston wrote:

> Here's the level output from one of our upgraded nodes. Included our 
> app.config as well. Will continue looking for clues. I can put together some 
> snapshots of our sst file sizes across the cluster if you think that would 
> help as well.
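> 
> Something along these lines should capture that snapshot (a rough sketch, 
> assuming the default data root of /var/lib/riak/leveldb and GNU find):
> 
> # list every sst file with its size in bytes, smallest first
> find /var/lib/riak/leveldb -name '*.sst' -printf '%s %p\n' | sort -n > sst-sizes.txt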
> 
> On Fri, Nov 2, 2012 at 6:19 AM, Matthew Von-Maszewski <matth...@basho.com> 
> wrote:
> Dietrich,
> 
> I can make two guesses about the increased disk writes.  But I am also willing 
> to review your actual LOG files to isolate the root cause.  If you could run the 
> following and post the resulting file from one server, I will review it over 
> the weekend or early next week:
> 
> sort /var/lib/riak/leveldb/*/LOG >LOG.all
> 
> The file will compress well.  And no need to stop the server, just gather the 
> LOG data live.
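> 
> For example, something like this afterwards should get it down to a postable 
> size (a sketch; any compressor will do):
> 
> gzip -9 LOG.all     # produces LOG.all.gz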
> 
> Guess 1:  your data is in a transition phase.  1.1 used 2 Megabyte files 
> exclusively.  1.2 is resizing the files to much larger sizes during a 
> compaction.  You could be seeing a larger number of files than usual 
> participating in each compaction as the file sizes change.  While this is 
> possible, I have doubts … hence this is a guess.
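> 
> One way to check this guess (a rough sketch, assuming the default data root 
> of /var/lib/riak/leveldb and GNU find): if most .sst files are still around 
> 2 Megabytes, the transition is ongoing.
> 
> # count sst files per size bucket, in whole Megabytes
> find /var/lib/riak/leveldb -name '*.sst' -printf '%s\n' \
>   | awk '{ printf "%d MB\n", $1 / 1048576 }' | sort -n | uniq -c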
> 
> Guess 2:  I increased the various leveldb file sizes to reduce the number of 
> file opens and closes, both for writes and random reads.  This helped latencies in 
> both the compactions and random reads.  Any compaction in 1.2 is likely to 
> reread and write a larger total number of bytes.  While this is possible, I 
> again have doubts … the number of read operations should also go up if this 
> guess is correct, and your read operations have not increased.  This guess might 
> still be valid if the read operations were satisfied by the Linux memory data 
> cache.  I do not know how those would or would not be counted.
> 
> 
> Matthew
> 
> 
> On Nov 1, 2012, at 10:01 PM, Dietrich Featherston wrote:
> 
>> Will check on that.
>> 
>> Can you think of anything that would explain the 5x increase in disk writes 
>> we are seeing with the same workload?
>> 
>> 
>> On Thu, Nov 1, 2012 at 6:03 PM, Matthew Von-Maszewski <matth...@basho.com> 
>> wrote:
>> Look for any activity in the LOG.  Level-0 "creations" are fast and not 
>> typically relevant.  You would be most interested in LOG lines containing 
>> "Compacting" (start) and "Compacted" (end).  The time in between is when 
>> writes get throttled.  The idea is that these compaction events can pile up, 
>> one after another, with multiple overlapping.  It is during these heavy periods 
>> that the throttle saves the user experience.
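>> 
>> For example (a minimal sketch, assuming the combined LOG.all file described 
>> earlier in the thread):
>> 
>> # pull out just the compaction start/end lines to line up against the disk graphs
>> grep -E 'Compacting|Compacted' LOG.all > compactions.txt
>> wc -l compactions.txt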
>> 
>> Matthew
>> 
>> 
>> On Nov 1, 2012, at 8:54 PM, Dietrich Featherston wrote:
>> 
>>> Thanks. The amortized stalls may very well describe what we are seeing. If 
>>> I combine leveldb logs from all partitions on one of the upgraded nodes 
>>> what should I look for in terms of compaction activity to verify this?
>>> 
>>> 
>>> On Thu, Nov 1, 2012 at 5:48 PM, Matthew Von-Maszewski <matth...@basho.com> 
>>> wrote:
>>> Dietrich,
>>> 
>>> I can see your concern with the write IOS statistic.  Let me comment on the 
>>> easy question first:  block_size.
>>> 
>>> The block_size parameter in 1.1 was not getting passed to leveldb from the 
>>> erlang layer.  You were using a 4096 byte block parameter no matter what 
>>> you typed in the app.config.  The block_size is used by leveldb as a 
>>> threshold.  Once you accumulate enough data above that threshold, the 
>>> current block is written to disk and a new one started.  If you have 10k 
>>> data values, you get one data item per block and its size is ~10k.  If you 
>>> have 1k data values, you get about four per block and the block is about 4k.
>>> 
>>> We recommend 4k blocks to help read performance.  The entire block has to 
>>> run through decompression and potentially CRC calculation when it comes off 
>>> the disk.  That CPU time really kills any disk performance gains by having 
>>> larger blocks.  Ok, that might change in 1.3 as we enable hardware CRC … 
>>> but only if you have "verify_checksums, true" in app.config.
>>> 
>>> 
>>> Back to performance:  I have not seen the change your graph details when 
>>> testing with SAS drives under moderate load.  I am only today starting 
>>> qualification tests with SSD drives.
>>> 
>>> But my 1.2 and 1.3 tests focus on drive / Riak saturation.  1.1 has the 
>>> nasty tendency to stall (intentionally) when we saturate the write side of 
>>> leveldb.  The stall was measured in seconds or even minutes in 1.1.  
>>> 1.2.1 has a write throttle that forecasts leveldb's stall state and 
>>> incrementally slows the individual writes to prevent the stalls.  Maybe 
>>> that is what is being seen in the graph.  The only way to know for sure is 
>>> to get a dump of your leveldb LOG files, combine them, then compare 
>>> compaction activity to your graph.
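>>> 
>>> As a rough sketch of that comparison (assuming each leveldb LOG line begins 
>>> with a YYYY/MM/DD-HH:MM:SS timestamp, as stock leveldb writes), compaction 
>>> completions can be bucketed per hour and lined up against the munin graph:
>>> 
>>> # count "Compacted" events per hour across all partitions
>>> grep -h Compacted /var/lib/riak/leveldb/*/LOG | cut -c1-13 | sort | uniq -c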
>>> 
>>> Write stalls are detailed here:   
>>> http://basho.com/blog/technical/2012/10/30/leveldb-in-riak-1p2/
>>> 
>>> How can I better assist you at this point?
>>> 
>>> Matthew
>>> 
>>> 
>>> On Nov 1, 2012, at 8:13 PM, Dietrich Featherston wrote:
>>> 
>>>> We've just gone through the process of upgrading two riak clusters from 
>>>> 1.1  to 1.2.1. Both are on the leveldb backend backed by RAID0'd SSDs. The 
>>>> process has gone smoothly and we see that latencies as measured at the 
>>>> gen_fsm level are largely unaffected.
>>>> 
>>>> However, we are seeing some troubling disk statistics and I'm looking for 
>>>> an explanation before we upgrade the remainder of our nodes. The source of 
>>>> the worry seems to be a huge amplification in the number of writes 
>>>> serviced by the disk, which may be the cause of rising io wait times.
>>>> 
>>>> My first thought was that this could be due to some leveldb tuning in 
>>>> 1.2.1 which increases file sizes per the release notes 
>>>> (https://github.com/basho/riak/blob/master/RELEASE-NOTES.md). But nodes 
>>>> that were upgraded yesterday are still showing this symptom. I would have 
>>>> expected any block re-writing to have subsided by now.
>>>> 
>>>> The next hypothesis has to do with the block size override in app.config. In 
>>>> 1.1, we had specified a custom block size of 256k. We removed this prior to 
>>>> upgrading to 1.2.1 at the advice of #riak, since block size configuration 
>>>> was ignored prior to 1.2 ('"block_size" parameter within app.config for 
>>>> leveldb was ignored.  This parameter is now properly passed to leveldb.' 
>>>> --> 
>>>> https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105).
>>>> I'm wondering whether the block size parameter really was being passed to 
>>>> leveldb after all, and now that it has been removed, blocks are being rewritten 
>>>> at a new size, perhaps different from the size they were written at before 
>>>> (https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155).
>>>> 
>>>> Here is the output of the following script showing the increased writes to 
>>>> disk (https://gist.github.com/37319a8ed2679bb8b21d)
>>>> 
>>>> --an upgraded 1.2.1 node--
>>>> read ios: 238406742
>>>> write ios: 4814320281
>>>> read/write ratio: .04952033
>>>> avg wait: .10712340
>>>> read wait: .49174364
>>>> write wait: .42695475
>>>> 
>>>> 
>>>> --a node still running 1.1--
>>>> read ios: 267770032
>>>> write ios: 944170656
>>>> read/write ratio: .28360342
>>>> avg wait: .34237204
>>>> read wait: .47222371
>>>> write wait: 1.83283749
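>>>> 
>>>> (Not the linked gist itself, but as a rough sketch of where figures like 
>>>> these come from, assuming a single device named sda, the raw counters live 
>>>> in /proc/diskstats; waits are average ms per I/O:)
>>>> 
>>>> awk '$3 == "sda" {
>>>>   reads = $4; writes = $8; rms = $7; wms = $11   # ios completed, ms spent waiting
>>>>   printf "read ios: %d\nwrite ios: %d\n", reads, writes
>>>>   printf "read/write ratio: %.8f\n", reads / writes
>>>>   printf "avg wait: %.8f\n", (rms + wms) / (reads + writes)
>>>>   printf "read wait: %.8f\n", rms / reads
>>>>   printf "write wait: %.8f\n", wms / writes
>>>> }' /proc/diskstats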
>>>> 
>>>> And here's what munin is showing us in terms of avg io wait times.
>>>> 
>>>> <image.png>
>>>> 
>>>> 
>>>> Any thoughts on what might explain this?
>>>> 
>>>> Thanks,
>>>> D
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> <disk-latency-riak-1.2.1.tar.gz>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
