Dave,

One of the other developers is looking into whether the compaction counters can 
appear in riak-admin.  I should know by Tuesday whether this is possible for the 
1.3 release; 1.3's code freeze is Friday 11/9.  A leveldb counters interface 
should be in 1.3 either way, which allows external programs to monitor 
compactions (if the riak-admin change does not make the cut).
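
If the counters do land in riak-admin, monitoring them would presumably look 
something like this (a speculative sketch; the counter names and where they 
show up are not decided yet):

riak-admin status | grep -i compact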

Matthew

On Nov 2, 2012, at 11:18 AM, Dave Brady wrote:

> I would like to cast my vote for having compactions exposed by "riak-admin".
> 
> We can already observe hinted handoff transfers.  It would be very beneficial 
> to watch compactions in real-time, too.
> 
> Log scraping carries too many potential headaches.
> 
> From: "Matthew Von-Maszewski" <matth...@basho.com>
> To: "Dietrich Featherston" <d...@d2fn.com>
> Cc: riak-users@lists.basho.com
> Sent: Friday, November 2, 2012 2:19:40 PM
> Subject: Re: avg write io wait time regression in 1.2.1
> 
> Dietrich,
> 
> I can make two guesses about the increased disk writes.  But I am also willing 
> to review your actual LOG files to isolate the root cause.  If you could run 
> the following and post the resulting file from one server, I will review it 
> over the weekend or early next week:
> 
> sort /var/lib/riak/leveldb/*/LOG >LOG.all
> 
> The file will compress well.  And there is no need to stop the server; just 
> gather the LOG data live.
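> 
> For example (assuming gzip is available on the node):
> 
> # compress the combined LOG output before posting it
> gzip LOG.all        # produces LOG.all.gz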
> 
> Guess 1:  your data is in a transition phase.  1.1 used 2 Megabyte files 
> exclusively.  1.2 is resizing the files to much larger sizes during a 
> compaction.  You could be seeing a larger number of files than usual 
> participating in each compaction as the file sizes change.  While this is 
> possible, I have doubts … hence this is a guess.
> 
> Guess 2:  I increased the various leveldb file sizes to reduce the number of 
> opens and closes, both for writes and random reads.  This helped latencies in 
> both the compactions and random reads.  Any compaction in 1.2 is likely to 
> reread and write a larger total number of bytes.  While this is possible, I 
> again have doubts … the number of read operations should also go up if this 
> guess is correct, and your read operations have not increased.  This guess 
> might still be valid if the read operations were satisfied by the Linux memory 
> data cache.  I do not know how those would be counted or not counted.
> 
> 
> Matthew
> 
> 
> On Nov 1, 2012, at 10:01 PM, Dietrich Featherston wrote:
> 
> Will check on that.
> 
> Can you think of anything that would explain the 5x increase in disk writes 
> we are seeing with the same workload?
> 
> 
> On Thu, Nov 1, 2012 at 6:03 PM, Matthew Von-Maszewski <matth...@basho.com> 
> wrote:
> Look for any activity in the LOG.  Level-0 "creations" are fast and not 
> typically relevant.  You would be most interested in LOG lines containing 
> "Compacting" (start) and "Compacted" (end).  The time in between is when the 
> throttle applies.  The idea is that these compaction events can pile up, one 
> after another and multiple overlapping.  It is during these heavy periods that 
> the throttle saves the user experience.
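> 
> For example, one rough way to pull those events out of the combined file (a 
> sketch, assuming the LOG.all produced by the sort command above):
> 
> # show compaction start ("Compacting") and end ("Compacted") lines in order
> grep -E 'Compact(ing|ed)' LOG.all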
> 
> Matthew
> 
> 
> On Nov 1, 2012, at 8:54 PM, Dietrich Featherston wrote:
> 
> Thanks. The amortized stalls may very well describe what we are seeing. If I 
> combine leveldb logs from all partitions on one of the upgraded nodes, what 
> should I look for in terms of compaction activity to verify this?
> 
> 
> On Thu, Nov 1, 2012 at 5:48 PM, Matthew Von-Maszewski <matth...@basho.com> 
> wrote:
> Dietrich,
> 
> I can see your concern with the write IOs statistic.  Let me comment on the 
> easy question first:  block_size.
> 
> The block_size parameter in 1.1 was not getting passed to leveldb from the 
> Erlang layer.  You were using a 4096-byte block parameter no matter what you 
> typed in app.config.  The block_size is used by leveldb as a threshold.  Once 
> you accumulate enough data above that threshold, the current block is written 
> to disk and a new one started.  If you have 10k data values, you get one data 
> item per block and its size is ~10k.  If you have 1k data values, you get 
> about four per block and the block is about 4k.
> 
> We recommend 4k blocks to help read performance.  The entire block has to run 
> through decompression and potentially a CRC calculation when it comes off the 
> disk.  That CPU time really kills any disk performance gains from having 
> larger blocks.  Ok, that might change in 1.3 as we enable hardware CRC … but 
> only if you have "verify_checksums, true" in app.config.
> 
> 
> Back to performance:  I have not seen the change your graph details when 
> testing with SAS drives under moderate load.  I am only today starting 
> qualification tests with SSD drives.
> 
> But my 1.2 and 1.3 tests focus on drive / Riak saturation.  1.1 has the nasty 
> tendency to stall (intentionally) when we saturate the write side of leveldb.  
> The stall was measured in seconds or even minutes in 1.1.  1.2.1 has a write 
> throttle that forecasts leveldb's stall state and incrementally slows the 
> individual writes to prevent the stalls.  Maybe that is what is being seen in 
> the graph.  The only way to know for sure is to get a dump of your leveldb LOG 
> files, combine them, then compare compaction activity to your graph.
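> 
> One rough way to line the two up (a sketch, assuming the combined LOG.all file 
> and leveldb's usual YYYY/MM/DD-HH:MM timestamp prefix on each LOG line):
> 
> # count compaction starts per minute, for comparison against the munin graph
> grep Compacting LOG.all | cut -c1-16 | sort | uniq -c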
> 
> Write stalls are detailed here:   
> http://basho.com/blog/technical/2012/10/30/leveldb-in-riak-1p2/
> 
> How can I better assist you at this point?
> 
> Matthew
> 
> 
> On Nov 1, 2012, at 8:13 PM, Dietrich Featherston wrote:
> 
> We've just gone through the process of upgrading two riak clusters from 1.1  
> to 1.2.1. Both are on the leveldb backend backed by RAID0'd SSDs. The process 
> has gone smoothly and we see that latencies as measured at the gen_fsm level 
> are largely unaffected.
> 
> However, we are seeing some troubling disk statistics, and I'm looking for an 
> explanation before we upgrade the remainder of our nodes. The source of the 
> worry seems to be a huge amplification in the number of writes serviced by the 
> disk, which may be the cause of the rising IO wait times.
> 
> My first thought was that this could be due to some leveldb tuning in 1.2.1 
> which increases file sizes per the release notes 
> (https://github.com/basho/riak/blob/master/RELEASE-NOTES.md). But nodes that 
> were upgraded yesterday are still showing this symptom. I would have expected 
> any block re-writing to have subsided by now.
> 
> My next hypothesis has to do with block size overriding in app.config. In 1.1, 
> we had specified custom block sizes of 256k. We removed this prior to 
> upgrading to 1.2.1 on the advice of #riak, since block size configuration was 
> ignored prior to 1.2 ('"block_size" parameter within app.config for leveldb 
> was ignored.  This parameter is now properly passed to leveldb.' --> 
> https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105). 
> I'm wondering whether the block size parameter really was being passed to 
> leveldb and, having removed it, blocks are now being rewritten to a new size, 
> perhaps different from the one they were written with before 
> (https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155).
> 
> Here is the output of the following script showing the increased writes to 
> disk (https://gist.github.com/37319a8ed2679bb8b21d)
> 
> --an upgraded 1.2.1 node--
> read ios: 238406742
> write ios: 4814320281
> read/write ratio: .04952033
> avg wait: .10712340
> read wait: .49174364
> write wait: .42695475
> 
> 
> --a node still running 1.1--
> read ios: 267770032
> write ios: 944170656
> read/write ratio: .28360342
> avg wait: .34237204
> read wait: .47222371
> write wait: 1.83283749
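> 
> For reference, numbers like these can be derived from /proc/diskstats (a rough 
> sketch, assuming a device named sda; the actual gist may differ):
> 
> # $4 reads completed, $7 ms reading, $8 writes completed, $11 ms writing
> awk '$3 == "sda" {
>     printf "read ios: %d\nwrite ios: %d\n", $4, $8
>     printf "read/write ratio: %.8f\n", $4 / $8
>     printf "avg wait: %.8f\n", ($7 + $11) / ($4 + $8)
>     printf "read wait: %.8f\nwrite wait: %.8f\n", $7 / $4, $11 / $8
> }' /proc/diskstats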
> 
> And here's what munin is showing us in terms of avg io wait times.
> 
> <image.png>
> 
> 
> Any thoughts on what might explain this?
> 
> Thanks,
> D
> 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
