I would like to cast my vote for having compactions exposed by "riak-admin". 

We can already observe hinted handoff transfers. It would be very beneficial to 
watch compactions in real time, too. 

Log scraping carries too many potential headaches. 

----- Original Message -----

From: "Matthew Von-Maszewski" <matth...@basho.com> 
To: "Dietrich Featherston" <d...@d2fn.com> 
Cc: riak-users@lists.basho.com 
Sent: Friday, November 2, 2012 2:19:40 PM 
Subject: Re: avg write io wait time regression in 1.2.1 

Dietrich, 


I can make two guesses about the increased disk writes, but I am also willing to 
review your actual LOG files to isolate the root cause. If you could run the 
following and post the resulting file from one server, I will review it over 
the weekend or early next week: 


sort /var/lib/riak/leveldb/*/LOG >LOG.all 


The file will compress well, and there is no need to stop the server; just gather 
the LOG data live. 


Guess 1: your data is in a transition phase. 1.1 used 2 Megabyte files 
exclusively. 1.2 is resizing the files to much larger sizes during a 
compaction. You could be seeing a larger number of files than usual 
participating in each compaction as the file sizes change. While this is 
possible, I have doubts … hence this is a guess. 


Guess 2: I increased the various leveldb file sizes to reduce the number of 
opens and closes, both for writes and random reads. This helped latencies in 
both the compactions and random reads. Any compaction in 1.2 is likely to 
reread and write a larger total number of bytes. While this is possible, I again 
have doubts … the number of read operations should also go up if this guess is 
correct, and your read operations have not increased. This guess might still be 
valid if the read operations were satisfied by the Linux memory data cache; I 
do not know whether those would be counted. 




Matthew 





On Nov 1, 2012, at 10:01 PM, Dietrich Featherston wrote: 



Will check on that. 

Can you think of anything that would explain the 5x increase in disk writes we 
are seeing with the same workload? 



On Thu, Nov 1, 2012 at 6:03 PM, Matthew Von-Maszewski <matth...@basho.com> wrote: 


Look for any activity in the LOG. Level-0 "creations" are fast and not 
typically relevant. You would be most interested in LOG lines containing 
"Compacting" (start) and "Compacted" (end); the time in between is when the 
throttle engages. The idea is that these compaction events can pile up, one 
after another, with multiple overlapping. It is during these heavy periods that 
the throttle saves the user experience. 
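
If it helps, here is a rough Python sketch of the kind of scan I have in mind, 
assuming the per-vnode LOG files have been combined into a single file (LOG.all 
below); the exact wording of the LOG lines can differ between leveldb versions, 
so treat the string matching as an approximation: 

import sys
from collections import Counter

# Count compaction start/end markers in a combined leveldb LOG file.
# "Compacting" marks the start of a compaction, "Compacted" the end;
# the exact line format is an assumption and may vary by version.
counts = Counter()
path = sys.argv[1] if len(sys.argv) > 1 else "LOG.all"
for line in open(path):
    if "Compacting" in line:
        counts["started"] += 1
    elif "Compacted" in line:
        counts["finished"] += 1

print("compactions started:  %d" % counts["started"])
print("compactions finished: %d" % counts["finished"])
print("started but not finished (overlapping or truncated log): %d"
      % (counts["started"] - counts["finished"]))

If the "started but not finished" number stays high while your write IO climbs, 
that would line up with compactions piling up and overlapping. 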


Matthew 







On Nov 1, 2012, at 8:54 PM, Dietrich Featherston wrote: 

Thanks. The amortized stalls may very well describe what we are seeing. If I 
combine leveldb logs from all partitions on one of the upgraded nodes, what 
should I look for in terms of compaction activity to verify this? 



On Thu, Nov 1, 2012 at 5:48 PM, Matthew Von-Maszewski <matth...@basho.com> wrote: 



Dietrich, 


I can see your concern with the write IOS statistic. Let me comment on the easy 
question first: block_size. 


The block_size parameter in 1.1 was not getting passed to leveldb from the 
Erlang layer, so you were using a 4096-byte block parameter no matter what you 
typed in app.config. The block_size is used by leveldb as a threshold: once 
you accumulate enough data to cross that threshold, the current block is written 
to disk and a new one is started. If you have 10k data values, you get one data 
item per block and its size is ~10k. If you have 1k data values, you get about 
four per block and the block is about 4k. 
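
To illustrate the threshold behavior, here is a toy model in Python; it is only 
an illustration of the packing rule described above, not leveldb's actual code: 

# Toy model of the block_size threshold: values accumulate in the current
# block until the threshold is crossed, then the block is flushed.
def pack_blocks(value_sizes, block_size=4096):
    blocks, current = [], 0
    for size in value_sizes:
        current += size
        if current >= block_size:   # threshold crossed: write block, start new one
            blocks.append(current)
            current = 0
    if current:
        blocks.append(current)
    return blocks

print(pack_blocks([10 * 1024] * 4))  # -> [10240, 10240, 10240, 10240]: one ~10k value per block
print(pack_blocks([1024] * 8))       # -> [4096, 4096]: about four 1k values per ~4k block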


We recommend 4k blocks to help read performance. The entire block has to run 
through decompression and potentially a CRC calculation when it comes off the 
disk, and that CPU time really kills any disk performance gains from larger 
blocks. OK, that might change in 1.3 as we enable hardware CRC … but only if 
you have "verify_checksums, true" in app.config. 




Back to performance: I have not seen the change your graph details when testing 
with SAS drives under moderate load. I am only today starting qualification 
tests with SSD drives. 


But my 1.2 and 1.3 tests focus on drive / Riak saturation. 1.1 has the nasty 
tendency to stall (intentionally) when we saturate the write side of leveldb; 
the stall was measured in seconds or even minutes in 1.1. 1.2.1 has a write 
throttle that forecasts leveldb's stall state and incrementally slows the 
individual writes to prevent the stalls. Maybe that is what is being seen in 
the graph. The only way to know for sure is to get a dump of your leveldb LOG 
files, combine them, then compare compaction activity to your graph. 
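
For intuition only, the throttle behaves conceptually like the Python sketch 
below. The thresholds and delay values are made up for illustration; this is 
not the actual 1.2.1 throttle code: 

# Conceptual sketch of an incremental write throttle: as the compaction
# backlog grows, each write is delayed a little more, so throughput degrades
# smoothly instead of hitting a multi-second stall. Limits are hypothetical.
def write_delay_ms(pending_compaction_bytes,
                   soft_limit=512 * 1024 * 1024,
                   hard_limit=2 * 1024 * 1024 * 1024,
                   max_delay_ms=100.0):
    if pending_compaction_bytes <= soft_limit:
        return 0.0                    # little backlog, no delay
    if pending_compaction_bytes >= hard_limit:
        return max_delay_ms           # cap the delay instead of stalling
    frac = (pending_compaction_bytes - soft_limit) / float(hard_limit - soft_limit)
    return max_delay_ms * frac        # ramp the delay up smoothly

for backlog_mb in (100, 600, 1200, 2500):
    print("%d MB backlog -> %.1f ms per write"
          % (backlog_mb, write_delay_ms(backlog_mb * 1024 * 1024)))

The point is only that the slowdown is spread across many individual writes 
rather than concentrated in one long stall. 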


Write stalls are detailed here: 
http://basho.com/blog/technical/2012/10/30/leveldb-in-riak-1p2/ 


How can I better assist you at this point? 


Matthew 






On Nov 1, 2012, at 8:13 PM, Dietrich Featherston wrote: 



We've just gone through the process of upgrading two riak clusters from 1.1 to 
1.2.1. Both are on the leveldb backend backed by RAID0'd SSDs. The process has 
gone smoothly and we see that latencies as measured at the gen_fsm level are 
largely unaffected. 


However, we are seeing some troubling disk statistics, and I'm looking for an 
explanation before we upgrade the remainder of our nodes. The source of the 
worry seems to be a huge amplification in the number of writes serviced by the 
disk, which may be the cause of the rising IO wait times. 


My first thought was that this could be due to some leveldb tuning in 1.2.1 
which increases file sizes per the release notes ( 
https://github.com/basho/riak/blob/master/RELEASE-NOTES.md ). But nodes that 
were upgraded yesterday are still showing this symptom. I would have expected 
any block re-writing to have subsided by now. 


My next hypothesis has to do with block size overriding in app.config. In 1.1, we 
had specified custom block sizes of 256k. We removed this prior to upgrading to 
1.2.1 on the advice of #riak, since block size configuration was ignored prior 
to 1.2 ('"block_size" parameter within app.config for leveldb was ignored. This 
parameter is now properly passed to leveldb.' --> 
https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105 ). 
I'm wondering if the block size parameter really was being passed to leveldb 
after all and, having removed it, blocks are now being rewritten to a new size, 
perhaps different from what they were being written as before 
( https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155 ). 


Here is the output of the following script showing the increased writes to disk 
( https://gist.github.com/37319a8ed2679bb8b21d ) 



--an upgraded 1.2.1 node-- 
read ios: 238406742 
write ios: 4814320281 
read/write ratio: .04952033 
avg wait: .10712340 
read wait: .49174364 
write wait: .42695475 




--a node still running 1.1-- 
read ios: 267770032 
write ios: 944170656 
read/write ratio: .28360342 
avg wait: .34237204 
read wait: .47222371 
write wait: 1.83283749 
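
(For reference, the numbers above reduce to roughly the arithmetic below over 
/proc/diskstats. This Python version is my approximation of the shell script in 
the gist, not the script itself, and the device name is a placeholder.) 

# Approximate the gist's stats from /proc/diskstats.
# Per-device fields after major, minor, name:
#   [0] reads completed   [3] ms spent reading
#   [4] writes completed  [7] ms spent writing
def disk_stats(device="sda"):   # device name is a placeholder; adjust as needed
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                fields = [int(x) for x in parts[3:]]
                reads, read_ms = fields[0], fields[3]
                writes, write_ms = fields[4], fields[7]
                print("read ios: %d" % reads)
                print("write ios: %d" % writes)
                print("read/write ratio: %.8f" % (reads / float(writes)))
                print("avg wait: %.8f" % ((read_ms + write_ms) / float(reads + writes)))
                print("read wait: %.8f" % (read_ms / float(reads)))
                print("write wait: %.8f" % (write_ms / float(writes)))
                return

disk_stats("sda")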


And here's what Munin is showing us in terms of average IO wait times. 


[attached image: Munin graph of average IO wait times] 





Any thoughts on what might explain this? 


Thanks, 
D 





_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
