We've just upgraded two Riak clusters from 1.1 to 1.2.1. Both use the
leveldb backend on RAID0'd SSDs. The upgrade went smoothly, and latencies
as measured at the gen_fsm level are largely unaffected.

However, we are seeing some troubling disk statistics, and I'm looking for
an explanation before we upgrade the remainder of our nodes. The source of
the worry is what looks like a huge amplification in the number of writes
serviced by the disk, which may be the cause of rising I/O wait times.

My first thought was that this could be due to some leveldb tuning in 1.2.1
which increases file sizes per the release notes (
https://github.com/basho/riak/blob/master/RELEASE-NOTES.md). But nodes that
were upgraded yesterday are still showing this symptom. I would have
expected any block re-writing to have subsided by now.

My next hypothesis has to do with the block size override in app.config.
In 1.1, we had specified a custom block size of 256k. We removed this
before upgrading to 1.2.1 on the advice of #riak, since block size
configuration was ignored prior to 1.2 ('"block_size" parameter within
app.config for leveldb was ignored.  This parameter is now properly passed
to leveldb.', see
https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105).
I'm wondering whether the block size parameter really was being passed to
leveldb in 1.1 after all, so that, with the override removed, blocks are
now being rewritten at a new size, different from the one they were
originally written with (
https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155
).

Here is the output of this script (https://gist.github.com/37319a8ed2679bb8b21d),
showing the increased writes to disk:

--an upgraded 1.2.1 node--
read ios: 238406742
write ios: 4814320281
read/write ratio: .04952033
avg wait: .10712340
read wait: .49174364
write wait: .42695475


--a node still running 1.1--
read ios: 267770032
write ios: 944170656
read/write ratio: .28360342
avg wait: .34237204
read wait: .47222371
write wait: 1.83283749
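
For anyone who can't reach the gist, here is a rough Python sketch of how
figures like these could be derived from /proc/diskstats on Linux. It is not
the gist itself: the function names, the device name, and the exact
definitions of the ratio and the wait values (milliseconds spent per
completed I/O) are my assumptions and may not match what the script computes.
Because the kernel counters are cumulative since boot, the sketch also
includes a helper for taking the difference between two snapshots, which
makes nodes with different uptimes easier to compare.

```python
from typing import Dict

# Columns of /proc/diskstats that follow (major, minor, device name).
# Times are in milliseconds; counters are cumulative since boot.
FIELDS = (
    "reads", "reads_merged", "sectors_read", "ms_reading",
    "writes", "writes_merged", "sectors_written", "ms_writing",
)


def parse_diskstats(text: str, device: str) -> Dict[str, int]:
    """Return the read/write counters for one device from diskstats text."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 11 and parts[2] == device:
            return dict(zip(FIELDS, (int(p) for p in parts[3:11])))
    raise KeyError(f"device {device!r} not found")


def delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Counters accumulated between two snapshots of the same device."""
    return {name: after[name] - before[name] for name in FIELDS}


def summarize(stats: Dict[str, int]) -> Dict[str, float]:
    """Assumed definitions: ratio = reads / writes, wait = ms per I/O."""
    reads, writes = stats["reads"], stats["writes"]
    total = reads + writes
    busy_ms = stats["ms_reading"] + stats["ms_writing"]
    return {
        "read_ios": reads,
        "write_ios": writes,
        "read_write_ratio": reads / writes if writes else float("inf"),
        "avg_wait_ms": busy_ms / total if total else 0.0,
        "read_wait_ms": stats["ms_reading"] / reads if reads else 0.0,
        "write_wait_ms": stats["ms_writing"] / writes if writes else 0.0,
    }
```

To use it, read /proc/diskstats twice some interval apart, call
parse_diskstats on each snapshot with the device backing the leveldb data
directory (a placeholder such as "sda"), and pass delta(before, after) to
summarize.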

And here's what munin is showing us in terms of avg io wait times.

[image: Inline image 1]


Any thoughts on what might explain this?

Thanks,
D
