We've just gone through the process of upgrading two Riak clusters from 1.1 to 1.2.1. Both run the LevelDB backend on RAID0'd SSDs. The process has gone smoothly, and latencies as measured at the gen_fsm level are largely unaffected.
However, we are seeing some troubling disk statistics, and I'm looking for an explanation before we upgrade the remainder of our nodes. The source of the worry is a huge amplification in the number of writes serviced by the disk, which may be the cause of rising IO wait times.

My first thought was that this could be due to LevelDB tuning in 1.2.1 which increases file sizes, per the release notes (https://github.com/basho/riak/blob/master/RELEASE-NOTES.md). But nodes that were upgraded yesterday are still showing this symptom, and I would have expected any block re-writing to have subsided by now.

My next hypothesis has to do with the block size override in app.config. In 1.1, we had specified a custom block size of 256k. We removed this prior to upgrading to 1.2.1 at the advice of #riak, since the block size configuration was ignored prior to 1.2 ('"block_size" parameter within app.config for leveldb was ignored. This parameter is now properly passed to leveldb.' -- https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105). I'm wondering whether the block size parameter really was being passed to LevelDB after all, and whether, having removed it, blocks are now being rewritten to a new size, perhaps different from what they were being written as before (https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155). A sketch of the override we removed is appended below.

Here is the output of the script at https://gist.github.com/37319a8ed2679bb8b21d showing the increased writes to disk (an approximate reconstruction of the script is also appended below).

An upgraded 1.2.1 node:

    read ios:          238406742
    write ios:         4814320281
    read/write ratio:  .04952033
    avg wait:          .10712340
    read wait:         .49174364
    write wait:        .42695475

A node still running 1.1:

    read ios:          267770032
    write ios:         944170656
    read/write ratio:  .28360342
    avg wait:          .34237204
    read wait:         .47222371
    write wait:        1.83283749

And here's what munin is showing us in terms of average IO wait times:

[munin graph: average IO wait times, image omitted]

Any thoughts on what might explain this?

Thanks,
D
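For concreteness, here is roughly what the override we removed looked like in app.config. This is a sketch rather than our exact config; in particular the data_root path is illustrative:

    %% eleveldb section of app.config as it stood under 1.1
    {eleveldb, [
        {data_root, "/var/lib/riak/leveldb"},  %% illustrative path, not our actual layout
        {block_size, 262144}                   %% 256k override, removed before the 1.2.1 upgrade
    ]},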
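For readers who don't want to follow the gist link, numbers like those above can be pulled from /proc/diskstats along these lines. This is a sketch, not the gist verbatim; it assumes the device is sda and the standard /proc/diskstats field layout ($4 = reads completed, $7 = ms spent reading, $8 = writes completed, $11 = ms spent writing), and it leaves out the avg wait computation:

    #!/bin/sh
    # Per-device read/write counts and per-io wait times from /proc/diskstats.
    awk '$3 == "sda" {
        printf "read ios: %d\n", $4
        printf "write ios: %d\n", $8
        printf "read/write ratio: %.8f\n", $4 / $8
        printf "read wait: %.8f\n", $7 / $4
        printf "write wait: %.8f\n", $11 / $8
    }' /proc/diskstats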
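One way to test the block re-writing theory would be to check whether compactions are still churning on the upgraded nodes. A sketch, assuming eleveldb keeps per-partition LevelDB LOG files under the data_root and that LevelDB writes its usual "Compacting" lines there:

    # Count compaction events per partition LOG; the path assumes the
    # default /var/lib/riak/leveldb data_root -- adjust to match app.config.
    grep -c 'Compacting' /var/lib/riak/leveldb/*/LOG | sort -t: -k2 -rn | head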