(Pasting offline email content for the mailing list's benefit.)

On Nov 2, 2012, at 4:20 PM, Dietrich Featherston wrote:
Thaaaaaaaaaaat's really interesting. I've also noticed that the sharp increase in disk use is either sub-linear or is tapering off. Good stuff.

On Fri, Nov 2, 2012 at 1:00 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:

Is there any chance that your input rate to leveldb / riak increased after upgrading to 1.2.1? I am not asking whether you intentionally increased the input (write) rate, but whether it happened to go up due to the changes in Riak and leveldb.

I ask because a 5x increase just happens to match the fact that 1.1 had ONE compaction thread and 1.2.1 has FIVE compaction threads (ok, more than five actually … but only five active at any one time). But this increased number of threads would only account for a 5x increase during a short transition period, unless 1.2.1 also happened to allow more data to flow into the system from your applications.

Matthew

On Nov 2, 2012, at 10:32 AM, Dietrich Featherston wrote:

Here's the level output from one of our upgraded nodes. Included our app.config as well. Will continue looking for clues. I can put together some snapshots of our sst file sizes across the cluster if you think that would help as well.

On Fri, Nov 2, 2012 at 6:19 AM, Matthew Von-Maszewski <matth...@basho.com> wrote:

Dietrich,

I can make two guesses about the increased disk writes. But I am also willing to review your actual LOG files to isolate the root cause. If you could run the following and post the resulting file from one server, I will review it over the weekend or early next week:

    sort /var/lib/riak/leveldb/*/LOG >LOG.all

The file will compress well. And no need to stop the server, just gather the LOG data live.

Guess 1: your data is in a transition phase. 1.1 used 2 megabyte files exclusively. 1.2 resizes the files to much larger sizes during a compaction. You could be seeing a larger number of files than usual participating in each compaction as the file sizes change. While this is possible, I have doubts … hence this is a guess.

Guess 2: I increased the various leveldb file sizes to reduce the number of opens and closes, both for writes and random reads. This helped latencies in both the compactions and random reads. Any compaction in 1.2 is likely to reread and write a larger total number of bytes. While this is possible, I again have doubts … the number of read operations should also go up if this guess is correct. Your read operations have not increased. This guess might still be valid if the read operations were satisfied by the Linux memory data cache; I do not know whether those would be counted or not.

Matthew

On Nov 1, 2012, at 10:01 PM, Dietrich Featherston wrote:

Will check on that.

Can you think of anything that would explain the 5x increase in disk writes we are seeing with the same workload?

On Thu, Nov 1, 2012 at 6:03 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:

Look for any activity in the LOG. Level-0 "creations" are fast and not typically relevant. You would be most interested in LOG lines containing "Compacting" (start) and "Compacted" (end). The time in between will throttle. The idea is that these compaction events can pile up, one after another and multiple overlapping. It is during these heavy times that the throttle saves the user experience.

Matthew
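A minimal sketch of pulling those compaction events out of the merged log, assuming LOG.all was built with the sort command Matthew gives above:

    grep -c "Compacting" LOG.all                        # compactions started
    grep -c "Compacted"  LOG.all                        # compactions finished
    grep -E "Compacting|Compacted" LOG.all | tail -20   # most recent events

Long runs of back-to-back start/end pairs, or many starts without a matching finish nearby, are the pile-up Matthew describes.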
On Nov 1, 2012, at 8:54 PM, Dietrich Featherston wrote:

Thanks. The amortized stalls may very well describe what we are seeing. If I combine leveldb logs from all partitions on one of the upgraded nodes, what should I look for in terms of compaction activity to verify this?

On Thu, Nov 1, 2012 at 5:48 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:

Dietrich,

I can see your concern with the write IOS statistic. Let me comment on the easy question first: block_size.

The block_size parameter in 1.1 was not getting passed to leveldb from the erlang layer. You were using a 4096 byte block parameter no matter what you typed in app.config. The block_size is used by leveldb as a threshold. Once you accumulate enough data above that threshold, the current block is written to disk and a new one is started. If you have 10k data values, you get one data item per block and its size is ~10k. If you have 1k data values, you get about four per block and the block is about 4k.

We recommend 4k blocks to help read performance. The entire block has to run through decompression and potentially CRC calculation when it comes off the disk. That CPU time really kills any disk performance gains from having larger blocks. Ok, that might change in 1.3 as we enable hardware CRC … but only if you have "verify_checksums, true" in app.config.

Back to performance: I have not seen the change your graph details when testing with SAS drives under moderate load. I am only today starting qualification tests with SSD drives.

But my 1.2 and 1.3 tests focus on drive / Riak saturation. 1.1 has the nasty tendency to stall (intentionally) when we saturate the write side of leveldb. The stall was measured in seconds or even minutes in 1.1. 1.2.1 has a write throttle that forecasts leveldb's stall state and incrementally slows the individual writes to prevent the stalls. Maybe that is what is being seen in the graph. The only way to know for sure is to get a dump of your leveldb LOG files, combine them, then compare compaction activity to your graph.

Write stalls are detailed here:
http://basho.com/blog/technical/2012/10/30/leveldb-in-riak-1p2/

How can I better assist you at this point?

Matthew
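One rough way to line compaction activity up against the munin graph, assuming the merged LOG.all from earlier and the stock leveldb timestamp prefix (YYYY/MM/DD-HH:MM:SS.uuuuuu) at the start of each line:

    # count compaction starts per minute
    grep "Compacting" LOG.all | awk '{ print substr($1, 1, 16) }' | sort | uniq -c

Minutes with many compaction starts should correspond to the elevated write and io-wait periods in the graph if the compaction/throttle explanation holds.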
On Nov 1, 2012, at 8:13 PM, Dietrich Featherston wrote:

We've just gone through the process of upgrading two riak clusters from 1.1 to 1.2.1. Both are on the leveldb backend backed by RAID0'd SSDs. The process has gone smoothly and we see that latencies as measured at the gen_fsm level are largely unaffected.

However, we are seeing some troubling disk statistics and I'm looking for an explanation before we upgrade the remainder of our nodes. The source of the worry seems to be a huge amplification in the number of writes serviced by the disk, which may be the cause of rising io wait times.

My first thought was that this could be due to some leveldb tuning in 1.2.1 which increases file sizes per the release notes (https://github.com/basho/riak/blob/master/RELEASE-NOTES.md). But nodes that were upgraded yesterday are still showing this symptom. I would have expected any block re-writing to have subsided by now.

The next hypothesis has to do with block size overriding in app.config. In 1.1, we had specified custom block sizes of 256k. We removed this prior to upgrading to 1.2.1 at the advice of #riak, since block size configuration was ignored prior to 1.2 ('"block_size" parameter within app.config for leveldb was ignored. This parameter is now properly passed to leveldb.' --> https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105). I'm wondering if the block size parameter really was being passed to leveldb and, having removed it, blocks are now being rewritten to a new size, perhaps different from what they were being written as before (https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155).

Here is the output of the following script showing the increased writes to disk (https://gist.github.com/37319a8ed2679bb8b21d):

--an upgraded 1.2.1 node--
read ios: 238406742
write ios: 4814320281
read/write ratio: .04952033
avg wait: .10712340
read wait: .49174364
write wait: .42695475

--a node still running 1.1--
read ios: 267770032
write ios: 944170656
read/write ratio: .28360342
avg wait: .34237204
read wait: .47222371
write wait: 1.83283749

And here's what munin is showing us in terms of avg io wait times.

<image.png>

Any thoughts on what might explain this?

Thanks,
D

<disk-latency-riak-1.2.1.tar.gz>
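The linked gist is not reproduced in the thread; a rough equivalent for producing those numbers, assuming the standard /proc/diskstats field layout (reads completed = field 4, ms reading = 7, writes completed = 8, ms writing = 11) and a device name of sda -- substitute the actual RAID0 device:

    awk '$3 == "sda" {
        printf "read ios: %d\n", $4
        printf "write ios: %d\n", $8
        printf "read/write ratio: %.8f\n", $4 / $8
        printf "avg wait: %.8f\n", ($7 + $11) / ($4 + $8)
        printf "read wait: %.8f\n", $7 / $4
        printf "write wait: %.8f\n", $11 / $8
    }' /proc/diskstats

In this sketch the wait figures are average milliseconds per request since boot; the absolute values matter less than the 1.1 vs 1.2.1 comparison.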
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com