The real Riak developers have arrived online for the day. They tell me that all of your problems are likely due to the extended upgrade times, and yes, there is a known issue with handoff between 1.3 and 1.4. They also say everything should calm down after all nodes are upgraded.
I will review your system settings now and see if there is something that might make the other machines upgrade quicker. So three more questions:

- what is the average size of your keys
- what is the average size of your value (data stored)
- in regular use, are your keys accessed randomly across their entire range, or do they contain a date component which clusters older, less used keys

Matthew

On Dec 11, 2013, at 8:43 AM, Simon Effenberg <seffenb...@team.mobile.de> wrote:

> Oh and at the moment they are waiting for some handoffs and I see
> errors in logfiles:
>
>
> 2013-12-11 13:41:47.948 UTC [error]
> <0.7157.24>@riak_core_handoff_sender:start_fold:269 hinted_handoff
> transfer of riak_kv_vnode from 'riak@10.46.109.202'
> 468137243207554840987117797979434404733540892672
>
> but I remember that somebody else had this as well and if I recall
> correctly it disappeared after the full upgrade was done.. but at the
> moment it's hard to think about upgrading everything at once..
> (~12hours 100% disk utilization on all 12 nodes will lead to real slow
> puts/gets)
>
> What can I do?
>
> Cheers
> Simon
>
> PS: transfers output:
>
> 'riak@10.46.109.202' waiting to handoff 17 partitions
> 'riak@10.46.109.201' waiting to handoff 19 partitions
>
> (these are the 1.4.2 nodes)
>
>
> On Wed, 11 Dec 2013 14:39:58 +0100
> Simon Effenberg <seffenb...@team.mobile.de> wrote:
>
>> Also some side notes:
>>
>> "top" is even better on new 1.4.2 than on 1.3.1 machines.. IO
>> utilization of disk is mostly the same (round about 33%)..
>>
>> but
>>
>> 95th percentile of response time for get (avg over all nodes):
>> before upgrade: 29ms
>> after upgrade: almost the same
>>
>> 95th percentile of response time for put (avg over all nodes):
>> before upgrade: 60ms
>> after upgrade: 1548ms
>> but this is only because of 2 of 12 nodes are
>> on 1.4.2 and are really slow (17000ms)
>>
>> Cheers,
>> Simon
>>
>> On Wed, 11 Dec 2013 13:45:56 +0100
>> Simon Effenberg <seffenb...@team.mobile.de> wrote:
>>
>>> Sorry I forgot the half of it..
>>>
>>> seffenberg@kriak46-1:~$ free -m
>>>                     total   used   free  shared  buffers  cached
>>> Mem:                23999  23759    239       0      184   16183
>>> -/+ buffers/cache:          7391  16607
>>> Swap:                   0      0      0
>>>
>>> We have 12 servers..
>>> datadir on the compacted servers (1.4.2) ~ 765 GB
>>>
>>> AAE is enabled.
>>>
>>> I attached app.config and vm.args.
>>>
>>> Cheers
>>> Simon
>>>
>>> On Wed, 11 Dec 2013 07:33:31 -0500
>>> Matthew Von-Maszewski <matth...@basho.com> wrote:
>>>
>>>> Ok, I am now suspecting that your servers are either using swap space
>>>> (which is slow) or your leveldb file cache is thrashing (opening and
>>>> closing multiple files per request).
>>>>
>>>> How many servers do you have and do you use Riak's active anti-entropy
>>>> feature? I am going to plug all of this into a spreadsheet.
>>>>
>>>> Matthew Von-Maszewski
>>>>
>>>>
>>>> On Dec 11, 2013, at 7:09, Simon Effenberg <seffenb...@team.mobile.de>
>>>> wrote:
>>>>
>>>>> Hi Matthew
>>>>>
>>>>> Memory: 23999 MB
>>>>>
>>>>> ring_creation_size, 256
>>>>> max_open_files, 100
>>>>>
>>>>> riak-admin status:
>>>>>
>>>>> memory_total : 276001360
>>>>> memory_processes : 191506322
>>>>> memory_processes_used : 191439568
>>>>> memory_system : 84495038
>>>>> memory_atom : 686993
>>>>> memory_atom_used : 686560
>>>>> memory_binary : 21965352
>>>>> memory_code : 11332732
>>>>> memory_ets : 10823528
>>>>>
>>>>> Thanks for looking!
>>>>>
>>>>> Cheers
>>>>> Simon
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 11 Dec 2013 06:44:42 -0500
>>>>> Matthew Von-Maszewski <matth...@basho.com> wrote:
>>>>>
>>>>>> I need to ask other developers as they arrive for the new day. Does not
>>>>>> make sense to me.
>>>>>>
>>>>>> How many nodes do you have? How much RAM do you have in each node?
>>>>>> What are your settings for max_open_files and cache_size in the
>>>>>> app.config file? Maybe this is as simple as leveldb using too much RAM
>>>>>> in 1.4. The memory accounting for max_open_files changed in 1.4.
>>>>>>
>>>>>> Matthew Von-Maszewski
>>>>>>
>>>>>>
>>>>>> On Dec 11, 2013, at 6:28, Simon Effenberg <seffenb...@team.mobile.de>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Matthew,
>>>>>>>
>>>>>>> it took around 11hours for the first node to finish the compaction. The
>>>>>>> second node is running already 12 hours and is still doing compaction.
>>>>>>>
>>>>>>> Besides that I wonder because the fsm_put time on the new 1.4.2 host is
>>>>>>> much higher (after the compaction) than on an old 1.3.1 (both are
>>>>>>> running in the cluster right now and another one is doing the
>>>>>>> compaction/upgrade while it is in the cluster but not directly
>>>>>>> accessible because it is out of the Loadbalancer):
>>>>>>>
>>>>>>> 1.4.2:
>>>>>>>
>>>>>>> node_put_fsm_time_mean : 2208050
>>>>>>> node_put_fsm_time_median : 39231
>>>>>>> node_put_fsm_time_95 : 17400382
>>>>>>> node_put_fsm_time_99 : 50965752
>>>>>>> node_put_fsm_time_100 : 59537762
>>>>>>> node_put_fsm_active : 5
>>>>>>> node_put_fsm_active_60s : 364
>>>>>>> node_put_fsm_in_rate : 5
>>>>>>> node_put_fsm_out_rate : 3
>>>>>>> node_put_fsm_rejected : 0
>>>>>>> node_put_fsm_rejected_60s : 0
>>>>>>> node_put_fsm_rejected_total : 0
>>>>>>>
>>>>>>>
>>>>>>> 1.3.1:
>>>>>>>
>>>>>>> node_put_fsm_time_mean : 5036
>>>>>>> node_put_fsm_time_median : 1614
>>>>>>> node_put_fsm_time_95 : 8789
>>>>>>> node_put_fsm_time_99 : 38258
>>>>>>> node_put_fsm_time_100 : 384372
>>>>>>>
>>>>>>>
>>>>>>> any clue why this could/should be?
>>>>>>>
>>>>>>> Cheers
>>>>>>> Simon
>>>>>>>
>>>>>>> On Tue, 10 Dec 2013 17:21:07 +0100
>>>>>>> Simon Effenberg <seffenb...@team.mobile.de> wrote:
>>>>>>>
>>>>>>>> Hi Matthew,
>>>>>>>>
>>>>>>>> thanks!.. that answers my questions!
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Simon
>>>>>>>>
>>>>>>>> On Tue, 10 Dec 2013 11:08:32 -0500
>>>>>>>> Matthew Von-Maszewski <matth...@basho.com> wrote:
>>>>>>>>
>>>>>>>>> 2i is not my expertise, so I had to discuss your concerns with another
>>>>>>>>> Basho developer. He says:
>>>>>>>>>
>>>>>>>>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk
>>>>>>>>> format. You must wait for all nodes to update if you desire to use
>>>>>>>>> the new 2i query. The 2i data will properly write/update on both 1.3
>>>>>>>>> and 1.4 machines during the migration.
>>>>>>>>>
>>>>>>>>> Does that answer your question?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> And yes, you might see available disk space increase during the
>>>>>>>>> upgrade compactions if your dataset contains numerous delete
>>>>>>>>> "tombstones". The Riak 2.0 code includes a new feature called
>>>>>>>>> "aggressive delete" for leveldb. This feature is more proactive in
>>>>>>>>> pushing delete tombstones through the levels to free up disk space
>>>>>>>>> much more quickly (especially if you perform block deletes every now
>>>>>>>>> and then).
>>>>>>>>>
>>>>>>>>> Matthew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg
>>>>>>>>> <seffenb...@team.mobile.de> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Matthew,
>>>>>>>>>>
>>>>>>>>>> see inline..
>>>>>>>>>>
>>>>>>>>>> On Tue, 10 Dec 2013 10:38:03 -0500
>>>>>>>>>> Matthew Von-Maszewski <matth...@basho.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The sad truth is that you are not the first to see this problem.
>>>>>>>>>>> And yes, it has to do with your 950GB per node dataset. And no,
>>>>>>>>>>> nothing to do but sit through it at this time.
>>>>>>>>>>>
>>>>>>>>>>> While I did extensive testing around upgrade times before shipping
>>>>>>>>>>> 1.4, apparently there are data configurations I did not anticipate.
>>>>>>>>>>> You are likely seeing a cascade where a shift of one file from
>>>>>>>>>>> level-1 to level-2 is causing a shift of another file from level-2
>>>>>>>>>>> to level-3, which causes a level-3 file to shift to level-4, etc …
>>>>>>>>>>> then the next file shifts from level-1.
>>>>>>>>>>>
>>>>>>>>>>> The bright side of this pain is that you will end up with better
>>>>>>>>>>> write throughput once all the compaction ends.
>>>>>>>>>>
>>>>>>>>>> I have to deal with that.. but my problem is now, if I'm doing this
>>>>>>>>>> node by node it looks like 2i searches aren't possible while 1.3 and
>>>>>>>>>> 1.4 nodes exist in the cluster. Is there any problem which leads me
>>>>>>>>>> to a 2i repair marathon or could I easily wait for some hours for each
>>>>>>>>>> node until all merges are done before I upgrade the next one? (2i
>>>>>>>>>> searches can fail for some time.. the APP isn't having problems with
>>>>>>>>>> that but are new inserts with 2i indices processed successfully or do
>>>>>>>>>> I have to do the 2i repair?)
>>>>>>>>>>
>>>>>>>>>> /s
>>>>>>>>>>
>>>>>>>>>> one other good thing: saving disk space is one advantage ;)..
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Riak 2.0's leveldb has code to prevent/reduce compaction cascades,
>>>>>>>>>>> but that is not going to help you today.
>>>>>>>>>>>
>>>>>>>>>>> Matthew
>>>>>>>>>>>
>>>>>>>>>>> On Dec 10, 2013, at 10:26 AM, Simon Effenberg
>>>>>>>>>>> <seffenb...@team.mobile.de> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi @list,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2 .. after
>>>>>>>>>>>> upgrading the first node (out of 12) this node seems to do many
>>>>>>>>>>>> merges.
>>>>>>>>>>>> the sst_* directories change in size "rapidly" and the node is
>>>>>>>>>>>> having a disk utilization of 100% all the time.
>>>>>>>>>>>>
>>>>>>>>>>>> I know that there is something like that:
>>>>>>>>>>>>
>>>>>>>>>>>> "The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x dataset
>>>>>>>>>>>> will initiate an automatic conversion that could pause the startup of
>>>>>>>>>>>> each node by 3 to 7 minutes. The leveldb data in "level #1" is being
>>>>>>>>>>>> adjusted such that "level #1" can operate as an overlapped data level
>>>>>>>>>>>> instead of as a sorted data level. The conversion is simply the
>>>>>>>>>>>> reduction of the number of files in "level #1" to being less than eight
>>>>>>>>>>>> via normal compaction of data from "level #1" into "level #2". This is
>>>>>>>>>>>> a one time conversion."
>>>>>>>>>>>>
>>>>>>>>>>>> but it looks much more invasive than explained here or doesn't have to
>>>>>>>>>>>> do anything with the (probably seen) merges.
>>>>>>>>>>>>
>>>>>>>>>>>> Is this "normal" behavior or could I do anything about it?
>>>>>>>>>>>>
>>>>>>>>>>>> At the moment I'm stuck with the upgrade procedure because this high
>>>>>>>>>>>> IO load would probably lead to high response times.
>>>>>>>>>>>>
>>>>>>>>>>>> Also we have a lot of data (per node ~950 GB).
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Simon
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> riak-users mailing list
>>>>>>>>>>>> riak-users@lists.basho.com
>>>>>>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Simon Effenberg | Site Ops Engineer | mobile.international GmbH
>>>>>>>>>> Fon: + 49-(0)30-8109 - 7173
>>>>>>>>>> Fax: + 49-(0)30-8109 - 7131
>>>>>>>>>>
>>>>>>>>>> Mail: seffenb...@team.mobile.de
>>>>>>>>>> Web: www.mobile.de
>>>>>>>>>>
>>>>>>>>>> Marktplatz 1 | 14532 Europarc Dreilinden | Germany
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Geschäftsführer: Malte Krüger
>>>>>>>>>> HRB Nr.: 18517 P, Amtsgericht Potsdam
>>>>>>>>>> Sitz der Gesellschaft: Kleinmachnow
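
A reading aid for the put FSM figures quoted in this thread: Riak reports the node_put_fsm_time_* statistics in microseconds, so the 1.4.2 node's 95th percentile of 17400382 corresponds to the roughly 17000ms Simon mentions. The sketch below only converts and compares the numbers already posted above. The dictionary and function names (PUT_FSM_US, to_ms, slowdown) are made up for illustration and are not part of Riak or riak-admin.

```python
# Put FSM timings copied from the riak-admin status output quoted in the thread.
# Assumption: the values are microseconds, as Riak documents for node_put_fsm_time_*.
PUT_FSM_US = {
    "1.4.2": {"mean": 2208050, "median": 39231, "95": 17400382,
              "99": 50965752, "100": 59537762},
    "1.3.1": {"mean": 5036, "median": 1614, "95": 8789,
              "99": 38258, "100": 384372},
}


def to_ms(microseconds):
    """Convert a microsecond value to milliseconds."""
    return microseconds / 1000.0


def slowdown(stat):
    """Ratio of the 1.4.2 value to the 1.3.1 value for one statistic."""
    return PUT_FSM_US["1.4.2"][stat] / PUT_FSM_US["1.3.1"][stat]


if __name__ == "__main__":
    # Print both nodes side by side in milliseconds, plus the ratio.
    for stat in ("mean", "median", "95", "99", "100"):
        new_ms = to_ms(PUT_FSM_US["1.4.2"][stat])
        old_ms = to_ms(PUT_FSM_US["1.3.1"][stat])
        print(f"{stat:>6}: 1.4.2 {new_ms:10.1f} ms | "
              f"1.3.1 {old_ms:8.1f} ms | x{slowdown(stat):.0f}")
```

Comparing the median as well as the upper percentiles shows that the slowdown on the upgraded node is not limited to a few outliers, although the tail is hit far harder than the median.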