Re: Upgrade from 1.3.1 to 1.4.2 => high IO

Matthew Von-Maszewski Wed, 11 Dec 2013 06:18:29 -0800

The real Riak developers have arrived on-line for the day.  They are telling me 
that all of your problems are likely due to the extended upgrade times, and yes 
there is a known issue with handoff between 1.3 and 1.4.  They also say 
everything should calm down after all nodes are upgraded.


I will review your system settings now and see if there is something that might 
make the other machines upgrade quicker.  So three more questions:

- What is the average size of your keys?

- What is the average size of your values (data stored)?

- In regular use, are your keys accessed randomly across their entire range, or 
do they contain a date component which clusters older, less used keys?

Matthew


On Dec 11, 2013, at 8:43 AM, Simon Effenberg <seffenb...@team.mobile.de> wrote:

> Oh and at the moment they are waiting for some handoffs and I see
> errors in logfiles:
> 
> 
> 2013-12-11 13:41:47.948 UTC [error]
> <0.7157.24>@riak_core_handoff_sender:start_fold:269 hinted_handoff
> transfer of riak_kv_vnode from 'riak@10.46.109.202'
> 468137243207554840987117797979434404733540892672
> 
> but I remember that somebody else had this as well and if I recall
> correctly it disappeared after the full upgrade was done.. but at the
> moment it's hard to think about upgrading everything at once..
> (~12hours 100% disk utilization on all 12 nodes will lead to real slow
> puts/gets)
> 
> What can I do?
> 
> Cheers
> Simon
> 
> PS: transfers output:
> 
> 'riak@10.46.109.202' waiting to handoff 17 partitions
> 'riak@10.46.109.201' waiting to handoff 19 partitions
> 
> (these are the 1.4.2 nodes)
> 
> 
> On Wed, 11 Dec 2013 14:39:58 +0100
> Simon Effenberg <seffenb...@team.mobile.de> wrote:
> 
>> Also some side notes:
>> 
>> "top" is even better on new 1.4.2 than on 1.3.1 machines.. IO
>> utilization of disk is mostly the same (round about 33%)..
>> 
>> but
>> 
>> 95th percentile of response time for get (avg over all nodes):
>>  before upgrade: 29ms
>>  after upgrade: almost the same
>> 
>> 95th percentile of response time for put (avg over all nodes):
>>  before upgrade: 60ms
>>  after upgrade: 1548ms 
>>    but this is only because 2 of the 12 nodes are
>>    on 1.4.2 and are really slow (17000ms)
>> 
>> Cheers,
>> Simon
>> 
>> On Wed, 11 Dec 2013 13:45:56 +0100
>> Simon Effenberg <seffenb...@team.mobile.de> wrote:
>> 
>>> Sorry, I forgot half of it..
>>> 
>>> seffenberg@kriak46-1:~$ free -m
>>>              total       used       free     shared    buffers     cached
>>> Mem:         23999      23759        239          0        184      16183
>>> -/+ buffers/cache:       7391      16607
>>> Swap:            0          0          0
>>> 
>>> We have 12 servers..
>>> datadir on the compacted servers (1.4.2) ~ 765 GB
>>> 
>>> AAE is enabled.
>>> 
>>> I attached app.config and vm.args.
>>> 
>>> Cheers
>>> Simon
>>> 
>>> On Wed, 11 Dec 2013 07:33:31 -0500
>>> Matthew Von-Maszewski <matth...@basho.com> wrote:
>>> 
>>>> Ok, I am now suspecting that your servers are either using swap space 
>>>> (which is slow) or your leveldb file cache is thrashing (opening and 
>>>> closing multiple files per request).
>>>> 
>>>> How many servers do you have and do you use Riak's active anti-entropy 
>>>> feature?  I am going to plug all of this into a spreadsheet. 
>>>> 
>>>> Matthew Von-Maszewski
>>>> 
>>>> 
>>>> On Dec 11, 2013, at 7:09, Simon Effenberg <seffenb...@team.mobile.de> 
>>>> wrote:
>>>> 
>>>>> Hi Matthew
>>>>> 
>>>>> Memory: 23999 MB
>>>>> 
>>>>> ring_creation_size, 256
>>>>> max_open_files, 100
>>>>> 
>>>>> riak-admin status:
>>>>> 
>>>>> memory_total : 276001360
>>>>> memory_processes : 191506322
>>>>> memory_processes_used : 191439568
>>>>> memory_system : 84495038
>>>>> memory_atom : 686993
>>>>> memory_atom_used : 686560
>>>>> memory_binary : 21965352
>>>>> memory_code : 11332732
>>>>> memory_ets : 10823528
>>>>> 
>>>>> Thanks for looking!
>>>>> 
>>>>> Cheers
>>>>> Simon
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, 11 Dec 2013 06:44:42 -0500
>>>>> Matthew Von-Maszewski <matth...@basho.com> wrote:
>>>>> 
>>>>>> I need to ask other developers as they arrive for the new day.  Does not 
>>>>>> make sense to me.
>>>>>> 
>>>>>> How many nodes do you have?  How much RAM do you have in each node?  
>>>>>> What are your settings for max_open_files and cache_size in the 
>>>>>> app.config file?  Maybe this is as simple as leveldb using too much RAM 
>>>>>> in 1.4.  The memory accounting for max_open_files changed in 1.4.
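A rough way to sanity-check the "leveldb using too much RAM" theory is to estimate the file cache budget per node from ring_creation_size, the node count and max_open_files. The sketch below is only illustrative: the per-file cost (`bytes_per_file`, 4 MiB) and the number of reserved file slots (`reserved_files`, 10) are assumptions not stated anywhere in this thread and should be checked against the leveldb tuning notes for the release in use. It also ignores cache_size and any extra leveldb instances used by AAE.

```python
import math

def file_cache_estimate(ring_size, nodes, max_open_files,
                        bytes_per_file=4 * 1024 * 1024,  # ASSUMPTION: not from the thread
                        reserved_files=10):              # ASSUMPTION: not from the thread
    """Return (vnodes_per_node, estimated file cache bytes per node)."""
    # With an evenly distributed ring, some nodes own ceil(ring_size / nodes) vnodes.
    vnodes_per_node = math.ceil(ring_size / nodes)
    # Estimated file cache for a single vnode under the assumed accounting.
    per_vnode = max(max_open_files - reserved_files, 0) * bytes_per_file
    return vnodes_per_node, vnodes_per_node * per_vnode

# Values reported in this thread: ring size 256, 12 servers, max_open_files 100.
vnodes, total_bytes = file_cache_estimate(256, 12, 100)
print(f"vnodes per node: {vnodes}")
print(f"estimated file cache per node: {total_bytes / 2**20:.0f} MiB")
```

If the assumed accounting is right, the result can be compared with the physical RAM of each server to see whether the file cache alone could push a node towards swapping or cache thrashing.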
>>>>>> 
>>>>>> Matthew Von-Maszewski
>>>>>> 
>>>>>> 
>>>>>> On Dec 11, 2013, at 6:28, Simon Effenberg <seffenb...@team.mobile.de> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Matthew,
>>>>>>> 
>>>>>>> it took around 11 hours for the first node to finish the compaction. The
>>>>>>> second node has already been running for 12 hours and is still compacting.
>>>>>>> 
>>>>>>> Besides that, I wonder why the fsm_put time on the new 1.4.2 host is
>>>>>>> much higher (after the compaction) than on an old 1.3.1 host (both are
>>>>>>> running in the cluster right now, and another one is doing the
>>>>>>> compaction/upgrade while it is in the cluster but not directly
>>>>>>> accessible because it is out of the load balancer):
>>>>>>> 
>>>>>>> 1.4.2:
>>>>>>> 
>>>>>>> node_put_fsm_time_mean : 2208050
>>>>>>> node_put_fsm_time_median : 39231
>>>>>>> node_put_fsm_time_95 : 17400382
>>>>>>> node_put_fsm_time_99 : 50965752
>>>>>>> node_put_fsm_time_100 : 59537762
>>>>>>> node_put_fsm_active : 5
>>>>>>> node_put_fsm_active_60s : 364
>>>>>>> node_put_fsm_in_rate : 5
>>>>>>> node_put_fsm_out_rate : 3
>>>>>>> node_put_fsm_rejected : 0
>>>>>>> node_put_fsm_rejected_60s : 0
>>>>>>> node_put_fsm_rejected_total : 0
>>>>>>> 
>>>>>>> 
>>>>>>> 1.3.1:
>>>>>>> 
>>>>>>> node_put_fsm_time_mean : 5036
>>>>>>> node_put_fsm_time_median : 1614
>>>>>>> node_put_fsm_time_95 : 8789
>>>>>>> node_put_fsm_time_99 : 38258
>>>>>>> node_put_fsm_time_100 : 384372
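The two sets of put FSM timings are easier to compare once converted to milliseconds and expressed as ratios. The sketch below assumes the node_put_fsm_time_* values are reported in microseconds, which is consistent with the roughly 17000ms 95th percentile mentioned elsewhere in this thread. The dictionary and function names are made up for illustration.

```python
# Put FSM timings copied from the two `riak-admin status` excerpts above.
# ASSUMPTION: values are in microseconds.
STATS_US = {
    "1.4.2": {"mean": 2208050, "median": 39231, "95": 17400382,
              "99": 50965752, "100": 59537762},
    "1.3.1": {"mean": 5036, "median": 1614, "95": 8789,
              "99": 38258, "100": 384372},
}

def us_to_ms(value_us):
    """Convert microseconds to milliseconds."""
    return value_us / 1000.0

def slowdown(stat):
    """How many times slower the 1.4.2 node is than the 1.3.1 node for one stat."""
    return STATS_US["1.4.2"][stat] / STATS_US["1.3.1"][stat]

for stat in ("mean", "median", "95", "99", "100"):
    new_ms = us_to_ms(STATS_US["1.4.2"][stat])
    old_ms = us_to_ms(STATS_US["1.3.1"][stat])
    print(f"{stat:>6}: 1.4.2 {new_ms:10.1f} ms | 1.3.1 {old_ms:8.1f} ms | x{slowdown(stat):.0f}")
```

Read this way, the median put on the 1.4.2 node is tens of times slower than on the 1.3.1 node, while the tail (95th percentile and above) is in the range of seconds to a minute.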
>>>>>>> 
>>>>>>> 
>>>>>>> any clue why this could/should be?
>>>>>>> 
>>>>>>> Cheers
>>>>>>> Simon
>>>>>>> 
>>>>>>> On Tue, 10 Dec 2013 17:21:07 +0100
>>>>>>> Simon Effenberg <seffenb...@team.mobile.de> wrote:
>>>>>>> 
>>>>>>>> Hi Matthew,
>>>>>>>> 
>>>>>>>> thanks!.. that answers my questions!
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> Simon
>>>>>>>> 
>>>>>>>> On Tue, 10 Dec 2013 11:08:32 -0500
>>>>>>>> Matthew Von-Maszewski <matth...@basho.com> wrote:
>>>>>>>> 
>>>>>>>>> 2i is not my expertise, so I had to discuss your concerns with another 
>>>>>>>>> Basho developer.  He says:
>>>>>>>>> 
>>>>>>>>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk 
>>>>>>>>> format.  You must wait for all nodes to update if you desire to use 
>>>>>>>>> the new 2i query.  The 2i data will properly write/update on both 1.3 
>>>>>>>>> and 1.4 machines during the migration.
>>>>>>>>> 
>>>>>>>>> Does that answer your question?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> And yes, you might see available disk space increase during the 
>>>>>>>>> upgrade compactions if your dataset contains numerous delete 
>>>>>>>>> "tombstones".  The Riak 2.0 code includes a new feature called 
>>>>>>>>> "aggressive delete" for leveldb.  This feature is more proactive in 
>>>>>>>>> pushing delete tombstones through the levels to free up disk space 
>>>>>>>>> much more quickly (especially if you perform block deletes every now 
>>>>>>>>> and then).
>>>>>>>>> 
>>>>>>>>> Matthew
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg 
>>>>>>>>> <seffenb...@team.mobile.de> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Matthew,
>>>>>>>>>> 
>>>>>>>>>> see inline..
>>>>>>>>>> 
>>>>>>>>>> On Tue, 10 Dec 2013 10:38:03 -0500
>>>>>>>>>> Matthew Von-Maszewski <matth...@basho.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> The sad truth is that you are not the first to see this problem.  
>>>>>>>>>>> And yes, it has to do with your 950GB per node dataset.  And no, 
>>>>>>>>>>> nothing to do but sit through it at this time.
>>>>>>>>>>> 
>>>>>>>>>>> While I did extensive testing around upgrade times before shipping 
>>>>>>>>>>> 1.4, apparently there are data configurations I did not anticipate. 
>>>>>>>>>>>  You are likely seeing a cascade where a shift of one file from 
>>>>>>>>>>> level-1 to level-2 is causing a shift of another file from level-2 
>>>>>>>>>>> to level-3, which causes a level-3 file to shift to level-4, etc … 
>>>>>>>>>>> then the next file shifts from level-1.
>>>>>>>>>>> 
>>>>>>>>>>> The bright side of this pain is that you will end up with better 
>>>>>>>>>>> write throughput once all the compaction ends.
>>>>>>>>>> 
>>>>>>>>>> I have to deal with that.. but my problem now is that, if I'm doing this
>>>>>>>>>> node by node, it looks like 2i searches aren't possible while 1.3 and
>>>>>>>>>> 1.4 nodes exist in the cluster. Is there any problem which leads me to
>>>>>>>>>> a 2i repair marathon, or could I simply wait some hours for each
>>>>>>>>>> node until all merges are done before I upgrade the next one? (2i
>>>>>>>>>> searches can fail for some time.. the app isn't having problems with
>>>>>>>>>> that, but are new inserts with 2i indices processed successfully or do
>>>>>>>>>> I have to do the 2i repair?)
>>>>>>>>>> 
>>>>>>>>>> /s
>>>>>>>>>> 
>>>>>>>>>> one other good thing: saving disk space is one advantage ;)..
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Riak 2.0's leveldb has code to prevent/reduce compaction cascades, 
>>>>>>>>>>> but that is not going to help you today.
>>>>>>>>>>> 
>>>>>>>>>>> Matthew
>>>>>>>>>>> 
>>>>>>>>>>> On Dec 10, 2013, at 10:26 AM, Simon Effenberg 
>>>>>>>>>>> <seffenb...@team.mobile.de> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi @list,
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2 .. after
>>>>>>>>>>>> upgrading the first node (out of 12) this node seems to do many merges.
>>>>>>>>>>>> The sst_* directories change in size "rapidly" and the node has
>>>>>>>>>>>> a disk utilization of 100% all the time.
>>>>>>>>>>>> 
>>>>>>>>>>>> I know that there is something like that:
>>>>>>>>>>>> 
>>>>>>>>>>>> "The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x dataset
>>>>>>>>>>>> will initiate an automatic conversion that could pause the startup of
>>>>>>>>>>>> each node by 3 to 7 minutes. The leveldb data in "level #1" is being
>>>>>>>>>>>> adjusted such that "level #1" can operate as an overlapped data level
>>>>>>>>>>>> instead of as a sorted data level. The conversion is simply the
>>>>>>>>>>>> reduction of the number of files in "level #1" to being less than eight
>>>>>>>>>>>> via normal compaction of data from "level #1" into "level #2". This is
>>>>>>>>>>>> a one time conversion."
>>>>>>>>>>>> 
>>>>>>>>>>>> but it looks much more invasive than explained here, or it has nothing
>>>>>>>>>>>> to do with the merges I am (probably) seeing.
>>>>>>>>>>>> 
>>>>>>>>>>>> Is this "normal" behavior or could I do anything about it?
>>>>>>>>>>>> 
>>>>>>>>>>>> At the moment I'm stuck with the upgrade procedure because this high
>>>>>>>>>>>> IO load would probably lead to high response times.
>>>>>>>>>>>> 
>>>>>>>>>>>> Also we have a lot of data (per node ~950 GB).
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Simon
>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> riak-users mailing list
>>>>>>>>>>>> riak-users@lists.basho.com
>>>>>>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -- 
>>>>>>>>>> Simon Effenberg | Site Ops Engineer | mobile.international GmbH
>>>>>>>>>> Fon:     + 49-(0)30-8109 - 7173
>>>>>>>>>> Fax:     + 49-(0)30-8109 - 7131
>>>>>>>>>> 
>>>>>>>>>> Mail:     seffenb...@team.mobile.de
>>>>>>>>>> Web:    www.mobile.de
>>>>>>>>>> 
>>>>>>>>>> Marktplatz 1 | 14532 Europarc Dreilinden | Germany
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Geschäftsführer: Malte Krüger
>>>>>>>>>> HRB Nr.: 18517 P, Amtsgericht Potsdam
>>>>>>>>>> Sitz der Gesellschaft: Kleinmachnow 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


