Hi Matthew, thank you for the information.
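
For the archive, here is roughly how I plan to patch the receiving nodes,
based on Matthew's advice below. This is an untested sketch: the 2.0.9 tag is
the one Matthew names, but the build and install steps (and the library path)
are my own assumptions and may differ per install.

# build eleveldb from the 2.0.9 tag (pulls in the patched basho/leveldb)
git clone https://github.com/basho/eleveldb.git
cd eleveldb
git checkout 2.0.9
make

# stop the node, swap in the freshly built beams and NIF, restart
riak stop
cp -a ebin priv /usr/lib/riak/lib/eleveldb-*/   # target path is an assumption
riak start
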
On Thu, Oct 29, 2015 at 8:19 PM Matthew Von-Maszewski <matth...@basho.com> wrote:

> I queried Basho's Client Services team. They tell me the upgrade/coexist
> should be no problem.
>
> Matthew
>
> On Oct 29, 2015, at 1:38 PM, Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
>
> Matthew, can you describe the bug in more detail?
>
> My plan was to migrate to eleveldb first and only then to Riak 2.0. It
> seems that I need to change my plans and migrate to Riak 2.0 first, which
> is unfortunate.
>
> Is it safe to migrate Riak 1.4.12/Riak CS 1.5.0 to Riak 2.0 in a
> production environment? According to the official upgrade guides I can
> upgrade nodes one by one in the same cluster, so Riak 2.0 and Riak 1.4.12
> nodes can coexist in one cluster. Am I right?
>
> Thank you.
>
> On Thu, Oct 29, 2015 at 7:04 PM Matthew Von-Maszewski <matth...@basho.com> wrote:
>
>> Sad to say, your LOG files suggest the same bug as seen elsewhere and
>> fixed by recent changes in the leveldb code.
>>
>> The tougher issue is that the fixes are currently only available for our
>> 2.0 product series. A backport would be non-trivial due to the number of
>> places changed between 1.4 and 2.0 and the number of places the fix
>> overlaps those changes. The corrected code is tagged "2.0.9" in eleveldb
>> and leveldb.
>>
>> The only path readily available to you is to have your receiving cluster
>> upgraded to 2.0 Riak CS and manually build/patch eleveldb to the 2.0.9
>> version. Then start your handoffs. (eleveldb version 2.0.9 is not present
>> in any shipping version of Riak … yet.)
>>
>> I will write again if I can think of an easier solution, but nothing is
>> occurring to me or the team members I have queried.
>>
>> Matthew
>>
>> On Oct 29, 2015, at 12:14 PM, Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
>>
>> Hi,
>>
>> Matthew, thank you for your answer. The eleveldb LOGs are attached.
>> Here are the LOGs from two eleveldb nodes (eggeater was not restarted;
>> I'm not sure about rattlesnake).
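>>
>> In case it helps anyone else, I collected the LOGs roughly like this (a
>> sketch: the data_root comes from my config further down, and it assumes
>> each vnode directory keeps its LOG files directly inside it):
>>
>> # bundle every leveldb LOG/LOG.old from the vnode directories
>> tar czf eggeater-leveldb-logs.tar.gz \
>>     $(find /var/lib/riak/leveldb -maxdepth 2 -name 'LOG*')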
>>
>> On Thu, Oct 29, 2015 at 5:24 PM Matthew Von-Maszewski <matth...@basho.com> wrote:
>>
>>> Hi,
>>>
>>> There was a known eleveldb bug with handoff receiving that could cause
>>> a timeout, but it does not sound like that bug fits your symptoms.
>>> However, I am willing to verify my diagnosis. I would need you to gather
>>> the LOG files from all vnodes on the RECEIVING side (or at least from
>>> the vnode that you are attempting and is failing).
>>>
>>> I will check it for the symptoms of the known bug.
>>>
>>> Note: the LOG files reset on each restart of Riak. So you must gather
>>> the LOG files right after the failure, without restarting Riak.
>>>
>>> Matthew
>>>
>>> On Oct 29, 2015, at 11:11 AM, Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
>>>
>>> Hi,
>>>
>>> I want to give a small update. Jon, your hint about problems on the
>>> sender side is correct. As I've already mentioned, there are problems
>>> with available resources on the sender nodes. There is not enough
>>> available RAM, which causes swapping and load on the disks. Restarting
>>> the sender nodes helps me (at least temporarily).
>>>
>>> On Thu, Oct 29, 2015 at 12:19 PM Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
>>>
>>>> Hi,
>>>>
>>>> The average size of objects in Riak is 300 KB. These objects are
>>>> images. This data is updated very, very rarely (there are almost no
>>>> updates).
>>>>
>>>> I have GC turned on, and it works:
>>>>
>>>> root@python:~# riak-cs-gc status
>>>> There is no garbage collection in progress
>>>> The current garbage collection interval is: 900
>>>> The current garbage collection leeway time is: 86400
>>>> Last run started at: 20151029T100600Z
>>>> Next run scheduled for: 20151029T102100Z
>>>>
>>>> No network misconfigurations were detected. The result of your script
>>>> shows correct info.
>>>>
>>>> But I see that almost all nodes with bitcask suffer from low free
>>>> memory, and they swap. I think this can be an issue, but my question
>>>> is: what is the workaround for this problem?
>>>>
>>>> I wrote in my first post that I tuned handoff_timeout and
>>>> handoff_receive_timeout (now these values are 300000 and 600000), but
>>>> the situation is the same.
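>>>>
>>>> For completeness, this is roughly the relevant fragment of my
>>>> app.config (only these two lines; that they belong in the riak_core
>>>> section is my assumption about where they live, and both values are in
>>>> milliseconds):
>>>>
>>>> {riak_core, [
>>>>     %% sender-side handoff timeout: 5 minutes
>>>>     {handoff_timeout, 300000},
>>>>     %% receiver gives up after 10 minutes without incoming data
>>>>     {handoff_receive_timeout, 600000},
>>>>     ...
>>>> ]},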
>>>>
>>>> On Tue, Oct 27, 2015 at 4:06 PM Jon Meredith <jmered...@basho.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Handoff problems without obvious disk issues can be due to the
>>>>> database containing large objects. Do you frequently update objects in
>>>>> CS, and if so, have you had garbage collection running?
>>>>>
>>>>> The timeout is happening on the receiver side after not receiving any
>>>>> TCP data for handoff_receive_timeout *milli*seconds. I know you said
>>>>> you increased it, but not how high. I would bump it up to 300000 to
>>>>> give the sender a chance to read larger objects off disk.
>>>>>
>>>>> To check whether the sender is transmitting, on the source node you
>>>>> could run:
>>>>>
>>>>> redbug:start("riak_core_handoff_sender:visit_item",
>>>>>              [{arity, true},
>>>>>               {print_file, "/tmp/visit_item.log"},
>>>>>               {time, 3600000},
>>>>>               {msgs, 1000000}]).
>>>>>
>>>>> That file should fill fairly fast, with an entry for every object the
>>>>> sender tries to transmit.
>>>>>
>>>>> There's a long shot it could be a network misconfiguration. Run this
>>>>> from the source node having problems:
>>>>>
>>>>> rpc:multicall(erlang, apply,
>>>>>   [fun() ->
>>>>>        TargetNode = node(),
>>>>>        [_Name, Host] = string:tokens(atom_to_list(TargetNode), "@"),
>>>>>        {ok, Port} = riak_core_gen_server:call(
>>>>>                       {riak_core_handoff_listener, TargetNode},
>>>>>                       handoff_port),
>>>>>        HandoffIP = riak_core_handoff_listener:get_handoff_ip(),
>>>>>        TNHandoffIP = case HandoffIP of
>>>>>                          error -> Host;
>>>>>                          {ok, "0.0.0.0"} -> Host;
>>>>>                          {ok, Other} -> Other
>>>>>                      end,
>>>>>        {node(), HandoffIP, TNHandoffIP,
>>>>>         inet:gethostbyname(TNHandoffIP), Port}
>>>>>    end, []]).
>>>>>
>>>>> and it will print out a list of remote nodes and IP addresses (and
>>>>> hopefully an empty list of failed nodes):
>>>>>
>>>>> {[{'dev1@127.0.0.1',  <---- node name
>>>>>    {ok,"0.0.0.0"},    <---- handoff IP address configured in app.config
>>>>>    "127.0.0.1",       <---- hostname passed to socket open
>>>>>    {ok,{hostent,"127.0.0.1",[],inet,4,[{127,0,0,1}]}},  <---- DNS entry for hostname
>>>>>    10019}],           <---- handoff port
>>>>>  []}                  <---- empty list of errors
>>>>>
>>>>> Good luck, Jon.
>>>>>
>>>>> On Tue, Oct 27, 2015 at 3:55 AM Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Jon, thank you for the answer. While my mail to this list was
>>>>>> awaiting approval, I troubleshot my issue more deeply. And yes, you
>>>>>> are right: neither {error, enotconn} nor max_concurrency is my
>>>>>> problem.
>>>>>>
>>>>>> I'm going to migrate my cluster entirely to eleveldb, i.e. I need to
>>>>>> stop using bitcask. I had a talk with Basho support, and they said
>>>>>> that it is tricky to tune bitcask on servers with 32 GB RAM (and I
>>>>>> guess that it is not just tricky but impossible, because bitcask
>>>>>> loads all keys into memory regardless of available free RAM). With
>>>>>> LevelDB I have the opportunity to tune RAM usage on the servers.
>>>>>>
>>>>>> So I have 15 nodes with the multi backend (bitcask for data and
>>>>>> leveldb for metadata). 2 additional servers are without the multi
>>>>>> backend, with leveldb only. Now I'm not sure whether I still need the
>>>>>> multi backend on the leveldb-only nodes.
>>>>>>
>>>>>> And my problem is (as I mentioned earlier) the following: on the
>>>>>> leveldb-only nodes I see handoffs time out and make no further
>>>>>> progress.
>>>>>>
>>>>>> On the multi-backend hosts I have this configuration:
>>>>>>
>>>>>> {riak_kv, [
>>>>>>     {add_paths, ["/usr/lib/riak-cs/lib/riak_cs-1.5.0/ebin"]},
>>>>>>     {storage_backend, riak_cs_kv_multi_backend},
>>>>>>     {multi_backend_prefix_list, [{<<"0b:">>, be_blocks}]},
>>>>>>     {multi_backend_default, be_default},
>>>>>>     {multi_backend, [
>>>>>>         {be_default, riak_kv_eleveldb_backend, [
>>>>>>             {max_open_files, 50},
>>>>>>             {data_root, "/var/lib/riak/leveldb"}
>>>>>>         ]},
>>>>>>         {be_blocks, riak_kv_bitcask_backend, [
>>>>>>             {data_root, "/var/lib/riak/bitcask"}
>>>>>>         ]}
>>>>>>     ]},
>>>>>>
>>>>>> And for the hosts with the leveldb-only backend:
>>>>>>
>>>>>> {riak_kv, [
>>>>>>     {storage_backend, riak_kv_eleveldb_backend},
>>>>>>     ...
>>>>>> {eleveldb, [
>>>>>>     {data_root, "/var/lib/riak/leveldb"}
>>>>>>     (default values for leveldb)
>>>>>>
>>>>>> In the leveldb logs I see nothing that could help me (no errors in
>>>>>> the logs).
>>>>>>
>>>>>> On Mon, Oct 26, 2015 at 3:57 PM Jon Meredith <jmered...@basho.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I suspect your {error,enotconn} messages are unrelated - that's
>>>>>>> likely to be caused by an HTTP client closing the connection while
>>>>>>> Riak looks up some networking information about the requestor.
>>>>>>>
>>>>>>> The max_concurrency message you are seeing is related to the handoff
>>>>>>> transfer limit - it should be labelled as informational. When a node
>>>>>>> has data to hand off, it starts the handoff sender process, and if
>>>>>>> there are either too many local handoff processes or too many on the
>>>>>>> remote side, it exits with max_concurrency. You could increase the
>>>>>>> limit with riak-admin transfer-limit, but that probably won't help
>>>>>>> if you're timing out.
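>>>>>>>
>>>>>>> For example (syntax from memory, so double-check against riak-admin
>>>>>>> help on your version; riak@host below is just a placeholder node
>>>>>>> name):
>>>>>>>
>>>>>>> riak-admin transfer-limit                # show current per-node limits
>>>>>>> riak-admin transfer-limit 4              # set the limit cluster-wide
>>>>>>> riak-admin transfer-limit riak@host 4    # set the limit on one node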
>>>>>>>
>>>>>>> As you're using the multi-backend, you're transferring data from
>>>>>>> both bitcask and leveldb. The next place I would look is in the
>>>>>>> leveldb LOG files, to see if there are any leveldb vnodes having
>>>>>>> problems that prevent repair.
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> On Mon, Oct 26, 2015 at 7:15 AM Vladyslav Zakhozhai <v.zakhoz...@smartweb.com.ua> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have a problem with persistent timeouts during ownership
>>>>>>>> handoffs. I've searched the Internet and this mailing list, but
>>>>>>>> with no success.
>>>>>>>>
>>>>>>>> I have a Riak 1.4.12 cluster with 17 nodes. Almost all nodes use
>>>>>>>> the multi backend with bitcask and eleveldb as storage backends (we
>>>>>>>> need multiple backends for the Riak CS 1.5.0 integration).
>>>>>>>>
>>>>>>>> Now I'm working on migrating the Riak cluster to eleveldb as the
>>>>>>>> primary and only backend. For now I have 2 nodes with the eleveldb
>>>>>>>> backend in the same cluster.
>>>>>>>>
>>>>>>>> During the ownership handoff process I constantly see errors from
>>>>>>>> timed-out handoff receivers and senders.
>>>>>>>>
>>>>>>>> Here is partial output of riak-admin transfers:
>>>>>>>>
>>>>>>>> ...
>>>>>>>> transfer type: ownership_transfer
>>>>>>>> vnode type: riak_kv_vnode
>>>>>>>> partition: 331121464707782692405522344912282871640797216768
>>>>>>>> started: 2015-10-21 08:32:55 [46.66 min ago]
>>>>>>>> last update: no updates seen
>>>>>>>> total size: unknown
>>>>>>>> objects transferred: unknown
>>>>>>>>
>>>>>>>> unknown
>>>>>>>> riak@taipan.pleiad.uaprom =======> r...@eggeater.pleiad.uaprom
>>>>>>>> |                                                       |   0%
>>>>>>>> unknown
>>>>>>>>
>>>>>>>> transfer type: ownership_transfer
>>>>>>>> vnode type: riak_kv_vnode
>>>>>>>> partition: 336830455478606531929755488790080852186328203264
>>>>>>>> started: 2015-10-21 08:32:54 [46.68 min ago]
>>>>>>>> last update: no updates seen
>>>>>>>> total size: unknown
>>>>>>>> objects transferred: unknown
>>>>>>>> ...
>>>>>>>>
>>>>>>>> The state of some partition handoffs never updates; some handoffs
>>>>>>>> terminate after transferring part of their objects and never start
>>>>>>>> again.
>>>>>>>>
>>>>>>>> I see nothing in the logs but the following.
>>>>>>>>
>>>>>>>> On the receiver side:
>>>>>>>>
>>>>>>>> 2015-10-21 11:33:55.131 [error] <0.25390.1266>@riak_core_handoff_receiver:handle_info:105 Handoff receiver for partition 331121464707782692405522344912282871640797216768 timed out after processing 0 objects.
>>>>>>>>
>>>>>>>> On the sender side:
>>>>>>>>
>>>>>>>> 2015-10-21 11:01:58.879 [error] <0.13177.1401> CRASH REPORT Process <0.13177.1401> with 0 neighbours crashed with reason: no function clause matching webmachine_request:peer_from_peername({error,enotconn}, {webmachine_request,{wm_reqstate,#Port<0.50978116>,[],undefined,undefined,undefined,{wm_reqdata,...},...}}) line 150
>>>>>>>> 2015-10-21 11:32:50.055 [error] <0.207.0> Supervisor riak_core_handoff_sender_sup had child riak_core_handoff_sender started with {riak_core_handoff_sender,start_link,undefined} at <0.22312.1090> exit with reason max_concurrency in context child_terminated
>>>>>>>>
>>>>>>>> {error, enotconn} seems to be a network issue, but I have no
>>>>>>>> problems with the network. All hosts resolve their neighbors
>>>>>>>> correctly, and /etc/hosts on each node is correct.
>>>>>>>>
>>>>>>>> I've tried to increase handoff_timeout and handoff_receive_timeout,
>>>>>>>> but with no success.
>>>>>>>>
>>>>>>>> Forcing handoffs helped me, but only for a short period of time:
>>>>>>>>
>>>>>>>> rpc:multicall([node() | nodes()], riak_core_vnode_manager,
>>>>>>>>               force_handoffs, []).
>>>>>>>>
>>>>>>>> I see progress in the handoffs (riak-admin transfers), but then I
>>>>>>>> see handoffs time out again.
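>>>>>>>>
>>>>>>>> To watch for that, I just poll the transfer status in a loop from a
>>>>>>>> shell on one of the nodes (a trivial sketch):
>>>>>>>>
>>>>>>>> while true; do riak-admin transfers; sleep 60; done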
>>>>>>>>
>>>>>>>> A week ago I joined 4 nodes with bitcask, and there were no such
>>>>>>>> problems.
>>>>>>>>
>>>>>>>> I'm a little confused and need to understand my next steps in
>>>>>>>> troubleshooting this issue.
>>>
>>> <eggeater-leveldb-logs-old.tar.gz><rattlesnake-leveldb-logs-old.tar.gz>
>>
>> <rattlesnake-leveldb-logs.tar.gz><eggeater-leveldb-logs.tar.gz>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com