Re: [Sequoia] Failure detection

Francis, Seby Mon, 10 May 2010 04:07:30 -0700

Sure, I'll add it and run a new test today. I'll get back to you with the 
result.


Thanks,
Seby.

-----Original Message-----
From: sequoia-boun...@lists.forge.continuent.org 
[mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of Emmanuel 
Cecchet
Sent: Monday, May 10, 2010 7:04 AM
To: Sequoia general mailing list
Cc: sequoiadb-disc...@lists.sourceforge.net
Subject: Re: [Sequoia] Failure detection

Hi Francis,

Yes you can send me the distributed virtual database log to double
check. But as Hedera did not catch the JGroups messages I don't expect
the virtual database to have executed any handler.
Don't wait for my analysis of the logs and go ahead with the new experiment.

Thanks again for your feedback
Emmanuel
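
For reference, the Distributed Virtual Database logging discussed in this thread could be switched to DEBUG with a log4j.properties fragment along these lines. This is only a sketch. The logger name is the one quoted later in the thread. The Console and Filetrace appender names and the additivity line are assumptions, copied from the pattern of Seby's Hedera/JGroups snippet, and they must match appenders actually defined in your own file.

```properties
# Sequoia distributed virtual database logger (name as quoted in this thread)
# Assumption: the Console and Filetrace appenders are defined elsewhere in log4j.properties
log4j.logger.org.continuent.sequoia.controller.virtualdatabase=DEBUG, Console, Filetrace
# Assumption: disable additivity, as in the Hedera entries, so messages are not logged twice
log4j.additivity.org.continuent.sequoia.controller.virtualdatabase=false
```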

> Yes, it is enabled. Also, I'm going to try the latest hedera and will
> let you know the result. Do you need me to check the other loggings
> before I run the test?
>
> Thanks,
>
> Seby.
>
> -----Original Message-----
> From: sequoia-boun...@lists.forge.continuent.org
> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of
> Emmanuel Cecchet
> Sent: Saturday, May 08, 2010 3:54 PM
> To: Sequoia general mailing list
> Cc: sequoiadb-disc...@lists.sourceforge.net
> Subject: Re: [Sequoia] Failure detection
>
> Hi Francis,
>
> Do you have the traces with
>
> log4j.logger.org.continuent.sequoia.controller.virtualdatabase set to
> DEBUG?
>
> Could you also try with the latest version of Hedera?
>
> Sorry for the lag in my responses; I have been swamped since I got back!
>
> Emmanuel
>
> > Hello Emmanuel,
> >
> > Yes, all were in debug. Here is the snippet:
> >
> > ######################################
> > # Hedera group communication loggers #
> > ######################################
> > # Hedera channels test #
> > log4j.logger.test.org.continuent.hedera.channel=DEBUG, Console, Filetrace
> > log4j.additivity.test.org.continuent.hedera.channel=false
> > # Hedera adapters #
> > log4j.logger.org.continuent.hedera.adapters=DEBUG, Console, Filetrace
> > log4j.additivity.org.continuent.hedera.adapters=false
> > # Hedera factories #
> > log4j.logger.org.continuent.hedera.factory=DEBUG, Console, Filetrace
> > log4j.additivity.org.continuent.hedera.factory=false
> > # Hedera channels #
> > log4j.logger.org.continuent.hedera.channel=DEBUG, Console, Filetrace
> > log4j.additivity.org.continuent.hedera.channel=false
> > # Hedera Group Membership Service #
> > log4j.logger.org.continuent.hedera.gms=DEBUG, Console, Filetrace
> > log4j.additivity.org.continuent.hedera.gms=false
> > # JGroups
> > log4j.logger.org.jgroups=DEBUG, Console, Filetrace
> > log4j.additivity.org.jgroups=false
> > # JGroups protocols
> > log4j.logger.org.jgroups.protocols=DEBUG, Console, Filetrace
> > log4j.additivity.org.jgroups.protocols=false
> > ######################################
> >
> > I have the distributed logs for the same time frame. Let me know if you
> > need them.
> >
> > No, Hedera was not updated.
> >
> > Thanks,
> > Seby.
>
> > -----Original Message-----
> > From: sequoia-boun...@lists.forge.continuent.org
> > [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of
> > Emmanuel Cecchet
> > Sent: Tuesday, May 04, 2010 6:20 AM
> > To: Sequoia general mailing list
> > Cc: sequoiadb-disc...@lists.sourceforge.net
> > Subject: Re: [Sequoia] Failure detection
> >
> > Hi Seby,
> >
> > When JGroups reported the MERGE messages in the log, did you have Hedera
> > DEBUG logs enabled too? If so, the message was never handled by Hedera,
> > which is a problem. The new view should have been installed anyway by
> > the view synchrony layer, and Hedera should at least catch that.
> > Can you confirm that the Hedera logs are enabled?
> > Could you also set the Distributed Virtual Database logs to DEBUG?
> > Did you try to update Hedera to a newer version?
> >
> > Thanks
> > Emmanuel
>
> >
>
> >
>
> >> Hi Emmanuel,
> >>
> >> Do you need more logs on this? Please let me know.
> >>
> >> Thanks,
> >> Seby.
> >>
> >> -----Original Message-----
> >> From: sequoia-boun...@lists.forge.continuent.org
> >> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of
> >> Francis, Seby
> >> Sent: Monday, March 29, 2010 1:51 PM
> >> To: Sequoia general mailing list
> >> Cc: sequoiadb-disc...@lists.sourceforge.net
> >> Subject: Re: [Sequoia] Failure detection
> >>
> >> Hi Emmanuel,
> >>
> >> I've tried a different JGroups configuration, and now I can see in the
> >> logs that the groups are merging. But for some reason, Sequoia never
> >> shows that it has merged, i.e., when I run 'show controllers' on the
> >> console I see only that particular host. Below is a snippet from one of
> >> the hosts. I see similar output on the other host showing the merge.
> >> Let me know if you would like to see the debug logs for that time frame.
>
> >>
>
> >> 2010-03-29 06:59:45,683 DEBUG jgroups.protocols.VERIFY_SUSPECT diff=1507, mbr 10.0.0.33:35974 is dead (passing up SUSPECT event)
> >> 2010-03-29 06:59:45,687 DEBUG continuent.hedera.gms JGroups reported suspected member:10.0.0.33:35974
> >> 2010-03-29 06:59:45,688 DEBUG continuent.hedera.gms Member(address=/10.0.0.33:35974, uid=db2) leaves Group(gid=db2).
> >>
> >> 2010-03-29 06:59:45,868 INFO controller.requestmanager.cleanup Waiting 30000ms for client of controller 562949953421312 to failover
> >> 2010-03-29 07:00:15,875 INFO controller.requestmanager.cleanup Cleanup for controller 562949953421312 failure is completed.
> >>
> >> -----
> >> 2010-03-29 07:03:14,725 DEBUG protocols.pbcast.GMS I (10.0.0.23:49731) will be the leader. Starting the merge task for [10.0.0.33:35974, 10.0.0.23:49731]
> >> 2010-03-29 07:03:14,726 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 running merge task, coordinators are [10.0.0.33:35974, 10.0.0.23:49731]
> >> 2010-03-29 07:03:14,730 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 sending MERGE_REQ to [10.0.0.33:35974, 10.0.0.23:49731]
> >> 2010-03-29 07:03:14,746 DEBUG jgroups.protocols.UDP sending msg to 10.0.0.23:49731, src=10.0.0.23:49731, headers are GMS: GmsHeader[MERGE_RSP]: view=[10.0.0.23:49731|2] [10.0.0.23:49731], digest=10.0.0.23:49731: [44 : 47 (47)], merge_rejected=false, merge_id=[10.0.0.23:49731|1269860594727], UNICAST: [UNICAST: DATA, seqno=4], UDP: [channel_name=db2]
> >> 2010-03-29 07:03:14,748 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 responded to 10.0.0.23:49731, merge_id=[10.0.0.23:49731|1269860594727]
> >> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 expects 2 responses, so far got 2 responses
> >> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 collected 2 merge response(s) in 36 ms
> >> 2010-03-29 07:03:14,772 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 computed new merged view that will be MergeView::[10.0.0.23:49731|3] [10.0.0.23:49731, 10.0.0.33:35974], subgroups=[[10.0.0.23:49731|2] [10.0.0.23:49731], [10.0.0.33:35974|2] [10.0.0.33:35974]]
> >> 2010-03-29 07:03:14,773 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 is sending merge view [10.0.0.23:49731|3] to coordinators [10.0.0.33:35974, 10.0.0.23:49731
>
> >>
>
> >> Seby.
>
> >>
>
> >> -----Original Message-----
> >> From: sequoia-boun...@lists.forge.continuent.org
> >> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of
> >> Emmanuel Cecchet
> >> Sent: Wednesday, March 24, 2010 10:41 AM
> >> To: Sequoia general mailing list
> >> Cc: sequoiadb-disc...@lists.sourceforge.net
> >> Subject: Re: [Sequoia] Failure detection
> >>
> >> Hi Seby,
> >>
> >> Sorry for the late reply, I have been very busy these past days.
> >> This seems to be a JGroups issue that could probably be better answered
> >> by Bela Ban on the JGroups mailing list. I have seen emails on the list
> >> these past days from people having a similar problem.
> >> I would recommend that you post an email on the JGroups mailing list
> >> with your JGroups configuration and the messages you see regarding
> >> MERGE failing.
> >>
> >> Keep me posted
> >> Emmanuel
>
> >>
>
> >>
>
> >>
>
> >>> Also, here is the error which I see from the logs:
> >>>
> >>> 2010-03-22 08:31:15,912 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 1 responses
> >>> 2010-03-22 08:31:15,913 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 waiting 382 msecs for merge responses
> >>> 2010-03-22 08:31:16,313 DEBUG protocols.pbcast.GMS At 10.10.10.23:39729 cancelling merge due to timer timeout (5000 ms)
> >>> 2010-03-22 08:31:16,314 DEBUG protocols.pbcast.GMS cancelling merge (merge_id=[10.10.10.23:39729|1269261071286])
> >>> 2010-03-22 08:31:16,316 DEBUG protocols.pbcast.GMS resumed ViewHandler
> >>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 0 responses
> >>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 collected 0 merge response(s) in 5027 ms
> >>> 2010-03-22 08:31:16,318 WARN protocols.pbcast.GMS Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [10.10.10.33:38822, 10.10.10.23:39729]
>
> >>>
>
> >>> -----Original Message-----
> >>> From: Francis, Seby
> >>> Sent: Monday, March 22, 2010 1:03 PM
> >>> To: 'Sequoia general mailing list'
> >>> Cc: sequoiadb-disc...@lists.sourceforge.net
> >>> Subject: RE: [Sequoia] Failure detection
> >>>
> >>> Hi Emmanuel,
> >>>
> >>> I've updated my JGroups to the version you mentioned, but I still see
> >>> the issue with merging the groups. One of the controllers lost track
> >>> after the failure and won't merge. Can you please give me a hand to
> >>> figure out where it goes wrong? I have the debug logs. Shall I send
> >>> them as a zip file?
> >>>
> >>> Thanks,
> >>> Seby.
>
> >>>
>
> >>> -----Original Message-----
> >>> From: sequoia-boun...@lists.forge.continuent.org
> >>> [mailto:sequoia-boun...@lists.forge.continuent.org] On Behalf Of
> >>> Emmanuel Cecchet
> >>> Sent: Thursday, March 18, 2010 10:22 PM
> >>> To: Sequoia general mailing list
> >>> Cc: sequoiadb-disc...@lists.sourceforge.net
> >>> Subject: Re: [Sequoia] Failure detection
> >>>
> >>> Hi Seby,
> >>>
> >>> I looked into the mailing list archive and this version of JGroups has
> >>> a number of significant bugs. An issue was filed
> >>> (http://forge.continuent.org/jira/browse/SEQUOIA-1130) and I fixed it
> >>> for Sequoia 4. Just using a drop-in replacement for JGroups core with
> >>> Sequoia 2.10.10 might work. You might have to update the Hedera jars as
> >>> well, but it could work with the old ones too.
> >>>
> >>> Let me know if the upgrade does not work
> >>> Emmanuel
>
> >>>
>
> >>>
>
> >>>
>
> >>>
>
> >>>> Thanks for your support!!
> >>>>
> >>>> I'm using jgroups-core.jar version 2.4.2, which came with
> >>>> "sequoia-2.10.10". My Solaris test servers have only a single interface
> >>>> and I'm using the same IP for both group and db/client communications.
> >>>> I ran a test again after removing "STATE_TRANSFER" and attached the
> >>>> logs. At around 13:36, I took the host1 interface down and brought it
> >>>> back up around 13:38. After that, when I ran 'show controllers' on the
> >>>> console, host1 showed both controllers while host2 showed only its own
> >>>> name in the member list.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Seby.
>
> >>>>
>
> >>>> -----Original Message-----
> >>>> Hi Seby,
> >>>>
> >>>> Welcome to the wonderful world of group communications!
> >>>>
> >>>>> I've tried various FD options and could not get it working when one
> >>>>> of the hosts fails. I can see the message 'A leaving group' on live
> >>>>> controller B when I shut down the interface of A. This is working as
> >>>>> expected and the virtual db is still accessible/writable as
> >>>>> controller B is alive. But when I bring the interface on A back up,
> >>>>> controller A shows (show controllers) that the virtual-db is hosted by
> >>>>> controllers A & B while controller B just shows B. And the data
> >>>>> inserted into the vdb hosted by controller B is NOT being played on A.
> >>>>> This will cause inconsistencies in the data between the virtual-dbs.
> >>>>> Is there a way we can disable the backend if the network goes down,
> >>>>> so that I can recover the db using the backup?
> >>>>
> >>>> There is a problem with your group communication configuration if
> >>>> controllers have different views of the group. That should not happen.
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>> I've also noticed that in some cases, if I take one host's
> >>>>> interface down, both of them think that the other controller failed.
> >>>>> This will also create issues. In my case, I only have two controllers
> >>>>> hosted. Is it possible to ping a network gateway? That way the
> >>>>> controller knows that it is the one which failed and can disable the
> >>>>> backend.
> >>>>
> >>>> The best solution is to use the same interface for group communication
> >>>> and client/database communications. If you use a dedicated network for
> >>>> group communications and this network fails, you will end up with a
> >>>> network partition and this is very bad. If all communications go
> >>>> through the same interface, when it goes down, all communications are
> >>>> down and the controller will not be able to serve stale data.
> >>>>
> >>>> You don't need STATE_TRANSFER as Sequoia has its own state transfer
> >>>> protocol when a new member joins a group. Which version of JGroups are
> >>>> you using? Could you send me the log with the JGroups messages that you
> >>>> see on each controller by activating them in log4j.properties? I would
> >>>> need the initial sequence when you start the cluster and the messages
> >>>> you see when the failure is detected and when the failed controller
> >>>> joins back. There might be a problem with the timeout settings of the
> >>>> different components of the stack.
> >>>>
> >>>> Keep me posted with your findings
> >>>>
> >>>> Emmanuel
>
>
> --
> Emmanuel Cecchet
> FTO @ Frog Thinker
> Open Source Development & Consulting
> --
> Web: http://www.frogthinker.org
> email: m...@frogthinker.org
> Skype: emmanuel_cecchet
>
> _______________________________________________
> Sequoia mailing list
> Sequoia@lists.forge.continuent.org
> http://forge.continuent.org/mailman/listinfo/sequoia


--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet

_______________________________________________
Sequoia mailing list
Sequoia@lists.forge.continuent.org
http://forge.continuent.org/mailman/listinfo/sequoia