Hello,

On Wed, 30 Jul 2014 05:21:18 -0400 Robert Fantini wrote:
> Christian,
> I'll start out with 4 nodes. I understand re-balancing takes time.
> [ Eventually I'll need to swap out one of the nodes with a host I'm
> using for production.. But that'll be on a Saturday afternoon.. ]
>
Your call, but it might not be pretty or short depending on the volume
of data involved.

> However I do not fully get this:
>
> > "No, the default is to split at host level. So once you have enough
> > nodes in one room to fulfill the replication level (3) some PGs will
> > be all in that location"
>
> Can you please send this:
>
> non-default firefly ceph.conf settings for a 4 node anti-cephalopod
> cluster?
>
There is nothing you need to configure in any way special with the 4
nodes you mentioned (2 in each location), provided that each holds one
OSD only.
To avoid any unwanted recovery until you've reviewed things, use
"mon osd downout subtree limit = host" in your ceph.conf or any of the
other ways discussed to disable OSDs being set out.

Christian

> I want to start my testing with close to ideal ceph settings, then do
> a lot of testing of noout and other things.
> After I'm done I'll document what was done and post it a few places.
>
> I appreciate the suggestions you've sent.
>
> kind regards, rob fantini
>
> On Tue, Jul 29, 2014 at 9:49 PM, Christian Balzer <ch...@gol.com> wrote:
>
> > Hello,
> >
> > On Tue, 29 Jul 2014 06:33:14 -0400 Robert Fantini wrote:
> >
> > > Christian -
> > > Thank you for the answer. I'll get around to reading 'Crush Maps'
> > > a few times; it is important to have a good understanding of ceph
> > > parts.
> > >
> > > So another question -
> > >
> > > As long as I keep the same number of nodes in both rooms, will
> > > firefly defaults keep data balanced?
> > >
> > No, the default is to split at host level.
> > So once you have enough nodes in one room to fulfill the replication
> > level (3) some PGs will be all in that location.
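[For readers following along: the setting Christian recommends is a
one-line ceph.conf addition. A sketch only -- the option name is quoted
verbatim from the thread, but the [global] placement and comment are my
illustration, not a complete config.]

```ini
[global]
# Keep automatic recovery from kicking in when a whole host goes down:
# OSDs under a failed host are not marked "out" automatically, so no
# rebalancing starts until an operator marks them out explicitly.
mon osd downout subtree limit = host
```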
> > >
> > > If not I'll stick with 2 each room until I understand how to
> > > configure things.
> > >
> > That will work, but I would strongly advise you to get it right from
> > the start, as in configure the Crush map to your needs, split on
> > room or such.
> >
> > Because if you introduce this change later, your data will be
> > rebalanced...
> >
> > Christian
> >
> > > On Mon, Jul 28, 2014 at 9:19 PM, Christian Balzer <ch...@gol.com>
> > > wrote:
> > >
> > > > On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:
> > > >
> > > > > "target replication level of 3"
> > > > > "with a min of 1 across the node level"
> > > > >
> > > > > After reading
> > > > > http://ceph.com/docs/master/rados/configuration/ceph-conf/ , I
> > > > > assume that to accomplish that I set these in ceph.conf?
> > > > >
> > > > > osd pool default size = 3
> > > > > osd pool default min size = 1
> > > > >
> > > > Not really, the min size specifies how few replicas need to be
> > > > online for Ceph to accept IO.
> > > >
> > > > These settings (the current Firefly defaults) with the default
> > > > crush map will have 3 sets of data spread over 3 OSDs and not
> > > > use the same node (host) more than once.
> > > > So with 2 nodes in each location, a replica will always be in
> > > > both locations. However if you add more nodes, all of them could
> > > > wind up in the same building.
> > > >
> > > > To prevent this, you have location qualifiers beyond host and
> > > > you can modify the crush map to enforce that at least one
> > > > replica is in a different rack, row, room, region, etc.
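[The host-vs-room placement Christian describes can be made concrete
with a CRUSH rule. The snippet below is a hedged sketch for a two-room,
replica-3 layout; the bucket name `default` and the rule numbering are
assumptions, not taken from this cluster. It chooses two rooms first,
then up to two hosts per room, so no single room can ever hold all
three replicas.]

```
# Hypothetical CRUSH rule: split replicas across rooms before hosts.
rule replicated_across_rooms {
        ruleset 1
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type room        # two distinct rooms
        step chooseleaf firstn 2 type host    # up to two hosts in each
        step emit
}
```

[With pool size = 3 this yields two replicas in one room and one in the
other, which is what keeps a single-room failure survivable.]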
> > > >
> > > > Advanced material, but one really needs to understand this:
> > > > http://ceph.com/docs/master/rados/operations/crush-map/
> > > >
> > > > Christian
> > > >
> > > > > On Mon, Jul 28, 2014 at 2:56 PM, Michael
> > > > > <mich...@onlinefusion.co.uk> wrote:
> > > > >
> > > > > > If you've two rooms then I'd go for two OSD nodes in each
> > > > > > room, a target replication level of 3 with a min of 1 across
> > > > > > the node level, then have 5 monitors and put the last
> > > > > > monitor outside of either room (the other MONs can share
> > > > > > with the OSD nodes if needed). Then you've got 'safe'
> > > > > > replication for OSD/node replacement on failure with some
> > > > > > 'shuffle' room for when it's needed, and either room can be
> > > > > > down while the external last monitor allows the decisions
> > > > > > required for a single room to operate.
> > > > > >
> > > > > > There's no way you can do a 3/2 MON split that doesn't risk
> > > > > > the two nodes being up and unable to serve data while the
> > > > > > three are down, so you'd need to find a way to make it a
> > > > > > 2/2/1 split instead.
> > > > > >
> > > > > > -Michael
> > > > > >
> > > > > > On 28/07/2014 18:41, Robert Fantini wrote:
> > > > > >
> > > > > > OK, for higher availability 5 nodes is better than 3, so
> > > > > > we'll run 5. However we want normal operations with just 2
> > > > > > nodes. Is that possible?
> > > > > >
> > > > > > Eventually 2 nodes will be in the next building 10 feet
> > > > > > away, with a brick wall in between, connected with
> > > > > > Infiniband or better. So one room can go offline and the
> > > > > > other will stay on. The flip of the coin means the 3 node
> > > > > > room will probably go down.
> > > > > > All systems will have dual power supplies connected to
> > > > > > different UPS's. In addition we have a power generator.
> > > > > > Later we'll have a 2nd generator, and then the UPS's will
> > > > > > use different lines attached to those generators somehow..
> > > > > > Also, of course, we never count on one cluster to have our
> > > > > > data. We have 2 co-locations with backups going to them
> > > > > > often, using zfs send/receive and/or rsync.
> > > > > >
> > > > > > So for the 5 node cluster, how do we set it so 2 nodes up =
> > > > > > OK? Or is that a bad idea?
> > > > > >
> > > > > > PS: any other ideas on how to increase availability are
> > > > > > welcome.
> > > > > >
> > > > > > On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer
> > > > > > <ch...@gol.com> wrote:
> > > > > >
> > > > > >> On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
> > > > > >>
> > > > > >> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > > > > >> >
> > > > > >> > > Hello,
> > > > > >> > >
> > > > > >> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > > > > >> > >
> > > > > >> > >> Hello Christian,
> > > > > >> > >>
> > > > > >> > >> Let me supply more info and answer some questions.
> > > > > >> > >>
> > > > > >> > >> * Our main concern is high availability, not speed.
> > > > > >> > >> Our storage requirements are not huge.
> > > > > >> > >> However we want good keyboard response 99.99% of the
> > > > > >> > >> time. We mostly do data entry and reporting: 20-25
> > > > > >> > >> users doing mostly order and invoice processing and
> > > > > >> > >> email.
> > > > > >> > >>
> > > > > >> > >> * DRBD has been very reliable, but I am the SPOF.
> > > > > >> > >> Meaning that when split brain occurs [ every 18-24
> > > > > >> > >> months ] it is me or no one who knows what to do. Try
> > > > > >> > >> to explain how to deal with split brain in advance....
> > > > > >> > >> For the future, ceph looks like it will be easier to
> > > > > >> > >> maintain.
> > > > > >> > >>
> > > > > >> > > The DRBD people would of course tell you to configure
> > > > > >> > > things in a way that a split brain can't happen. ^o^
> > > > > >> > >
> > > > > >> > > Note that given the right circumstances (too many OSDs
> > > > > >> > > down, MONs down) Ceph can wind up in a similar state.
> > > > > >> > >
> > > > > >> > I am not sure what you mean by Ceph winding up in a
> > > > > >> > similar state. If you mean 'split brain' in the usual
> > > > > >> > sense of the term, it does not occur in Ceph. If it
> > > > > >> > does, you have surely found a bug and you should let us
> > > > > >> > know with lots of CAPS.
> > > > > >> >
> > > > > >> > What you can incur, though, if you have too many monitors
> > > > > >> > down is cluster downtime. The monitors will ensure you
> > > > > >> > need a strict majority of monitors up in order to operate
> > > > > >> > the cluster, and will not serve requests if said majority
> > > > > >> > is not in place. The monitors will only serve requests
> > > > > >> > when there's a formed 'quorum', and a quorum is only
> > > > > >> > formed by (N/2)+1 monitors, N being the total number of
> > > > > >> > monitors in the cluster (via the monitor map -- monmap).
> > > > > >> >
> > > > > >> > This said, if out of 3 monitors you have 2 monitors down,
> > > > > >> > your cluster will cease functioning (no admin commands,
> > > > > >> > no writes or reads served). As there is no configuration
> > > > > >> > in which you can have two strict majorities, no two
> > > > > >> > partitions of the cluster are able to function at the
> > > > > >> > same time, so you do not incur split brain.
> > > > > >> >
> > > > > >> I wrote "similar state", not "same state".
> > > > > >>
> > > > > >> From a user perspective it is purely semantics how and why
> > > > > >> your shared storage has seized up; the end result is the
> > > > > >> same.
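[Joao's (N/2)+1 rule above is easy to sanity-check with a few lines of
Python. This is just the arithmetic from his explanation, not Ceph
code.]

```python
def quorum_size(total_monitors: int) -> int:
    """Smallest strict majority of a monitor map: floor(N/2) + 1."""
    return total_monitors // 2 + 1

def has_quorum(monitors_up: int, total_monitors: int) -> bool:
    """The cluster only serves requests while a strict majority of MONs is up."""
    return monitors_up >= quorum_size(total_monitors)

# 3 MONs: losing 2 leaves 1 < 2 required, so the cluster stops serving.
print(has_quorum(1, 3))          # False
# 5 MONs in a 2/2/1 room split: any one room down still leaves >= 3 up.
print(has_quorum(3, 5))          # True
# Two disjoint majorities are impossible: 2 * quorum_size(N) > N.
print(2 * quorum_size(5) > 5)    # True
```

[This is also why Michael's 2/2/1 suggestion earlier in the thread
works while a 3/2 split does not: with 5 monitors, losing any single
room still leaves at least 3 up.]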
> > > > > >>
> > > > > >> And yes, that MON example was exactly what I was aiming
> > > > > >> for: your cluster might still have all the data (another
> > > > > >> potential failure mode, of course), but is inaccessible.
> > > > > >>
> > > > > >> DRBD will see and call it a split brain, Ceph will call it
> > > > > >> a Paxos voting failure; it doesn't matter one iota to the
> > > > > >> poor sod relying on that particular storage.
> > > > > >>
> > > > > >> My point was and is: when you design a cluster of whatever
> > > > > >> flavor, make sure you understand how it can (and WILL)
> > > > > >> fail, how to prevent that from happening if at all
> > > > > >> possible, and how to recover from it if not.
> > > > > >>
> > > > > >> Potentially (hopefully) in the case of Ceph it would just
> > > > > >> be a matter of getting a missing MON back up.
> > > > > >> But given that the failed MON might have a corrupted
> > > > > >> leveldb (it happened to me), that will put Robert back at
> > > > > >> square one, as in, a highly qualified engineer has to deal
> > > > > >> with the issue.
> > > > > >> I.e. somebody who can say "screw this dead MON, let's get a
> > > > > >> new one in" and is capable of doing so.
> > > > > >>
> > > > > >> Regards,
> > > > > >>
> > > > > >> Christian
> > > > > >>
> > > > > >> > If you are a creative admin, however, you may be able to
> > > > > >> > force split brain by modifying monmaps. In the end you'd
> > > > > >> > obviously end up with two distinct monitor clusters, but
> > > > > >> > if you so happened to not inform the clients about this
> > > > > >> > there's a fair chance that it would cause havoc with
> > > > > >> > unforeseen effects. Then again, this would be the
> > > > > >> > operator's fault, not Ceph's -- especially because
> > > > > >> > rewriting monitor maps is not trivial enough for someone
> > > > > >> > to mistakenly do something like this.
> > > > > >> >
> > > > > >> > -Joao
> > > > > >>

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com