Out of curiosity, why are you taking a scale-up approach to building your Ceph clusters instead of a scale-out approach? Ceph has traditionally been geared towards a scale-out, simple shared-nothing mindset. These dual-ToR deployments remind me of something from EMC, not Ceph. Really curious, as I'd rather have 5-6 racks of single ToR switches than three racks of dual ToR. Is there a specific application or requirement? It's definitely adding a lot of complexity; just wondering what the payoff is.
Also, why are you putting your "cluster network" on the same physical interfaces but on separate VLANs? Traffic shaping/policing? What's your link speed there on the hosts? (25/40 Gbps?)

On Sat, Apr 22, 2017 at 12:13 PM, Aaron Bassett <aaron.bass...@nantomics.com> wrote:
> FWIW, I use a CLOS fabric with layer 3 right down to the hosts and
> multiple ToRs to enable HA/ECMP to each node. I'm using Cumulus Linux's
> "redistribute neighbor" feature, which advertises a /32 for any ARP'ed
> neighbor. I set up the hosts with an IP on each physical interface and on
> an aliased loopback: lo:0. I handle the separate cluster network by adding
> a VLAN to each interface and routing those separately on the ToRs, with
> ACLs to keep traffic apart.
>
> Their documentation may help clarify a bit:
> https://docs.cumulusnetworks.com/display/DOCS/Redistribute+Neighbor#RedistributeNeighbor-ConfiguringtheHost(s)
>
> Honestly, the trickiest part is getting the routing on the hosts right: you
> essentially set static routes over each link and the kernel takes care of
> the ECMP.
>
> I understand this is a bit different from your setup, but Ceph has no
> trouble at all with the IPs on multiple interfaces.
>
> Aaron
>
> Date: Sat, 22 Apr 2017 17:37:01 +0000
> From: Maxime Guyot <maxime.gu...@elits.com>
> To: Richard Hesse <richard.he...@weebly.com>, Jan Marquardt <j...@artfiles.de>
> Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Ceph with Clos IP fabric
>
> Hi,
>
>> That only makes sense if you're running multiple ToR switches per rack
>> for the public leaf network. Multiple public ToR switches per rack is not
>> very common; most Clos crossbar networks run a single ToR switch. Several
>> guides on the topic (including Arista & Cisco) suggest that you use
>> something like MLAG in a layer 2 domain between the switches if you need
>> some sort of switch redundancy inside the rack. This increases complexity,
>> and most people decide that it's not worth it and instead scale out across
>> racks to gain the redundancy and survivability that multiple ToR offer.
>
> If you use MLAG for L2 redundancy, you'll still want 2 BGP sessions for L3
> redundancy, so why not skip the MLAG altogether and terminate your BGP
> sessions on each ToR?
>
> Judging by the routes (169.254.0.1), you are using BGP unnumbered?
>
> It sounds like the "ip route get" output you get when using dummy0 is
> caused by a fallback on the default route, supposedly on eth0? Can you
> check the exact routes received on server1 with "show ip bgp neighbors
> <neighbor> received-routes" (once you enable "neighbor <neighbor>
> soft-reconfiguration inbound"), and what's installed in the table with
> "ip route"?
>
> Intrigued by this problem, I tried to reproduce it in a lab with
> VirtualBox. I ran into the same problem.
>
> Side note: configuring the loopback IP on the physical interfaces is
> workable if you set it on **all** parallel links. Example with server1:
>
> iface enp3s0f0 inet static
>     address 10.10.100.21/32
> iface enp3s0f1 inet static
>     address 10.10.100.21/32
> iface enp4s0f0 inet static
>     address 10.10.100.21/32
> iface enp4s0f1 inet static
>     address 10.10.100.21/32
>
> This should guarantee that the loopback IP is advertised if any one of the
> 4 links to switch1 and switch2 is up, but I am not sure if that's workable
> for Ceph's listening address.
>
> Cheers,
> Maxime
>
> From: Richard Hesse <richard.he...@weebly.com>
> Date: Thursday 20 April 2017 16:36
> To: Maxime Guyot <maxime.gu...@elits.com>
> Cc: Jan Marquardt <j...@artfiles.de>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Ceph with Clos IP fabric
>
> On Thu, Apr 20, 2017 at 2:13 AM, Maxime Guyot <maxime.gu...@elits.com> wrote:
>
>>> 2) Why did you choose to run the Ceph nodes on loopback interfaces as
>>> opposed to the /24 for the "public" interface?
>>
>> I can't speak for this example, but in a Clos fabric you generally want
>> to assign the routed IPs to a loopback rather than to physical interfaces.
>> This way, if one of the links goes down (e.g. the public interface), the
>> routed IP is still advertised on the other link(s).
>
> That only makes sense if you're running multiple ToR switches per rack for
> the public leaf network. Multiple public ToR switches per rack is not very
> common; most Clos crossbar networks run a single ToR switch. Several guides
> on the topic (including Arista & Cisco) suggest that you use something like
> MLAG in a layer 2 domain between the switches if you need some sort of
> switch redundancy inside the rack. This increases complexity, and most
> people decide that it's not worth it and instead scale out across racks to
> gain the redundancy and survivability that multiple ToR offer.
>
> On Thu, Apr 20, 2017 at 4:04 AM, Jan Marquardt <j...@artfiles.de> wrote:
>
>> Maxime, thank you for clarifying this. Each server is configured like
>> this:
>>
>> lo/dummy0: loopback interface; holds the IP address used with Ceph,
>>            which is announced by BGP into the fabric.
>>
>> enp5s0: management interface, used only for managing the box. There
>>         should not be any Ceph traffic on this one.
>>
>> enp3s0f0: connected to sw01 and used for BGP
>> enp3s0f1: connected to sw02 and used for BGP
>> enp4s0f0: connected to sw01 and used for BGP
>> enp4s0f1: connected to sw02 and used for BGP
>>
>> These four interfaces are supposed to transport the Ceph traffic.
>
> See above. Why are you running multiple public ToR switches in this rack?
> I'd suggest switching them to a single layer 2 domain that participates in
> the Clos fabric as a single unit, or scaling out across racks (preferred).
> Why bother with multiple switches in a rack when you can just use multiple
> racks? That's the beauty of Clos: just add more spines if you need more
> leaf-to-leaf bandwidth.
>
> How many OSDs, servers, and racks are planned for this deployment?
>
> -richard
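For reference, here is a minimal sketch of the host addressing described above, combining Maxime's "same /32 on every parallel link" workaround with the per-uplink VLAN sub-interfaces Aaron uses for the cluster network. Only two of the four uplinks are shown, and the VLAN ID and cluster-network addresses are made-up assumptions, not taken from anyone's actual configuration:

    # /etc/network/interfaces (Debian/ifupdown style, sketch only)
    auto lo
    iface lo inet loopback

    # Ceph "public" /32, set on every uplink so it stays reachable and keeps
    # being advertised as long as any one link is up (Maxime's workaround)
    auto enp3s0f0
    iface enp3s0f0 inet static
        address 10.10.100.21/32

    auto enp3s0f1
    iface enp3s0f1 inet static
        address 10.10.100.21/32

    # Cluster network on a VLAN sub-interface of each uplink, routed
    # separately on the ToRs, as Aaron describes (requires the "vlan"
    # package / 8021q module); VLAN 100 and 10.10.200.0/24 are assumptions
    auto enp3s0f0.100
    iface enp3s0f0.100 inet static
        address 10.10.200.21/32

    auto enp3s0f1.100
    iface enp3s0f1.100 inet static
        address 10.10.200.21/32

With a layout like this, the public/cluster network settings in ceph.conf would point at the 10.10.100.x and 10.10.200.x addresses; as Aaron notes, Ceph has no trouble with the same IP appearing on multiple interfaces.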
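Aaron's remark that you "set static routes over each link and the kernel takes care of the ECMP" could look something like the following on the host. This is only a sketch: the shared 169.254.0.1 gateway follows the link-local convention that came up earlier in the thread (and that the Cumulus redistribute-neighbor docs use), and the interface names are assumptions:

    # One multipath (ECMP) default route with a next hop per uplink; the
    # kernel balances traffic across them and stops using a next hop whose
    # link goes down. "onlink" lets the route use a gateway that is not
    # inside a locally configured subnet.
    ip route add default \
        nexthop via 169.254.0.1 dev enp3s0f0 onlink \
        nexthop via 169.254.0.1 dev enp3s0f1 onlink

The same pattern works for a more specific prefix (e.g. a cluster-network supernet) over the VLAN sub-interfaces if the default route should stay on the management interface.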
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com