Out of curiosity, why are you taking a scale-up approach to building your
Ceph clusters instead of a scale-out approach? Ceph has traditionally been
geared towards a simple, scale-out, shared-nothing mindset. These dual-ToR
deployments remind me of something from EMC, not Ceph. I'm really curious,
as I'd rather have 5-6 racks with a single ToR switch each than three racks
with dual ToRs. Is there a specific application or requirement? It's
definitely adding a lot of complexity; I'm just wondering what the payoff is.

Also, why are you putting your "cluster network" on the same physical
interfaces but on separate VLANs? Traffic shaping/policing? What's your
link speed there on the hosts? (25/40 Gbps?)

On Sat, Apr 22, 2017 at 12:13 PM, Aaron Bassett <aaron.bass...@nantomics.com> wrote:

> FWIW, I use a CLOS fabric with layer 3 right down to the hosts and
> multiple ToRs to enable HA/ECMP to each node. I'm using Cumulus Linux's
> "redistribute neighbor" feature, which advertises a /32 for any ARP'ed
> neighbor. I set up the hosts with an IP on each physical interface and on
> an aliased loopback: lo:0. I handle the separate cluster network by adding
> a VLAN to each interface and routing those separately on the ToRs, with ACLs
> to keep the traffic apart.
>
> Their documentation may help clarify a bit:
> https://docs.cumulusnetworks.com/display/DOCS/Redistribute+Neighbor#RedistributeNeighbor-ConfiguringtheHost(s)
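>
> To make that a bit more concrete, a stripped-down /etc/network/interfaces
> for one node looks roughly like the following. Treat it as a sketch: the
> interface names, addresses and VLAN ID are placeholders, not my production
> values.
>
>     auto lo:0
>     iface lo:0 inet static
>         # Ceph public IP, picked up by the ToRs via "redistribute neighbor"
>         address 192.0.2.11/32
>
>     auto enp3s0f0
>     iface enp3s0f0 inet static
>         # same /32 repeated on the uplink so the ToR learns it via ARP
>         address 192.0.2.11/32
>
>     auto enp3s0f0.100
>     iface enp3s0f0.100 inet static
>         # cluster network: tagged VLAN on the same physical link
>         address 198.51.100.11/32
>
> (Repeat the last two stanzas for each uplink.)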
>
> Honestly, the trickiest part is getting the routing on the hosts right: you
> essentially set static routes over each link, and the kernel takes care of
> the ECMP.
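>
> For example, something along these lines, where the interface names and
> gateway addresses are just placeholders for your uplinks and ToRs:
>
>     ip route add default \
>         nexthop via 10.255.1.1 dev enp3s0f0 onlink \
>         nexthop via 10.255.2.1 dev enp3s0f1 onlink \
>         nexthop via 10.255.1.1 dev enp4s0f0 onlink \
>         nexthop via 10.255.2.1 dev enp4s0f1 onlink
>
> With all four nexthops installed as one multipath route, the kernel
> load-balances flows across them.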
>
> I understand this is a bit different from your setup, but Ceph has no
> trouble at all with the IPs on multiple interfaces.
>
> Aaron
>
> Date: Sat, 22 Apr 2017 17:37:01 +0000
> From: Maxime Guyot <maxime.gu...@elits.com>
> To: Richard Hesse <richard.he...@weebly.com>, Jan Marquardt
> <j...@artfiles.de>
> Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Ceph with Clos IP fabric
>
> Hi,
>
> > That only makes sense if you're running multiple ToR switches per rack for
> > the public leaf network. Multiple public ToR switches per rack is not very
> > common; most Clos crossbar networks run a single ToR switch. Several
> > guides on the topic (including Arista & Cisco) suggest that you use
> > something like MLAG in a layer 2 domain between the switches if you need
> > some sort of switch redundancy inside the rack. This increases complexity,
> > and most people decide that it's not worth it and instead scale out across
> > racks to gain the redundancy and survivability that multiple ToR offer.
>
> If you use MLAG for L2 redundancy, you'll still want 2 BGP sessions for L3
> redundancy, so why not skip the MLAG altogether and terminate your
> BGP sessions on each ToR?
>
> Judging by the routes (169.254.0.1), you are using BGP unnumbered?
>
> It sounds like the 'ip route get' output you get when using dummy0 is
> caused by a fallback on the default route, presumably via eth0? Can you
> check the exact routes received on server1 with 'show ip bgp neighbors
> <neighbor> received-routes' (once you enable 'neighbor <neighbor>
> soft-reconfiguration inbound'), and what is installed in the kernel table
> with 'ip route'?
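>
> For reference, with Cumulus Quagga / FRR that would look something like the
> following (the ASN and the neighbor/interface name here are placeholders):
>
>     vtysh
>     server1# configure terminal
>     server1(config)# router bgp 65001
>     server1(config-router)# neighbor enp3s0f0 soft-reconfiguration inbound
>     server1(config-router)# end
>     server1# show ip bgp neighbors enp3s0f0 received-routes
>
> and then compare with what the kernel actually installed:
>
>     ip route
>     ip route get <peer loopback IP>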
>
>
> Intrigued by this problem, I tried to reproduce it in a lab with
> VirtualBox, and I ran into the same problem.
>
> Side note: Configuring the loopback IP on the physical interfaces is
> workable if you set it on **all** parallel links. Example with server1:
>
> iface enp3s0f0 inet static
>     address 10.10.100.21/32
> iface enp3s0f1 inet static
>     address 10.10.100.21/32
> iface enp4s0f0 inet static
>     address 10.10.100.21/32
> iface enp4s0f1 inet static
>     address 10.10.100.21/32
>
> This should guarantee that the loopback IP is advertised as long as at least
> one of the 4 links to switch1 and switch2 is up, but I am not sure whether
> that is workable for Ceph's listening address.
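>
> If someone wants to test that, I suppose the daemons could also be pinned to
> the loopback address explicitly in each host's ceph.conf (untested):
>
>     public addr = 10.10.100.21
>     cluster addr = 10.10.100.21
>
> instead of relying on the public network / cluster network settings to pick
> the interface.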
>
>
> Cheers,
> Maxime
>
> From: Richard Hesse <richard.he...@weebly.com>
> Date: Thursday 20 April 2017 16:36
> To: Maxime Guyot <maxime.gu...@elits.com>
> Cc: Jan Marquardt <j...@artfiles.de>, "ceph-users@lists.ceph.com" <
> ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Ceph with Clos IP fabric
>
> On Thu, Apr 20, 2017 at 2:13 AM, Maxime Guyot <maxime.gu...@elits.com> wrote:
>
> 2) Why did you choose to run the ceph nodes on loopback interfaces as
> opposed to the /24 for the "public" interface?
>
> I can't speak for this example, but in a Clos fabric you generally want to
> assign the routed IPs to a loopback rather than to the physical interfaces.
> This way, if one of the links goes down (e.g. the public interface), the
> routed IP is still advertised on the other link(s).
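>
> For illustration, the pattern is roughly the following; the ASN and
> interface names are made up, and it assumes BGP unnumbered on Cumulus
> Quagga / FRR:
>
>     # /etc/network/interfaces
>     auto lo:0
>     iface lo:0 inet static
>         address 10.10.100.21/32
>
>     # bgpd
>     router bgp 65001
>      network 10.10.100.21/32
>      neighbor enp3s0f0 interface remote-as external
>      neighbor enp3s0f1 interface remote-as external
>
> The /32 then stays reachable through whichever uplinks still have a BGP
> session up.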
>
> That only makes sense if you're running multiple ToR switches per rack for
> the public leaf network. Multiple public ToR switches per rack is not very
> common; most Clos crossbar networks run a single ToR switch. Several guides
> on the topic (including Arista & Cisco) suggest that you use something like
> MLAG in a layer 2 domain between the switches if you need some sort of
> switch redundancy inside the rack. This increases complexity, and most
> people decide that it's not worth it and instead scale out across racks to
> gain the redundancy and survivability that multiple ToR offer.
>
> On Thu, Apr 20, 2017 at 4:04 AM, Jan Marquardt <j...@artfiles.de> wrote:
>
> Maxime, thank you for clarifying this. Each server is configured like this:
>
> lo/dummy0: Loopback interface; holds the IP address used with Ceph,
> which is announced by BGP into the fabric.
>
> enp5s0: Management Interface, which is used only for managing the box.
> There should not be any Ceph traffic on this one.
>
> enp3s0f0: connected to sw01 and used for BGP
> enp3s0f1: connected to sw02 and used for BGP
> enp4s0f0: connected to sw01 and used for BGP
> enp4s0f1: connected to sw02 and used for BGP
>
> These four interfaces are supposed to transport the Ceph traffic.
>
> See above. Why are you running multiple public ToR switches in this rack?
> I'd suggest either switching them to a single layer 2 domain that
> participates in the Clos fabric as a single unit, or scaling out across
> racks (preferred). Why bother with multiple switches in a rack when you can
> just use multiple racks? That's the beauty of Clos: just add more spines if
> you need more leaf-to-leaf bandwidth.
>
> How many OSDs, servers, and racks are planned for this deployment?
>
> -richard
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
