On Thu, Jun 2, 2016 at 8:58 PM, Christian Balzer <ch...@gol.com> wrote:

> On Thu, 2 Jun 2016 11:11:19 -0500 Brady Deetz wrote:
>
> > On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <ch...@gol.com> wrote:
> >
> > >
> > > Hello,
> > >
> > > On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
> > >
> > > > Question:
> > > > I'm curious if there is anybody else out there running CephFS at the
> > > > scale I'm planning for. I'd like to know some of the issues you
> > > > didn't expect that I should be looking out for. I'd also like to
> > > > simply see when CephFS hasn't worked out and why. Basically, give me
> > > > your war stories.
> > > >
> > > Not me, but diligently search the archives; there are people with large
> > > CephFS deployments (despite the non-production status when they did
> > > them). Also look at the current horror story thread about what happens
> > > when you have huge directories.
> > >
> > > >
> > > > Problem Details:
> > > > Now that I'm out of my design phase and finished testing on VMs, I'm
> > > > ready to drop $100k on a pilot. I'd like to get some sense of
> > > > confidence from the community that this is going to work before I
> > > > pull the trigger.
> > > >
> > > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320
> > > > with CephFS by this time next year (hopefully by December). My
> > > > workload is a mix of small and very large files (100GB+ in size). We
> > > > do fMRI analysis on DICOM image sets as well as other physio data
> > > > collected from subjects. We also have plenty of spreadsheets,
> > > > scripts, etc. Currently 90% of our analysis is I/O bound and
> > > > generally sequential.
> > > >
> > > There are other people here doing similar things (medical institutes,
> > > universities), again search the archives and maybe contact them
> > > directly.
> > >
> > > > In deploying Ceph, I am hoping to see more throughput than the 7320
> > > > can currently provide. I'm also looking to get away from traditional
> > > > file-systems that require forklift upgrades. That's where Ceph really
> > > > shines for us.
> > > >
> > > > I don't have a total file count, but I do know that we have about
> > > > 500k directories.
> > > >
> > > >
> > > > Planned Architecture:
> > > >
> > > Well, we talked about this 2 months ago; you seem to have changed only
> > > a few things.
> > > So let's dissect this again...
> > >
> > > > Storage Interconnect:
> > > > Brocade VDX 6940 (40 gig)
> > > >
> > > Is this a flat (single) network for all the storage nodes?
> > > And then links from these 40Gb/s switches to the access switches?
> > >
> >
> > This will start as a single 40Gb/s switch with a single link to each node
> > (upgraded in the future to dual-switch + dual-link). The 40Gb/s switch
> > will also be connected to several 10Gb/s and 1Gb/s access switches with
> > dual 40Gb/s uplinks.
> >
> So initially 80Gb/s, and with the 2nd switch probably 160Gb/s, for your
> clients.
> Network wise, your 8 storage servers outstrip that. Actual storage
> bandwidth and IOPS wise, you're looking at 8x2GB/s, or roughly 130Gb/s,
> best case writes, so close to a match.
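
Spelling out the arithmetic there, with the assumption of roughly 2GB/s
of usable write bandwidth per OSD node: 8 nodes x 2GB/s = 16GB/s, or
about 128Gb/s on the wire. The client side starts at 2x 40Gb/s of uplinks
(80Gb/s) and doubles to ~160Gb/s once the second switch and vLAG are in,
so the two sides are in the same ballpark.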
>
> > We do intend to segment the public and private networks using VLANs
> > untagged at the node. There are obviously many subnets on our network.
> > The 40Gb/s switch will handle routing for those networks.
> >
> > You can see the list discussion in "Public and Private network over 1
> > interface", May 23, 2016, regarding some of this.
> >
> And I did comment in that thread, the final one actually. ^o^
>
> Unless you can come up with a _very_ good reason not covered in that
> thread, I'd keep it to one network.
>
> Once the 2nd switch is in place and running vLAG (LACP on your servers)
> your network bandwidth per host VASTLY exceeds that of your storage.
>
>
My theory was that with a single switch, I could QoS traffic for the
private network in case we see massive client I/O at the same time that a
re-weight or similar recovery event is happening. But... I think you're
right. KISS
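
For the re-weight-while-busy scenario, I suspect the first knobs to reach
for are Ceph's own recovery/backfill throttles rather than network QoS
anyway; roughly this sort of thing in ceph.conf (values are placeholders,
not recommendations):

    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1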

My initial KISS thought actually ran the other way: a single network felt
like the less common, and maybe less tested, Ceph configuration. Perhaps
multi-netting is a better compromise. We would still run 2 networks, just
not over separate VLANs.

Terrible idea?
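
To be concrete about what I mean by multi-netting: both subnets would
live untagged on the same interface, with something along these lines in
ceph.conf (addresses are placeholders):

    [global]
    public network  = 10.0.1.0/24
    cluster network = 10.0.2.0/24

versus the single-network case, where "cluster network" is simply left
unset and replication traffic shares the public network.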


> >
> > >
> > > > Access Switches for clients (servers):
> > > > Brocade VDX 6740 (10 gig)
> > > >
> > > > Access Switches for clients (workstations):
> > > > Brocade ICX 7450
> > > >
> > > > 3x MON:
> > > > 128GB RAM
> > > > 2x 200GB SSD for OS
> > > > 2x 400GB P3700 for LevelDB
> > > > 2x E5-2660v4
> > > > 1x Dual Port 40Gb Ethernet
> > > >
> > > Total overkill in the CPU core arena; fewer but faster cores would be
> > > more suited for this task.
> > > A 6-8 core part with a 2.8-3GHz base speed would be nice; alas, Intel
> > > has nothing quite like that, the closest being the E5-2643v4.
> > >
> > > Same for RAM, MON processes are pretty frugal.
> > >
> > > No need for NVMes for the leveldb; use 2x 400GB DC S3710s for the OS
> > > (and thus the leveldb), and even that is being overly generous in the
> > > speed/IOPS department.
> > >
> > > Note also that 40Gb/s isn't really needed here, though latency and KISS
> > > do speak in favor of it, especially if you can afford it.
> > >
> >
> > Noted
> >
> >
> > >
> > > > 2x MDS:
> > > > 128GB RAM
> > > > 2x 200GB SSD for OS
> > > > 2x 400GB P3700 for LevelDB (is this necessary?)
> > > No, an MDS doesn't keep any persistent data locally (its state lives in
> > > the metadata pool in RADOS), contrary to what I assumed myself before
> > > reading up on it and trying it out for the first time.
> > >
> >
> > That's what I thought. For some reason, my VAR keeps throwing these on
> > the config.
> >
> That's their job after all, selling you hardware that you don't need so
> that they can create added value (for themselves). ^o^
>
> >
> > >
> > > > 2x E5-2660v4
> > > > 1x Dual Port 40Gb Ethernet
> > > >
> > > Dedicated MONs/MDS are often a waste; they are suggested mainly to keep
> > > people who don't know what they're doing from overloading things.
> > >
> > > So in your case, I'd (again) suggest getting 3 mixed MON/MDS nodes; make
> > > the first one a dedicated MON and give it the lowest IP (the MON with
> > > the lowest IP becomes the leader).
> > > HW specs as discussed above; make sure to use DIMMs that allow you to
> > > upgrade to 256GB RAM, as an MDS can grow larger than the other Ceph
> > > daemons (from my limited experience with CephFS).
> > > So:
> > >
> > > 128GB RAM (expandable to 256GB or more)
> > > 2x E5-2643v4
> > > 2x 400GB DC S3710
> > > 1x Dual Port 40Gb Ethernet
> > >
> > > > 8x OSD:
> > > > 128GB RAM
> > > Use your savings above to make that 256GB for great performance
> > > improvements, as hot objects stay in memory and so will all dir-entries
> > > (in SLAB).
> > >
> >
> > I like this idea.
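
If we do lean on RAM that heavily, I assume the usual VFS cache tuning
applies to keep dentries/inodes resident, e.g. something like

    vm.vfs_cache_pressure = 10

in sysctl.conf on the OSD nodes; that's an assumption on my part, not
something I've tested yet.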
> >
> >
> > >
> > > > 2x 200GB SSD for OS
> > > Overkill really. Other than the normally rather terse OSD logs, nothing
> > > much will ever be written to them. So 3510s or at most 3610s.
> > >
> > > > 2x 400GB P3700 for Journals
> > > As discussed 2 months ago, this limits you to writes at half (or a
> > > quarter, depending on your design and whether you do LACP/vLAG) of
> > > what your network is capable of.
> > > OTOH, I wouldn't expect your 24 HDDs to do much better than 2GB/s
> > > either (at least with filestore; bluestore is a year away at best).
> > > So good enough, especially if you're read heavy.
> > >
> >
> > Yeah, the thought is that we're going to be close to equilibrium. It's
> > not too big a deal to add an extra card, so my plan was to expand to 3 if
> > necessary after our pilot project.
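
Presumably the layout would be 12 filestore OSDs journaling to each
P3700; with ceph-deploy that would be something along the lines of
(hostname and devices are placeholders):

    ceph-deploy osd prepare osd-01:/dev/sdc:/dev/nvme0n1

so each NVMe ends up holding 12 small journal partitions.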
> >
> Probably not needed, because the moment you're not doing synthetic
> large sequential write-only tests, you will find that random writes and
> reads (these can be offset up to a point by the large RAM) will slow your
> HDDs down well below the speed of the NVMes.
>
> Christian
> >
> > >
> > > > 24x 6TB Enterprise SATA
> > > > 2x E5-2660v4
> > > > 1x Dual Port 40Gb Ethernet
> > >
> > > Regards,
> > >
> > > Christian
> > >
> >
> > As always, I appreciate your comments and time. I'm looking forward to
> > joining you and the rest of the community in operating a great Ceph
> > environment.
> >
> >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
