On Thu, Jun 2, 2016 at 8:58 PM, Christian Balzer <ch...@gol.com> wrote:
> On Thu, 2 Jun 2016 11:11:19 -0500 Brady Deetz wrote:
>
> > On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <ch...@gol.com> wrote:
> > >
> > > Hello,
> > >
> > > On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
> > >
> > > > Question:
> > > > I'm curious if there is anybody else out there running CephFS at the
> > > > scale I'm planning for. I'd like to know some of the issues you
> > > > didn't expect that I should be looking out for. I'd also like to
> > > > simply see when CephFS hasn't worked out and why. Basically, give me
> > > > your war stories.
> > > >
> > > Not me, but diligently search the archives; there are people with large
> > > CephFS deployments (despite the non-production status when they did
> > > them). Also look at the current horror-story thread about what happens
> > > when you have huge directories.
> > >
> > > > Problem Details:
> > > > Now that I'm out of my design phase and finished testing on VMs, I'm
> > > > ready to drop $100k on a pilot. I'd like to get some sense of
> > > > confidence from the community that this is going to work before I
> > > > pull the trigger.
> > > >
> > > > I'm planning to replace my 110-disk, 300TB (usable) Oracle ZFS 7320
> > > > with CephFS by this time next year (hopefully by December). My
> > > > workload is a mix of small and very large files (100GB+ in size). We
> > > > do fMRI analysis on DICOM image sets as well as other physio data
> > > > collected from subjects. We also have plenty of spreadsheets,
> > > > scripts, etc. Currently 90% of our analysis is I/O bound and
> > > > generally sequential.
> > > >
> > > There are other people here doing similar things (medical institutes,
> > > universities); again, search the archives and maybe contact them
> > > directly.
> > >
> > > > In deploying Ceph, I am hoping to see more throughput than the 7320
> > > > can currently provide. I'm also looking to get away from traditional
> > > > file systems that require forklift upgrades. That's where Ceph
> > > > really shines for us.
> > > >
> > > > I don't have a total file count, but I do know that we have about
> > > > 500k directories.
> > > >
> > > >
> > > > Planned Architecture:
> > > >
> > > Well, we talked about this 2 months ago and you seem to have changed
> > > only a few things, so let's dissect this again...
> > >
> > > > Storage Interconnect:
> > > > Brocade VDX 6940 (40 gig)
> > > >
> > > Is this a flat (single) network for all the storage nodes?
> > > And then links from these 40Gb/s switches to the access switches?
> > >
> > This will start as a single 40Gb/s switch with a single link to each
> > node (upgraded in the future to dual-switch + dual-link). The 40Gb/s
> > switch will also be connected to several 10Gb/s and 1Gb/s access
> > switches with dual 40Gb/s uplinks.
> >
> So initially 80Gb/s, and with the 2nd switch probably 160Gb/s, for your
> clients.
> Network-wise your 8 storage servers outstrip that; actual storage
> bandwidth and IOPS-wise you're looking at 8x 2GB/s, aka 160Gb/s, best-case
> writes, so a match.
>
> > We do intend to segment the public and private networks using VLANs
> > untagged at the node. There are obviously many subnets on our network.
> > The 40Gb/s switch will handle routing for those networks.
> >
> > You can see the list discussion in "Public and Private network over 1
> > interface", May 23, 2016, regarding some of this.
> >
> And I did comment in that thread, the final one actually. ^o^
>
> Unless you can come up with a _very_ good reason not covered in that
> thread, I'd keep it to one network.
>
> Once the 2nd switch is in place and running vLAG (LACP on your servers),
> your network bandwidth per host VASTLY exceeds that of your storage.
>

My theory was that with a single switch, I could QoS traffic for the
private network in case we see massive client I/O at the same time that a
re-weight or something like that is happening. But... I think you're
right. KISS. My initial KISS-driven thought was actually the opposite of a
single network, since that is the alternate and maybe less tested Ceph
configuration. Perhaps multi-netting is a better compromise: we'd still
run 2 networks, but not over separate VLANs. Terrible idea?
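Either way, my understanding is that the single-network layout boils down
to only setting a public network in ceph.conf. A minimal sketch of what
I'd be running (the subnets are placeholders, not our real addressing):

    [global]
        # flat network: client traffic, MON/MDS traffic and OSD
        # replication/heartbeats all share this one subnet
        public network = 10.10.0.0/24

        # the split setup we'd be giving up; cluster network would put
        # OSD-to-OSD replication/recovery traffic on its own segment
        #cluster network = 10.10.1.0/24

With no cluster network defined there is only one path to monitor (or
QoS), which I suppose is the whole point.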
> > > > Access Switches for clients (servers):
> > > > Brocade VDX 6740 (10 gig)
> > > >
> > > > Access Switches for clients (workstations):
> > > > Brocade ICX 7450
> > > >
> > > > 3x MON:
> > > > 128GB RAM
> > > > 2x 200GB SSD for OS
> > > > 2x 400GB P3700 for LevelDB
> > > > 2x E5-2660v4
> > > > 1x Dual Port 40Gb Ethernet
> > > >
> > > Total overkill in the CPU core arena; fewer but faster cores would be
> > > more suited for this task.
> > > A 6-8 core, 2.8-3GHz base speed would be nice; alas, Intel has nothing
> > > like that, the closest one would be the E5-2643v4.
> > >
> > > Same for RAM, MON processes are pretty frugal.
> > >
> > > No need for NVMes for the leveldb; use 2x 400GB DC S3710 for the OS
> > > (and thus the leveldb) and that's being overly generous in the
> > > speed/IOPS department.
> > >
> > > Note also that 40Gb/s isn't really needed here, but latency and KISS
> > > do speak in favor of it, especially if you can afford it.
> > >
> > Noted
> >
> > > > 2x MDS:
> > > > 128GB RAM
> > > > 2x 200GB SSD for OS
> > > > 2x 400GB P3700 for LevelDB (is this necessary?)
> > > >
> > > No, there isn't any persistent data with MDS, unlike what I assumed as
> > > well before reading up on it and trying it out for the first time.
> > >
> > That's what I thought. For some reason, my VAR keeps throwing these on
> > the config.
> >
> That's their job after all, selling you hardware that you don't need so
> that they can create added value (for themselves). ^o^
>
> > > > 2x E5-2660v4
> > > > 1x Dual Port 40Gb Ethernet
> > > >
> > > Dedicated MONs/MDS are often a waste; they are suggested to keep
> > > people who don't know what they're doing from overloading things.
> > >
> > > So in your case, I'd (again) suggest getting 3 mixed MON/MDS nodes:
> > > make the first one a dedicated MON and give it the lowest IP.
> > > HW specs as discussed above; make sure to use DIMMs that allow you to
> > > upgrade to 256GB RAM, as the MDS can grow larger than the other Ceph
> > > daemons (from my limited experience with CephFS).
> > > So:
> > >
> > > 128GB RAM (expandable to 256GB or more)
> > > 2x E5-2643v4
> > > 2x 400GB DC S3710
> > > 1x Dual Port 40Gb Ethernet
> > >
> > > > 8x OSD:
> > > > 128GB RAM
> > > Use your savings above to make that 256GB for great performance
> > > improvements, as hot objects stay in memory and so will all
> > > dir-entries (in SLAB).
> > >
> > I like this idea.
> >
> > > > 2x 200GB SSD for OS
> > > Overkill, really. Other than the normally rather terse OSD logs,
> > > nothing much will ever be written to them. So 3510s or at most 3610s.
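Coming back to the mixed MON/MDS suggestion: if we go that route, I assume
the ceph.conf side would look roughly like the sketch below (hostnames and
addresses are made up, and most of this would normally be generated by the
deployment tooling). The mds cache size line is just to show where the
extra RAM would go; the actual value would need testing against our file
mix.

    [mon.mm1]
        host = ceph-mm1              # dedicated MON, lowest IP
        mon addr = 10.10.0.11:6789

    [mon.mm2]
        host = ceph-mm2              # MON + MDS
        mon addr = 10.10.0.12:6789

    [mds.mm2]
        host = ceph-mm2

    [mds]
        # default mds cache size is 100000 inodes; raising it is how the
        # 128-256GB of RAM actually gets used (illustrative, not tested)
        mds cache size = 4000000

ceph-mm3 would mirror ceph-mm2. Is that the layout you had in mind?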
> > > > 2x 400GB P3700 for Journals
> > > >
> > > As discussed 2 months ago, this limits you to writes at half (or a
> > > quarter, depending on your design and whether you do LACP, vLAG) of
> > > what your network is capable of.
> > > OTOH, I wouldn't expect your 24 HDDs to do much better than 2GB/s
> > > either (at least with filestore; bluestore is a year away at best).
> > > So good enough, especially if you're read-heavy.
> > >
> > Yeah, the thought is that we're going to be close to equilibrium. It's
> > not too big a deal to add an extra card, so my plan was to expand to 3
> > if necessary after our pilot project.
> >
> Probably not needed, because the moment you're not doing synthetic,
> large, sequential-write-only tests you will find that random writes and
> reads (these can be offset up to a point by the large RAM) will slow
> your HDDs down well below the speed of the NVMes.
>
> Christian
>
> > > > 24x 6TB Enterprise SATA
> > > > 2x E5-2660v4
> > > > 1x Dual Port 40Gb Ethernet
> > >
> > > Regards,
> > >
> > > Christian
> > >
> > As always, I appreciate your comments and time. I'm looking forward to
> > joining you and the rest of the community in operating a great Ceph
> > environment.
> >
> > > --
> > > Christian Balzer           Network/Systems Engineer
> > > ch...@gol.com              Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> >
>
> --
> Christian Balzer           Network/Systems Engineer
> ch...@gol.com              Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
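P.S. For completeness, this is roughly how I plan to carve up the P3700s
for journals on each OSD node (device names are placeholders; 24 HDDs
split across the 2 NVMes means 12 journal partitions per card). My
understanding is that ceph-disk creates the journal partition itself when
handed the whole NVMe device; please correct me if that's wrong.

    [osd]
        # filestore journal size per OSD, in MB; 12x 20GB fits easily on
        # a 400GB P3700 and is generous for 40GbE bursts
        osd journal size = 20480

    # one OSD per HDD, journal carved out of the NVMe
    ceph-disk prepare /dev/sdb /dev/nvme0n1
    ceph-disk prepare /dev/sdc /dev/nvme0n1
    # ...and so on; the second half of the HDDs journal to /dev/nvme1n1

Happy to hear it if 20GB journals are overkill for a mostly read-heavy,
sequential workload.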