I have three comments on our CephFS deployment. Some background first: we have been using CephFS since Giant, initially for not-so-important data, and we are now using it more heavily on Infernalis. We store our own raw data on it through plain POSIX semantics and keep everything as basic as possible: essentially open, read, and write.
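To make "as basic as possible" concrete, here is a minimal sketch of the sort of access we do. The paths are made up, and it assumes a CephFS mount (kernel client or ceph-fuse) at /mnt/cephfs and Python 3 on Linux. It also peeks at the ceph.dir.entries virtual xattr that CephFS exposes on directories, which is a cheap way to spot folders that have grown large (relevant to my first point below):

import os

MOUNT = "/mnt/cephfs"  # example mount point, adjust for your setup
data_dir = os.path.join(MOUNT, "raw", "subject_001")
os.makedirs(data_dir, exist_ok=True)

# Plain POSIX write, exactly as on any local filesystem.
sample = os.path.join(data_dir, "scan.dat")
with open(sample, "wb") as f:
    f.write(b"\x00" * 4096)

# Plain POSIX read back.
with open(sample, "rb") as f:
    blob = f.read()
print("read %d bytes from %s" % (len(blob), sample))

# CephFS publishes per-directory stats as virtual xattrs;
# ceph.dir.entries is the number of entries directly in the directory.
entries = int(os.getxattr(data_dir, "ceph.dir.entries"))
if entries > 5000:
    print("%s has %d entries, expect lookups to feel sluggish" % (data_dir, entries))
else:
    print("%s has %d entries" % (data_dir, entries))

If I remember the attribute name right, the same check works from the shell with getfattr -n ceph.dir.entries <dir>.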
1st: If you have a lot of files or directories in a single folder, lookups can get slow; I would say once you reach about 5,000 entries you can feel the latency. Traditionally this has never been especially fast on regular file systems either, but just be aware of it.

2nd: We do see an increase in parallelization of reading and writing data compared to a traditional spinning-disk RAID file system. I think this is a testament to Ceph.

3rd: When we upgrade an MDS, we basically have to stop all activity on CephFS to restart the MDS. Replaying the journal at startup, if it is large, can eat a lot of memory, and you had better hope you don't hit swap. This does create some downtime for us, but it usually isn't long. I am hoping for more improvements to the MDS, such as HA and various other things, to make it even better.

On Thu, Jun 2, 2016 at 9:11 AM Brady Deetz <bde...@gmail.com> wrote:

> On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <ch...@gol.com> wrote:
>
>>
>> Hello,
>>
>> On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
>>
>> > Question:
>> > I'm curious if there is anybody else out there running CephFS at the
>> > scale I'm planning for. I'd like to know some of the issues you didn't
>> > expect that I should be looking out for. I'd also like to simply see
>> > when CephFS hasn't worked out and why. Basically, give me your war
>> > stories.
>> >
>> Not me, but diligently search the archives, there are people with large
>> CephFS deployments (despite the non-production status when they did them).
>> Also look at the current horror story thread about what happens when you
>> have huge directories.
>>
>> >
>> > Problem Details:
>> > Now that I'm out of my design phase and finished testing on VMs, I'm
>> > ready to drop $100k on a pilot. I'd like to get some sense of confidence
>> > from the community that this is going to work before I pull the trigger.
>> >
>> > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
>> > CephFS by this time next year (hopefully by December). My workload is a
>> > mix of small and very large files (100GB+ in size). We do fMRI analysis
>> > on DICOM image sets as well as other physio data collected from
>> > subjects. We also have plenty of spreadsheets, scripts, etc. Currently
>> > 90% of our analysis is I/O bound and generally sequential.
>> >
>> There are other people here doing similar things (medical institutes,
>> universities), again search the archives and maybe contact them directly.
>>
>> > In deploying Ceph, I am hoping to see more throughput than the 7320 can
>> > currently provide. I'm also looking to get away from traditional
>> > file-systems that require forklift upgrades. That's where Ceph really
>> > shines for us.
>> >
>> > I don't have a total file count, but I do know that we have about 500k
>> > directories.
>> >
>> >
>> > Planned Architecture:
>> >
>> Well, we talked about this 2 months ago, you seem to have changed only a
>> few things.
>> So let's dissect this again...
>>
>> > Storage Interconnect:
>> > Brocade VDX 6940 (40 gig)
>> >
>> Is this a flat (single) network for all the storage nodes?
>> And then from these 40Gb/s switches links to the access switches?
>>
>
> This will start as a single 40Gb/s switch with a single link to each node
> (upgraded in the future to dual-switch + dual-link). The 40Gb/s switch will
> also be connected to several 10Gb/s and 1Gb/s access switches with dual
> 40Gb/s uplinks.
>
> We do intend to segment the public and private networks using VLANs
> untagged at the node.
> There are obviously many subnets on our network. The 40Gb/s switch will
> handle routing for those networks.
>
> You can see list discussion in "Public and Private network over 1
> interface" May 23, 2016 regarding some of this.
>
>
>>
>> > Access Switches for clients (servers):
>> > Brocade VDX 6740 (10 gig)
>> >
>> > Access Switches for clients (workstations):
>> > Brocade ICX 7450
>> >
>> > 3x MON:
>> > 128GB RAM
>> > 2x 200GB SSD for OS
>> > 2x 400GB P3700 for LevelDB
>> > 2x E5-2660v4
>> > 1x Dual Port 40Gb Ethernet
>> >
>> Total overkill in the CPU core arena, fewer but faster cores would be
>> more suited for this task.
>> A 6-8 core, 2.8-3GHz base speed would be nice, alas Intel has nothing
>> like that, the closest one would be the E5-2643v4.
>>
>> Same for RAM, MON processes are pretty frugal.
>>
>> No need for NVMes for the leveldb, use 2x 400GB DC S3710 for OS (and thus
>> the leveldb) and that's being overly generous in the speed/IOPS
>> department.
>>
>> Note also that 40Gb/s isn't really needed here, alas latency and KISS do
>> speak in favor of it, especially if you can afford it.
>>
>
> Noted
>
>
>>
>> > 2x MDS:
>> > 128GB RAM
>> > 2x 200GB SSD for OS
>> > 2x 400GB P3700 for LevelDB (is this necessary?)
>> No, there isn't any persistent data with MDS, unlike what I assumed as
>> well before reading up on it and trying it out for the first time.
>>
>
> That's what I thought. For some reason, my VAR keeps throwing these on
> the config.
>
>
>>
>> > 2x E5-2660v4
>> > 1x Dual Port 40Gb Ethernet
>> >
>> Dedicated MONs/MDS are often a waste, they are suggested to keep people
>> who don't know what they're doing from overloading things.
>>
>> So in your case, I'd (again) suggest to get 3 mixed MON/MDS nodes, make
>> the first one a dedicated MON and give it the lowest IP.
>> HW specs as discussed above, make sure to use DIMMs that allow you to
>> upgrade to 256GB RAM, as the MDS can grow larger than the other Ceph
>> daemons (from my limited experience with CephFS).
>> So:
>>
>> 128GB RAM (expandable to 256GB or more)
>> 2x E5-2643v4
>> 2x 400GB DC S3710
>> 1x Dual Port 40Gb Ethernet
>>
>> > 8x OSD:
>> > 128GB RAM
>> Use your savings above to make that 256GB for great performance
>> improvements, as hot objects stay in memory and so will all dir-entries
>> (in SLAB).
>>
>
> I like this idea.
>
>
>>
>> > 2x 200GB SSD for OS
>> Overkill really. Other than the normally rather terse OSD logs, nothing
>> much will ever be written to them. So 3510s or at most 3610s.
>>
>> > 2x 400GB P3700 for Journals
>> As discussed 2 months ago, this limits you to writes at half (or quarter,
>> depending on your design and if you do LACP, vLAG) of what your network
>> is capable of.
>> OTOH, I wouldn't expect your 24 HDDs to do much better than 2GB/s either
>> (at least with filestore, and bluestore is a year away at best).
>> So good enough, especially if you're read heavy.
>>
>
> Yeah, the thought is that we're going to be close to equilibrium. It's
> not too big a deal to add an extra card, so my plan was to expand to 3 if
> necessary after our pilot project.
>
>
>>
>> > 24x 6TB Enterprise SATA
>> > 2x E5-2660v4
>> > 1x Dual Port 40Gb Ethernet
>>
>> Regards,
>>
>> Christian
>>
>
> As always, I appreciate your comments and time. I'm looking forward to
> joining you and the rest of the community in operating a great Ceph
> environment.
>
>
>> --
>> Christian Balzer           Network/Systems Engineer
>> ch...@gol.com              Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com