Discussing DB size requirements without knowing the exact cluster requirements doesn't work.
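What matters in practice is what fraction of the stored data actually ends up as DB metadata, and that fraction depends heavily on the workload. As a rough, purely illustrative sketch of the arithmetic (the drive and partition sizes below are just examples):

    # Illustrative only: DB partition size as a fraction of raw OSD capacity.
    # Decimal units (1 TB = 1000 GB) for simplicity.
    osd_tb = 12
    for db_gb in (30, 60, 300):
        pct = db_gb / (osd_tb * 1000) * 100
        print(f"{db_gb:>3} GB DB on a {osd_tb} TB OSD = {pct:.2f}% of raw capacity")
    # 30 GB -> 0.25%, 60 GB -> 0.50%, 300 GB -> 2.50%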
Here are some real-world examples:

cluster1: CephFS, mostly large files, replicated x3
  0.2% used for metadata
cluster2: radosgw, mix between replicated and erasure, mixed file sizes (lots of tiny files, though)
  1.3% used for metadata

The 4%-10% figures quoted in the docs are *not based on any actual usage data*; they are just an absolute worst-case estimate. A 30 GB DB partition on a 12 TiB disk is 0.25% even if the disk is completely full (which it won't be), and that is sufficient for many use cases. I think cluster2 with 1.3% is one of the highest metadata usages I've seen on an actual production cluster. I can think of a setup that probably has more, but I haven't ever explicitly checked it.

The restriction to 3/30/300 is temporary and might be fixed in a future release, so I'd just partition that disk into X DB devices.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Jan 16, 2020 at 10:28 PM Bastiaan Visser <basti...@msbv.nl> wrote:

> Dave made a good point: WAL + DB might end up a little over 60G, so I would probably go with ~70 GB partitions/LVs per OSD in your case (if the NVMe drive is smart enough to spread the writes over all available capacity; most recent NVMe's are). I have not yet seen a WAL larger than, or even close to, a gigabyte.
>
> We don't even think about erasure-coded pools on clusters with less than 6 nodes (spindles; full SSD is another story). EC pools need more processing resources. We usually settle for 1 GB of RAM per TB of storage on replicated-only clusters, but when EC pools are involved we add at least 50% to that. Also make sure your processors are up for it.
>
> Do not base your calculations on a healthy cluster -> build to fail. How long are you willing to be in a degraded state after a node failure? Especially with many large spindles, recovery time might be way longer than you think: 12 * 12TB is 144TB of storage; on a 4+2 EC pool you might end up with over 200 TB of traffic, and on a 10Gig network that's roughly two and a half days to recover, and that is only if your processors are not a bottleneck due to EC parity calculations and all capacity is available for recovery (which is usually not the case; there is still production traffic that will eat up resources).
>
> On Thu, 16 Jan 2020 at 21:30, <dhils...@performair.com> wrote:
>
>> Dave;
>>
>> I don't like reading inline responses, so...
>>
>> I have zero experience with EC pools, so I won't pretend to give advice in that area.
>>
>> I would think that a small NVMe for the DB would be better than nothing, but I don't know.
>>
>> Once I got the hang of building clusters, it was relatively easy to wipe a cluster out and rebuild it. Perhaps you could take some time and benchmark different configurations?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA
>> Director – Information Technology
>> Perform Air International Inc.
>> dhils...@performair.com
>> www.PerformAir.com
>>
>> -----Original Message-----
>> From: Dave Hall [mailto:kdh...@binghamton.edu]
>> Sent: Thursday, January 16, 2020 1:04 PM
>> To: Dominic Hilsbos; ceph-users@lists.ceph.com
>> Subject: Re: [External Email] RE: [ceph-users] Beginner questions
>>
>> Dominic,
>>
>> We ended up with a 1.6TB PCIe NVMe in each node. For 8 drives this worked out to a DB size of something like 163GB per OSD. Allowing for expansion to 12 drives brings it down to 124GB.
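>>
>> (As a rough sketch of that division, assuming the full advertised 1.6 TB is split evenly and counting in GiB; the exact per-OSD figure depends on how much of the device is set aside:)
>>
>>     # Illustrative only: splitting one shared NVMe into per-OSD DB volumes.
>>     nvme_gib = 1.6 * 1000**4 / 1024**3   # 1.6 TB (decimal) is about 1490 GiB
>>     for osds in (8, 12):
>>         print(f"{osds:>2} OSDs -> about {nvme_gib / osds:.0f} GiB of DB per OSD")
>>     # 8 OSDs -> ~186 GiB, 12 OSDs -> ~124 GiB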
>> So maybe just put the WALs on NVMe and leave the DBs on the platters?
>>
>> Understood that we will want to move to more nodes rather than more drives per node, but our funding is grant and donation based, so we may end up adding drives in the short term. The long-term plan is to get to separate MON/MGR/MDS nodes and 10s of OSD nodes.
>>
>> Due to our current low node count, we are considering erasure-coded PGs rather than replicated in order to maximize usable space. Any guidelines or suggestions on this?
>>
>> Also, sorry for not replying inline. I haven't done this much in a while - I'll figure it out.
>>
>> Thanks.
>>
>> -Dave
>>
>> On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
>> > Dave;
>> >
>> > I'd like to expand on this answer, briefly...
>> >
>> > The information in the docs is wrong. There have been many discussions about changing it, but no good alternative has been suggested, so it hasn't been changed.
>> >
>> > The third-party project that Ceph's BlueStore uses for its database (RocksDB) apparently only uses DB sizes of 3GB, 30GB, and 300GB. As mentioned below, when RocksDB executes a compact operation it creates a new blob of the same target size and writes the compacted data into it; this doubles the necessary space. In addition, BlueStore places its Write-Ahead Log (WAL) into the fastest storage available to the OSD daemon, i.e. NVMe if available. Since this is done before the first compaction is requested, the WAL can force compaction onto slower storage.
>> >
>> > Thus, the numbers I've had floating around in my head for our next cluster are 7GB, 66GB, and 630GB. From all the discussion I've seen around RocksDB, those seem like good, common-sense targets. Pick the largest one that works for your setup.
>> >
>> > All that said... you would really want to pair a 600GB+ NVMe with 12TB drives, otherwise your DB is almost guaranteed to overflow onto the spinning drive and affect performance.
>> >
>> > I became aware of most of this after we planned our clusters, so I haven't tried it; YMMV.
>> >
>> > One final note: more hosts and more spindles usually translate into better cluster-wide performance. I can't predict how the relatively low client counts you're suggesting would impact that.
>> >
>> > Thank you,
>> >
>> > Dominic L. Hilsbos, MBA
>> > Director – Information Technology
>> > Perform Air International Inc.
>> > dhils...@performair.com
>> > www.PerformAir.com
>> >
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bastiaan Visser
>> > Sent: Thursday, January 16, 2020 10:55 AM
>> > To: Dave Hall
>> > Cc: ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] Beginner questions
>> >
>> > I would definitely go for Nautilus. There are quite a few optimizations that went in after Mimic.
>> >
>> > BlueStore DB size usually ends up at either 30 or 60 GB. 30 GB is one of the sweet spots during normal operation, but during compaction Ceph writes the new data before removing the old, hence the 60 GB. The next sweet spot is 300/600 GB; any size between 60 and 300 GB will never be used.
>> >
>> > DB usage also depends on how Ceph is used; object storage, for example, is known to use a lot more DB space than RBD images.
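>> >
>> > (A toy illustration of the sweet-spot rule of thumb above - not how BlueStore actually allocates space - assuming usable level sets of roughly 3/30/300 GB and 2x headroom so compaction can write the new data before deleting the old:)
>> >
>> >     # Illustrative only: what the sweet-spot rule of thumb says you can
>> >     # actually count on for a given DB partition size.
>> >     def usable_db_gb(partition_gb):
>> >         for level_set_gb in (300, 30, 3):          # usable sizes, largest first
>> >             if partition_gb >= 2 * level_set_gb:   # 2x for compaction headroom
>> >                 return level_set_gb
>> >         return 0
>> >
>> >     for size_gb in (60, 100, 266, 600):
>> >         print(f"{size_gb:>3} GB partition -> ~{usable_db_gb(size_gb)} GB usable")
>> >     # 60, 100 and 266 GB all come out at ~30 GB usable; anything between
>> >     # the sweet spots buys you nothing extra.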
>> >
>> > On Thu, 16 Jan 2020 at 17:46, Dave Hall <kdh...@binghamton.edu> wrote:
>> >
>> > Hello all.
>> >
>> > Sorry for the beginner questions...
>> >
>> > I am in the process of setting up a small (3 nodes, 288TB) Ceph cluster to store some research data. It is expected that this cluster will grow significantly in the next year, possibly to multiple petabytes and 10s of nodes. At this time I'm expecting a relatively small number of clients, with only one or two actively writing collected data - albeit at a high volume per day.
>> >
>> > Currently I'm deploying on Debian 9 via ceph-ansible.
>> >
>> > Before I put this cluster into production I have a couple of questions based on my experience to date:
>> >
>> > Luminous, Mimic, or Nautilus? I need stability for this deployment, so I am sticking with Debian 9 since Debian 10 is fairly new, and I have been hesitant to go with Nautilus. Yet Mimic seems to have had a hard road on Debian but for the efforts at Croit.
>> > • Statements on the Releases page are now making more sense to me, but I would like to confirm that Nautilus is the right choice at this time?
>> >
>> > Bluestore DB size: My nodes currently have 8 x 12TB drives (plus 4 empty bays) and a PCIe NVMe drive. If I understand the suggested calculation correctly, the DB size for a 12 TB Bluestore OSD would be 480GB. If my NVMe isn't big enough to provide this size, should I skip provisioning the DBs on the NVMe, or should I give each OSD 1/12th of what I have available? Also, should I try to shift budget a bit to get more NVMe as soon as I can, and redo the OSDs when sufficient NVMe is available?
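>> >
>> > (For reference, a quick sketch of the arithmetic behind those numbers; figures are illustrative and use decimal units:)
>> >
>> >     # Illustrative only: the 4% guideline vs. an even split of the shared NVMe.
>> >     osd_tb, nvme_tb, bays = 12, 1.6, 12
>> >     print(f"4% of {osd_tb} TB = {0.04 * osd_tb * 1000:.0f} GB of DB per OSD")          # 480 GB
>> >     print(f"1/{bays} of {nvme_tb} TB = {nvme_tb * 1000 / bays:.0f} GB of DB per OSD")  # ~133 GB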
>> >
>> > Thanks.
>> > -Dave

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com