There is no difference in allocation between replication and EC. If the failure domain is host, one OSD per host is used for a PG. So if you use a 2+1 EC profile with a host failure domain, you need 3 hosts for a healthy cluster. The pool will go read-only when you have a failure (host or disk), or when you are doing maintenance on a node (reboot). On a node failure there will be no rebuilding, since there is no place to find a 3rd OSD for a PG, so you'll have to fix or replace the node before any writes will be accepted.
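The placement constraint above can be sketched as a toy calculation (these helpers are illustrative, not a Ceph API): with a host failure domain, each of the k+m chunks of a PG must land on a distinct host, and Ceph's default min_size for an EC pool is k+1, which is why a 2+1 pool stops accepting writes as soon as one host is down.

```python
# Illustrative helpers, not a Ceph API. With failure domain = host,
# each of the k+m chunks of a PG lands on a distinct host.
def ec_hosts_needed(k, m):
    healthy = k + m           # minimum hosts for a healthy pool
    self_healing = k + m + 1  # a spare host to rebuild a lost chunk onto
    return healthy, self_healing

def writable_after_failures(k, m, failed_hosts):
    # Default min_size for EC pools is k+1: writes stop once fewer
    # than k+1 chunks of a PG remain available.
    return (k + m - failed_hosts) >= k + 1

print(ec_hosts_needed(2, 1))             # (3, 4)
print(writable_after_failures(2, 1, 1))  # False -- pool goes read-only
```

With 2+1 on exactly 3 hosts there is no 4th host to rebuild onto, which matches the "no rebuilding on node failure" point above.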
So yes, you can do a 2+1 EC pool on 3 nodes, but you are paying the price in reliability, flexibility, and possibly performance. The only way to really know the latter is benchmarking with your setup.

I think you will be fine on the hardware side. Memory recommendations swing between 512 MB and 1 GB per TB of storage; I usually go with 1 GB, but I never use disks larger than 4 TB. On the CPU I always try to have a few more cores than I have OSDs in a machine, so 16 is fine in your case.

On Fri, Jan 17, 2020, 03:29 Dave Hall <kdh...@binghamton.edu> wrote:
> Bastiaan,
>
> Regarding EC pools: Our concern at 3 nodes is that 2-way replication
> seems risky - if the two copies don't match, which one is corrupted?
> However, 3-way replication on a 3-node cluster triples the price per TB.
> Doing EC pools that are the equivalent of RAID-5 2+1 seems like the
> right place to start as far as maximizing capacity is concerned,
> although I do understand the potential time involved in rebuilding a
> 12 TB drive. Early on I'd be more concerned about a drive failure than
> about a node failure.
>
> Regarding the hardware, our nodes are single-socket EPYC 7302 (16
> cores, 32 threads) with 128 GB RAM. From what I recall reading, I
> think the RAM, at least, is a bit higher than recommended.
>
> Question: Does a PG (EC or replicated) span multiple drives per node?
> I haven't got to the point of understanding this part yet, so pardon
> the totally naive question. I'll probably be conversant on this by
> Monday.
>
> -Dave
>
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
>
> On 1/16/2020 4:27 PM, Bastiaan Visser wrote:
> > Dave made a good point: WAL + DB might end up a little over 60 GB,
> > so I would probably go with ~70 GB partitions/LVs per OSD in your
> > case (if the NVMe drive is smart enough to spread the writes over
> > all available capacity; most recent NVMes are). I have not yet seen
> > a WAL larger than, or even close to, a gigabyte.
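The rules of thumb above (roughly 1 GB of RAM per TB of storage, a few more cores than OSDs per machine) can be turned into a quick sizing check. The helper below is a sketch; the 4-spare-core margin is an assumed interpretation of "a few more cores":

```python
# Rule-of-thumb node sizing from the thread (assumptions, not Ceph
# defaults): ~1 GiB RAM per TB of OSD storage, a few cores above the
# OSD count.
def node_sizing(osds, drive_tb, ram_per_tb_gib=1.0, spare_cores=4):
    ram_gib = osds * drive_tb * ram_per_tb_gib
    cores = osds + spare_cores
    return ram_gib, cores

ram, cores = node_sizing(osds=12, drive_tb=12)
print(ram)    # 144.0 GiB -- slightly above the 128 GB in Dave's nodes
print(cores)  # 16 -- matching the 16-core EPYC 7302
```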
> > We don't even think about EC-coded pools on clusters with fewer
> > than 6 nodes (spindles; full SSD is another story). EC pools need
> > more processing resources. We usually settle for 1 GB per TB of
> > storage on replicated-only clusters, but when EC pools are involved,
> > we add at least 50% to that. Also make sure your processors are up
> > for it.
> >
> > Do not base your calculations on a healthy cluster -> build to fail.
> > How long are you willing to be in a degraded state on node failure?
> > Especially when using many larger spindles, recovery time might be
> > way longer than you think. 12 * 12 TB is 144 TB of storage; on a 4+2
> > EC pool you might end up with over 200 TB of recovery traffic, and
> > on a 10 Gbit network that's roughly two and a half days to recover -
> > and that's if your processors are not a bottleneck due to EC parity
> > calculations and all capacity is available for recovery (which is
> > usually not the case; there is still production traffic that will
> > eat up resources).
> >
> > On Thu, Jan 16, 2020 at 21:30, <dhils...@performair.com> wrote:
>> Dave;
>>
>> I don't like reading inline responses, so...
>>
>> I have zero experience with EC pools, so I won't pretend to give
>> advice in that area.
>>
>> I would think that a small NVMe for the DB would be better than
>> nothing, but I don't know.
>>
>> Once I got the hang of building clusters, it was relatively easy to
>> wipe a cluster out and rebuild it. Perhaps you could take some time
>> and benchmark different configurations?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA
>> Director – Information Technology
>> Perform Air International Inc.
>> dhils...@performair.com
>> www.PerformAir.com
>>
>> -----Original Message-----
>> From: Dave Hall [mailto:kdh...@binghamton.edu]
>> Sent: Thursday, January 16, 2020 1:04 PM
>> To: Dominic Hilsbos; ceph-users@lists.ceph.com
>> Subject: Re: [External Email] RE: [ceph-users] Beginner questions
>>
>> Dominic,
>>
>> We ended up with a 1.6 TB PCIe NVMe in each node.
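Bastiaan's recovery estimate above can be reproduced with a back-of-the-envelope calculation. The ~1.4x traffic multiplier and 75% effective link utilization are assumed fudge factors chosen to be consistent with "over 200 TB of traffic" and "roughly two and a half days", not measured values:

```python
# Idealized recovery-time estimate: assumes the network is the only
# bottleneck (no EC parity CPU cost, no competing client traffic).
def recovery_days(stored_tb, ec_overhead=1.4, link_gbit=10, efficiency=0.75):
    traffic_bits = stored_tb * ec_overhead * 1e12 * 8   # data to move
    seconds = traffic_bits / (link_gbit * 1e9 * efficiency)
    return seconds / 86400

print(round(recovery_days(144), 1))  # ~2.5 days for 144 TB on 10 Gbit
```

In practice recovery shares the wire with production I/O, so the real number is usually worse, which is the point being made above.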
>> For 8 drives this worked out to a DB size of something like 163 GB
>> per OSD. Allowing for expansion to 12 drives brings it down to
>> 124 GB. So maybe just put the WALs on NVMe and leave the DBs on the
>> platters?
>>
>> Understood that we will want to move to more nodes rather than more
>> drives per node, but our funding is grant- and donation-based, so we
>> may end up adding drives in the short term. The long-term plan is to
>> get to separate MON/MGR/MDS nodes and 10s of OSD nodes.
>>
>> Due to our current low node count, we are considering erasure-coded
>> PGs rather than replicated in order to maximize usable space. Any
>> guidelines or suggestions on this?
>>
>> Also, sorry for not replying inline. I haven't done this much in a
>> while - I'll figure it out.
>>
>> Thanks.
>>
>> -Dave
>>
>> On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
>> > Dave;
>> >
>> > I'd like to expand on this answer, briefly...
>> >
>> > The information in the docs is wrong. There have been many
>> > discussions about changing it, but no good alternative has been
>> > suggested, thus it hasn't been changed.
>> >
>> > The 3rd-party project that Ceph's BlueStore uses for its database
>> > (RocksDB) apparently only uses DB sizes of 3 GB, 30 GB, and 300 GB.
>> > As Dave mentions below, when RocksDB executes a compact operation,
>> > it creates a new blob of the same target size and writes the
>> > compacted data into it. This doubles the necessary space. In
>> > addition, BlueStore places its Write-Ahead Log (WAL) into the
>> > fastest storage available to the OSD daemon, i.e. NVMe if
>> > available. Since this is done before the first compaction is
>> > requested, the WAL can force compaction onto slower storage.
>> >
>> > Thus, the numbers I've had floating around in my head for our next
>> > cluster are: 7 GB, 66 GB, and 630 GB. From all the discussion I've
>> > seen around RocksDB, those seem like good, common-sense targets.
>> > Pick the largest one that works for your setup.
>> >
>> > All that said...
>> > You would really want to pair a 600 GB+ NVMe with 12 TB drives;
>> > otherwise your DB is almost guaranteed to overflow onto the
>> > spinning drive and affect performance.
>> >
>> > I became aware of most of this after we planned our clusters, so I
>> > haven't tried it; YMMV.
>> >
>> > One final note: more hosts and more spindles usually translate into
>> > better cluster-wide performance. I can't predict how the relatively
>> > low client counts you're suggesting would impact that.
>> >
>> > Thank you,
>> >
>> > Dominic L. Hilsbos, MBA
>> > Director – Information Technology
>> > Perform Air International Inc.
>> > dhils...@performair.com
>> > www.PerformAir.com
>> >
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> > Behalf Of Bastiaan Visser
>> > Sent: Thursday, January 16, 2020 10:55 AM
>> > To: Dave Hall
>> > Cc: ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] Beginner questions
>> >
>> > I would definitely go for Nautilus. There are quite a few
>> > optimizations that went in after Mimic.
>> >
>> > BlueStore DB size usually ends up at either 30 or 60 GB.
>> > 30 GB is one of the sweet spots during normal operation. But during
>> > compaction, Ceph writes the new data before removing the old, hence
>> > the 60 GB. The next sweet spot is 300/600 GB; any size between 60
>> > and 300 GB will never be fully used.
>> >
>> > DB usage also depends on how Ceph is used; object storage is known
>> > to use a lot more DB space than RBD images, for example.
>> >
>> > On Thu, Jan 16, 2020 at 17:46, Dave Hall <kdh...@binghamton.edu> wrote:
>> > Hello all.
>> > Sorry for the beginner questions...
>> > I am in the process of setting up a small (3 nodes, 288 TB) Ceph
>> > cluster to store some research data. It is expected that this
>> > cluster will grow significantly in the next year, possibly to
>> > multiple petabytes and 10s of nodes.
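The sweet-spot logic described in the thread (RocksDB effectively uses ~3/30/300 GB levels, and compaction temporarily needs a second copy before the old one is deleted) can be sketched as follows; the helper is illustrative, not how BlueStore actually provisions space:

```python
# RocksDB level sizes discussed above; compaction writes a new blob of
# the same target size before removing the old, so plan for double.
ROCKSDB_LEVELS_GB = [3, 30, 300]

def db_partition_gb(expected_db_gb):
    """Smallest sweet-spot level that fits, doubled for compaction."""
    for level in ROCKSDB_LEVELS_GB:
        if expected_db_gb <= level:
            return 2 * level
    return 2 * ROCKSDB_LEVELS_GB[-1]

print(db_partition_gb(25))   # 60 -- the 30/60 GB sweet spot
print(db_partition_gb(150))  # 600 -- anything over 30 GB needs the next tier
```

This is also why "any size between 60 and 300 GB will never be fully used": the DB stays at the 30 GB level until it can jump to 300 GB.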
>> > At this time I'm expecting a relatively small number of clients,
>> > with only one or two actively writing collected data - albeit at a
>> > high volume per day.
>> > Currently I'm deploying on Debian 9 via ceph-ansible.
>> > Before I put this cluster into production I have a couple of
>> > questions based on my experience to date:
>> > Luminous, Mimic, or Nautilus? I need stability for this deployment,
>> > so I am sticking with Debian 9 since Debian 10 is fairly new, and I
>> > have been hesitant to go with Nautilus. Yet Mimic seems to have had
>> > a hard road on Debian but for the efforts at Croit.
>> > • Statements on the Releases page are now making more sense to me,
>> > but I would like to confirm that Nautilus is the right choice at
>> > this time?
>> > BlueStore DB size: My nodes currently have 8 x 12 TB drives (plus 4
>> > empty bays) and a PCIe NVMe drive. If I understand the suggested
>> > calculation correctly, the DB size for a 12 TB BlueStore OSD would
>> > be 480 GB. If my NVMe isn't big enough to provide this size, should
>> > I skip provisioning the DBs on the NVMe, or should I give each OSD
>> > 1/12th of what I have available? Also, should I try to shift budget
>> > a bit to get more NVMe as soon as I can, and redo the OSDs when
>> > sufficient NVMe is available?
>> > Thanks.
>> > -Dave
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
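Dave's two sizing approaches can be compared numerically. Both helpers are illustrative sketches: the 4% figure is the old documentation guideline his 480 GB number comes from, and the even-split calculation converts the NVMe's decimal capacity to binary GiB per OSD:

```python
# Two DB-sizing approaches from the thread (illustrative helpers).
def db_four_percent_gb(osd_tb):
    # Old docs guideline: DB = 4% of the OSD's raw capacity.
    return osd_tb * 1000 * 0.04

def db_share_gib(nvme_tb, osds):
    # Even split of the NVMe (decimal TB) into binary GiB per OSD.
    return nvme_tb * 1e12 / 2**30 / osds

print(db_four_percent_gb(12))        # 480.0 GB -- the figure Dave cites
print(round(db_share_gib(1.6, 12)))  # ~124 GiB/OSD once expanded to 12 drives
```

The gap between 480 GB and ~124 GiB is exactly the tension discussed above: the even split lands between the 60 and 300 GB RocksDB sweet spots.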