There is no difference in allocation between replication and EC. If the failure domain is host, one OSD per host is used for a PG. So if you use a 2+1 EC profile with a host failure domain, you need 3 hosts for a healthy cluster. The pool will go read-only when you have a failure (host or disk), or when you are doing maintenance on a node (reboot). On a node failure there will be no rebuilding, since there is no place to find a 3rd OSD for a PG, so you'll have to fix or replace the node before any writes will be accepted.
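The placement constraint above can be sketched as a toy calculation (these helpers are illustrative, not a Ceph API): with a host failure domain, each of the k+m chunks of a PG must land on a distinct host, and Ceph's default min_size for an EC pool is k+1, which is why a 2+1 pool stops accepting writes as soon as one host is down.

```python
# Illustrative helpers, not a Ceph API. With failure domain = host,
# each of the k+m chunks of a PG lands on a distinct host.
def ec_hosts_needed(k, m):
    healthy = k + m           # minimum hosts for a healthy pool
    self_healing = k + m + 1  # a spare host to rebuild a lost chunk onto
    return healthy, self_healing

def writable_after_failures(k, m, failed_hosts):
    # Default min_size for EC pools is k+1: writes stop once fewer
    # than k+1 chunks of a PG remain available.
    return (k + m - failed_hosts) >= k + 1

print(ec_hosts_needed(2, 1))             # (3, 4)
print(writable_after_failures(2, 1, 1))  # False -- pool goes read-only
```

With 2+1 on exactly 3 hosts there is no 4th host to rebuild onto, which matches the "no rebuilding on node failure" point above.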
So yes, you can do a 2+1 EC pool on 3 nodes, but you are paying the price in reliability, flexibility, and possibly performance. The only way to really know the latter is benchmarking with your setup.

I think you will be fine on the hardware side. Memory recommendations swing between 512 MB and 1 GB per TB of storage; I usually go with 1 GB, but I never use disks larger than 4 TB. On the CPU I always try to have a few more cores than I have OSDs in a machine, so 16 is fine in your case.

On Fri, Jan 17, 2020, 03:29 Dave Hall <kdh...@binghamton.edu> wrote:
> Bastiaan,
>
> Regarding EC pools: Our concern at 3 nodes is that 2-way replication
> seems risky - if the two copies don't match, which one is corrupted?
> However, 3-way replication on a 3-node cluster triples the price per TB.
> Doing EC pools that are the equivalent of RAID-5 2+1 seems like the
> right place to start as far as maximizing capacity is concerned,
> although I do understand the potential time involved in rebuilding a
> 12 TB drive. Early on I'd be more concerned about a drive failure than
> about a node failure.
>
> Regarding the hardware, our nodes are single-socket EPYC 7302 (16
> cores, 32 threads) with 128 GB RAM. From what I recall reading, I
> think the RAM, at least, is a bit higher than recommended.
>
> Question: Does a PG (EC or replicated) span multiple drives per node?
> I haven't got to the point of understanding this part yet, so pardon
> the totally naive question. I'll probably be conversant on this by
> Monday.
>
> -Dave
>
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
>
> On 1/16/2020 4:27 PM, Bastiaan Visser wrote:
> > Dave made a good point: WAL + DB might end up a little over 60 GB,
> > so I would probably go with ~70 GB partitions/LVs per OSD in your
> > case (if the NVMe drive is smart enough to spread the writes over
> > all available capacity; most recent NVMes are). I have not yet seen
> > a WAL larger than, or even close to, a gigabyte.
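The rules of thumb above (roughly 1 GB of RAM per TB of storage, a few more cores than OSDs per machine) can be turned into a quick sizing check. The helper below is a sketch; the 4-spare-core margin is an assumed interpretation of "a few more cores":

```python
# Rule-of-thumb node sizing from the thread (assumptions, not Ceph
# defaults): ~1 GiB RAM per TB of OSD storage, a few cores above the
# OSD count.
def node_sizing(osds, drive_tb, ram_per_tb_gib=1.0, spare_cores=4):
    ram_gib = osds * drive_tb * ram_per_tb_gib
    cores = osds + spare_cores
    return ram_gib, cores

ram, cores = node_sizing(osds=12, drive_tb=12)
print(ram)    # 144.0 GiB -- slightly above the 128 GB in Dave's nodes
print(cores)  # 16 -- matching the 16-core EPYC 7302
```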
> > We don't even think about EC-coded pools on clusters with fewer
> > than 6 nodes (spindles; full SSD is another story). EC pools need
> > more processing resources. We usually settle for 1 GB per TB of
> > storage on replicated-only clusters, but when EC pools are involved,
> > we add at least 50% to that. Also make sure your processors are up
> > for it.
> >
> > Do not base your calculations on a healthy cluster -> build to fail.
> > How long are you willing to be in a degraded state on node failure?
> > Especially when using many larger spindles, recovery time might be
> > way longer than you think. 12 * 12 TB is 144 TB of storage; on a 4+2
> > EC pool you might end up with over 200 TB of recovery traffic, and
> > on a 10 Gbit network that's roughly two and a half days to recover -
> > and that's if your processors are not a bottleneck due to EC parity
> > calculations and all capacity is available for recovery (which is
> > usually not the case; there is still production traffic that will
> > eat up resources).
> >
> > On Thu, Jan 16, 2020 at 21:30, <dhils...@performair.com> wrote:
>> Dave;
>>
>> I don't like reading inline responses, so...
>>
>> I have zero experience with EC pools, so I won't pretend to give
>> advice in that area.
>>
>> I would think that a small NVMe for the DB would be better than
>> nothing, but I don't know.
>>
>> Once I got the hang of building clusters, it was relatively easy to
>> wipe a cluster out and rebuild it. Perhaps you could take some time
>> and benchmark different configurations?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA
>> Director – Information Technology
>> Perform Air International Inc.
>> dhils...@performair.com
>> www.PerformAir.com
>>
>> -----Original Message-----
>> From: Dave Hall [mailto:kdh...@binghamton.edu]
>> Sent: Thursday, January 16, 2020 1:04 PM
>> To: Dominic Hilsbos; ceph-users@lists.ceph.com
>> Subject: Re: [External Email] RE: [ceph-users] Beginner questions
>>
>> Dominic,
>>
>> We ended up with a 1.6 TB PCIe NVMe in each node.
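Bastiaan's recovery estimate above can be reproduced with a back-of-the-envelope calculation. The ~1.4x traffic multiplier and 75% effective link utilization are assumed fudge factors chosen to be consistent with "over 200 TB of traffic" and "roughly two and a half days", not measured values:

```python
# Idealized recovery-time estimate: assumes the network is the only
# bottleneck (no EC parity CPU cost, no competing client traffic).
def recovery_days(stored_tb, ec_overhead=1.4, link_gbit=10, efficiency=0.75):
    traffic_bits = stored_tb * ec_overhead * 1e12 * 8   # data to move
    seconds = traffic_bits / (link_gbit * 1e9 * efficiency)
    return seconds / 86400

print(round(recovery_days(144), 1))  # ~2.5 days for 144 TB on 10 Gbit
```

In practice recovery shares the wire with production I/O, so the real number is usually worse, which is the point being made above.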
>> For 8 drives this worked out to a DB size of something like 163 GB
>> per OSD. Allowing for expansion to 12 drives brings it down to
>> 124 GB. So maybe just put the WALs on NVMe and leave the DBs on the
>> platters?
>>
>> Understood that we will want to move to more nodes rather than more
>> drives per node, but our funding is grant- and donation-based, so we
>> may end up adding drives in the short term. The long-term plan is to
>> get to separate MON/MGR/MDS nodes and 10s of OSD nodes.
>>
>> Due to our current low node count, we are considering erasure-coded
>> PGs rather than replicated in order to maximize usable space. Any
>> guidelines or suggestions on this?
>>
>> Also, sorry for not replying inline. I haven't done this much in a
>> while - I'll figure it out.
>>
>> Thanks.
>>
>> -Dave
>>
>> On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
>> > Dave;
>> >
>> > I'd like to expand on this answer, briefly...
>> >
>> > The information in the docs is wrong. There have been many
>> > discussions about changing it, but no good alternative has been
>> > suggested, thus it hasn't been changed.
>> >
>> > The 3rd-party project that Ceph's BlueStore uses for its database
>> > (RocksDB) apparently only uses DB sizes of 3 GB, 30 GB, and 300 GB.
>> > As Dave mentions below, when RocksDB executes a compact operation,
>> > it creates a new blob of the same target size and writes the
>> > compacted data into it. This doubles the necessary space. In
>> > addition, BlueStore places its Write-Ahead Log (WAL) into the
>> > fastest storage available to the OSD daemon, i.e. NVMe if
>> > available. Since this is done before the first compaction is
>> > requested, the WAL can force compaction onto slower storage.
>> >
>> > Thus, the numbers I've had floating around in my head for our next
>> > cluster are: 7 GB, 66 GB, and 630 GB. From all the discussion I've
>> > seen around RocksDB, those seem like good, common-sense targets.
>> > Pick the largest one that works for your setup.
>> >
>> > All that said...
>> > You would really want to pair a 600 GB+ NVMe with 12 TB drives;
>> > otherwise your DB is almost guaranteed to overflow onto the
>> > spinning drive and affect performance.
>> >
>> > I became aware of most of this after we planned our clusters, so I
>> > haven't tried it; YMMV.
>> >
>> > One final note: more hosts and more spindles usually translate into
>> > better cluster-wide performance. I can't predict how the relatively
>> > low client counts you're suggesting would impact that.
>> >
>> > Thank you,
>> >
>> > Dominic L. Hilsbos, MBA
>> > Director – Information Technology
>> > Perform Air International Inc.
>> > dhils...@performair.com
>> > www.PerformAir.com
>> >
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> > Behalf Of Bastiaan Visser
>> > Sent: Thursday, January 16, 2020 10:55 AM
>> > To: Dave Hall
>> > Cc: ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] Beginner questions
>> >
>> > I would definitely go for Nautilus. There are quite a few
>> > optimizations that went in after Mimic.
>> >
>> > BlueStore DB size usually ends up at either 30 or 60 GB.
>> > 30 GB is one of the sweet spots during normal operation. But during
>> > compaction, Ceph writes the new data before removing the old, hence
>> > the 60 GB. The next sweet spot is 300/600 GB; any size between 60
>> > and 300 GB will never be fully used.
>> >
>> > DB usage also depends on how Ceph is used; object storage is known
>> > to use a lot more DB space than RBD images, for example.
>> >
>> > On Thu, Jan 16, 2020 at 17:46, Dave Hall <kdh...@binghamton.edu> wrote:
>> > Hello all.
>> > Sorry for the beginner questions...
>> > I am in the process of setting up a small (3 nodes, 288 TB) Ceph
>> > cluster to store some research data. It is expected that this
>> > cluster will grow significantly in the next year, possibly to
>> > multiple petabytes and 10s of nodes.
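The sweet-spot logic described in the thread (RocksDB effectively uses ~3/30/300 GB levels, and compaction temporarily needs a second copy before the old one is deleted) can be sketched as follows; the helper is illustrative, not how BlueStore actually provisions space:

```python
# RocksDB level sizes discussed above; compaction writes a new blob of
# the same target size before removing the old, so plan for double.
ROCKSDB_LEVELS_GB = [3, 30, 300]

def db_partition_gb(expected_db_gb):
    """Smallest sweet-spot level that fits, doubled for compaction."""
    for level in ROCKSDB_LEVELS_GB:
        if expected_db_gb <= level:
            return 2 * level
    return 2 * ROCKSDB_LEVELS_GB[-1]

print(db_partition_gb(25))   # 60 -- the 30/60 GB sweet spot
print(db_partition_gb(150))  # 600 -- anything over 30 GB needs the next tier
```

This is also why "any size between 60 and 300 GB will never be fully used": the DB stays at the 30 GB level until it can jump to 300 GB.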
>> > At this time I'm expecting a relatively small number of clients,
>> > with only one or two actively writing collected data - albeit at a
>> > high volume per day.
>> > Currently I'm deploying on Debian 9 via ceph-ansible.
>> > Before I put this cluster into production I have a couple of
>> > questions based on my experience to date:
>> > Luminous, Mimic, or Nautilus? I need stability for this deployment,
>> > so I am sticking with Debian 9 since Debian 10 is fairly new, and I
>> > have been hesitant to go with Nautilus. Yet Mimic seems to have had
>> > a hard road on Debian but for the efforts at Croit.
>> > • Statements on the Releases page are now making more sense to me,
>> > but I would like to confirm that Nautilus is the right choice at
>> > this time?
>> > BlueStore DB size: My nodes currently have 8 x 12 TB drives (plus 4
>> > empty bays) and a PCIe NVMe drive. If I understand the suggested
>> > calculation correctly, the DB size for a 12 TB BlueStore OSD would
>> > be 480 GB. If my NVMe isn't big enough to provide this size, should
>> > I skip provisioning the DBs on the NVMe, or should I give each OSD
>> > 1/12th of what I have available? Also, should I try to shift budget
>> > a bit to get more NVMe as soon as I can, and redo the OSDs when
>> > sufficient NVMe is available?
>> > Thanks.
>> > -Dave
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
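Dave's two sizing approaches can be compared numerically. Both helpers are illustrative sketches: the 4% figure is the old documentation guideline his 480 GB number comes from, and the even-split calculation converts the NVMe's decimal capacity to binary GiB per OSD:

```python
# Two DB-sizing approaches from the thread (illustrative helpers).
def db_four_percent_gb(osd_tb):
    # Old docs guideline: DB = 4% of the OSD's raw capacity.
    return osd_tb * 1000 * 0.04

def db_share_gib(nvme_tb, osds):
    # Even split of the NVMe (decimal TB) into binary GiB per OSD.
    return nvme_tb * 1e12 / 2**30 / osds

print(db_four_percent_gb(12))        # 480.0 GB -- the figure Dave cites
print(round(db_share_gib(1.6, 12)))  # ~124 GiB/OSD once expanded to 12 drives
```

The gap between 480 GB and ~124 GiB is exactly the tension discussed above: the even split lands between the 60 and 300 GB RocksDB sweet spots.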