Discussing DB size requirements without knowing the exact cluster
requirements doesn't work.

Here are some real-world examples:

cluster1: CephFS, mostly large files, replicated x3
0.2% used for metadata

cluster2: radosgw, mix between replicated and erasure, mixed file sizes
(lots of tiny files, though)
1.3% used for metadata

The 4%-10% quoted in the docs is *not based on any actual usage data*;
it is just an absolute worst-case estimate.


A 30 GB DB partition for a 12 TiB disk is 0.25% even if the disk is completely
full (which it won't be), and that is sufficient for many use cases.
I think cluster2 with 1.3% is one of the highest metadata usages that I've
seen on an actual production cluster.
I can think of a setup that probably has more but I haven't ever explicitly
checked it.
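
To put rough numbers on that, here is a back-of-the-envelope helper in Python.
The metadata fractions are the observed values from the examples above; the 85%
fill ratio is just an assumption, not measured data:

    # Rough estimate of the BlueStore DB space an OSD is likely to need,
    # based on an observed metadata-to-data ratio instead of the 4%-10%
    # worst case from the docs.
    def estimate_db_gib(osd_capacity_tib, metadata_fraction, fill_ratio=0.85):
        # osd_capacity_tib: raw size of the data disk in TiB
        # metadata_fraction: observed DB share of stored data, e.g.
        #   0.002 (CephFS, mostly large files) or 0.013 (radosgw, tiny files)
        # fill_ratio: how full you realistically expect the disk to get
        #   (0.85 is an assumption, not a number from this thread)
        return osd_capacity_tib * 1024 * fill_ratio * metadata_fraction

    print(estimate_db_gib(12, 0.002))   # ~21 GiB -> a 30 GB partition is plenty
    print(estimate_db_gib(12, 0.013))   # ~136 GiB -> the radosgw-style worst case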

The restriction to 3/30/300 is temporary and might be fixed in a future
release, so I'd just partition that disk into X DB devices.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jan 16, 2020 at 10:28 PM Bastiaan Visser <basti...@msbv.nl> wrote:

> Dave made a good point: WAL + DB might end up a little over 60G, so I would
> probably go with ~70G partitions/LVs per OSD in your case (if the NVMe
> drive is smart enough to spread the writes over all available capacity,
> most recent NVMes are). I have not yet seen a WAL larger than, or even
> close to, a gigabyte.
>
> We don't even think about EC-coded pools on clusters with fewer than 6
> nodes (spindles; full SSD is another story).
> EC pools need more processing resources. We usually settle for 1 GB of RAM per
> TB of storage on replicated-only clusters, but when EC pools are involved,
> we add at least 50% to that. Also make sure your processors are up for it.
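
(A quick sketch of that rule of thumb in Python, using the 12 x 12TB node from
the recovery example below; the 1 GB/TB and +50% figures are the ones quoted
above:)

    # RAM rule of thumb quoted above: ~1 GB per TB of raw storage on
    # replicated-only clusters, at least 50% more when EC pools are involved.
    def ram_estimate_gb(raw_storage_tb, uses_ec_pools=False):
        base = raw_storage_tb * 1.0      # 1 GB per TB of raw storage
        return base * 1.5 if uses_ec_pools else base

    # The 12 x 12TB node from the example below (144 TB raw):
    print(ram_estimate_gb(144))                      # 144 GB, replicated only
    print(ram_estimate_gb(144, uses_ec_pools=True))  # 216 GB with EC pools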
>
> Do not base your calculations on a healthy cluster -> build to fail.
> How long are you willing to be in a degraded state after a node failure?
> Especially when using many large spindles, recovery time might be way
> longer than you think. 12 * 12TB is 144TB of storage; on a 4+2 EC pool you
> might end up with over 200 TB of traffic, and on a 10Gig network that's
> roughly 2 and a half days to recover, and that is IF your processors are not
> a bottleneck due to EC parity calculations and all capacity is available for
> recovery (which is usually not the case; there is still production traffic
> that will eat up resources).
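
(The recovery estimate above, worked through as a rough sketch; the 200 TB
traffic figure comes from the paragraph, and line rate is the theoretical
best case:)

    # Naive lower bound on recovery time: recovery traffic / network line rate.
    # Real recoveries are slower (production traffic, EC parity CPU cost,
    # throttles), which is how a theoretical ~2 days becomes 2.5 days or more.
    recovery_traffic_tb = 200              # figure from the example above
    link_gbit = 10
    link_gb_per_s = link_gbit / 8          # 1.25 GB/s at line rate
    seconds = recovery_traffic_tb * 1000 / link_gb_per_s
    print(seconds / 86400)                 # ~1.9 days at theoretical line rate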
>
> On Thu, 16 Jan 2020 at 21:30, <dhils...@performair.com> wrote:
>
>> Dave;
>>
>> I don't like reading inline responses, so...
>>
>> I have zero experience with EC pools, so I won't pretend to give advice
>> in that area.
>>
>> I would think that small NVMe for DB would be better than nothing, but I
>> don't know.
>>
>> Once I got the hang of building clusters, it was relatively easy to wipe
>> a cluster out and rebuild it.  Perhaps you could take some time, and
>> benchmark different configurations?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA
>> Director – Information Technology
>> Perform Air International Inc.
>> dhils...@performair.com
>> www.PerformAir.com
>>
>>
>> -----Original Message-----
>> From: Dave Hall [mailto:kdh...@binghamton.edu]
>> Sent: Thursday, January 16, 2020 1:04 PM
>> To: Dominic Hilsbos; ceph-users@lists.ceph.com
>> Subject: Re: [External Email] RE: [ceph-users] Beginner questions
>>
>> Dominic,
>>
>> We ended up with a 1.6TB PCIe NVMe in each node.  For 8 drives this
>> worked out to a DB size of something like 163GB per OSD. Allowing for
>> expansion to 12 drives brings it down to 124GB. So maybe just put the
>> WALs on NVMe and leave the DBs on the platters?
>>
>> Understood that we will want to move to more nodes rather than more
>> drives per node, but our funding is grant and donation based, so we may
>> end up adding drives in the short term.  The long term plan is to get to
>> separate MON/MGR/MDS nodes and 10s of OSD nodes.
>>
>> Due to our current low node count, we are considering erasure-coded PGs
>> rather than replicated in order to maximize usable space.  Any
>> guidelines or suggestions on this?
>>
>> Also, sorry for not replying inline.  I haven't done this much in a
>> while - I'll figure it out.
>>
>> Thanks.
>>
>> -Dave
>>
>> On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
>> > Dave;
>> >
>> > I'd like to expand on this answer, briefly...
>> >
>> > The information in the docs is wrong.  There have been many discussions
>> about changing it, but no good alternative has been suggested, thus it
>> hasn't been changed.
>> >
>> > The third-party project that Ceph's BlueStore uses for its database
>> (RocksDB) apparently only uses DB sizes of 3GB, 30GB, and 300GB.  As Dave
>> mentions below, when RocksDB executes a compaction, it creates a new
>> blob of the same target size and writes the compacted data into it.  This
>> doubles the necessary space.  In addition, BlueStore places its Write-Ahead
>> Log (WAL) on the fastest storage available to the OSD daemon, i.e.
>> NVMe if available.  Since this is done before the first compaction is
>> requested, the WAL can force compaction onto slower storage.
>> >
>> > Thus, the numbers I've had floating around in my head for our next
>> cluster are: 7GB, 66GB, and 630GB.  From all the discussion I've seen
>> around RocksDB, those seem like good, common sense targets.  Pick the
>> largest one that works for your setup.
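
(A small sketch of that "pick the largest one that fits" logic; the 7/66/630 GB
targets are the numbers above, and the 1.6 TB NVMe shared by 8-12 OSDs is the
node discussed elsewhere in this thread:)

    # Per-OSD DB target sizes suggested above (GB): RocksDB's 3/30/300 levels
    # plus compaction headroom and WAL.
    DB_TARGETS_GB = [7, 66, 630]

    def pick_db_size_gb(nvme_capacity_gb, num_osds):
        # Pick the largest suggested target that fits each OSD's share of a
        # shared NVMe device; None means even 7 GB doesn't fit.
        share = nvme_capacity_gb / num_osds
        fitting = [t for t in DB_TARGETS_GB if t <= share]
        return max(fitting) if fitting else None

    # The 1.6 TB NVMe discussed in this thread, shared by 8 or 12 OSDs:
    print(pick_db_size_gb(1600, 8))    # 66 -> 66 GB per OSD fits
    print(pick_db_size_gb(1600, 12))   # 66
    # 630 GB per OSD would need roughly a 7.5 TB device for 12 OSDs.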
>> >
>> > All that said... You would really want to pair a 600GB+ NVMe with 12TB
>> drives, otherwise your DB is almost guaranteed to overflow onto the
>> spinning drive, and affect performance.
>> >
>> > I became aware of most of this after we planned our clusters, so I
>> haven't tried it, YMMV.
>> >
>> > One final note: more hosts and more spindles usually translate into
>> better cluster-wide performance.  I can't predict how the relatively low
>> client counts you're suggesting would impact that.
>> >
>> > Thank you,
>> >
>> > Dominic L. Hilsbos, MBA
>> > Director – Information Technology
>> > Perform Air International Inc.
>> > dhils...@performair.com
>> > www.PerformAir.com
>> >
>> >
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> Of Bastiaan Visser
>> > Sent: Thursday, January 16, 2020 10:55 AM
>> > To: Dave Hall
>> > Cc: ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] Beginner questions
>> >
>> > I would definitely go for Nautilus; there are quite a few
>> optimizations that went in after Mimic.
>> >
>> > BlueStore DB size usually ends up at either 30 or 60 GB.
>> > 30 GB is one of the sweet spots during normal operation, but during
>> compaction, Ceph writes the new data before removing the old, hence the
>> 60 GB.
>> > The next sweet spot is 300/600 GB; any size between 60 and 300 GB will
>> never be fully used.
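
(A tiny sketch of that sweet-spot point; the sweet spots are the ones named
above, and the example partition sizes are taken from elsewhere in this
thread:)

    # Only the largest sweet spot at or below the partition size is actually
    # exercised; space in between sits idle.
    SWEET_SPOTS_GB = [30, 60, 300, 600]

    def effective_db_gb(partition_gb):
        usable = [s for s in SWEET_SPOTS_GB if s <= partition_gb]
        return max(usable) if usable else partition_gb

    # Partition sizes mentioned elsewhere in this thread:
    print(effective_db_gb(124))   # 60 -> a 124 GB partition behaves like 60 GB
    print(effective_db_gb(163))   # 60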
>> >
>> > DB usage also depends on how the cluster is used; object storage is known
>> to use a lot more DB space than RBD images, for example.
>> >
>> > On Thu, 16 Jan 2020 at 17:46, Dave Hall <kdh...@binghamton.edu> wrote:
>> > Hello all.
>> > Sorry for the beginner questions...
>> > I am in the process of setting up a small (3 nodes, 288TB) Ceph cluster
>> to store some research data.  It is expected that this cluster will grow
>> significantly in the next year, possibly to multiple petabytes and 10s of
>> nodes.  At this time I'm expecting a relatively small number of clients,
>> with only one or two actively writing collected data, albeit at a high
>> volume per day.
>> > Currently I'm deploying on Debian 9 via ceph-ansible.
>> > Before I put this cluster into production I have a couple questions
>> based on my experience to date:
>> > Luminous, Mimic, or Nautilus?  I need stability for this deployment, so
>> I am sticking with Debian 9 since Debian 10 is fairly new, and I have been
>> hesitant to go with Nautilus.  Yet Mimic seems to have had a hard road on
>> Debian but for the efforts at Croit.
>> > • Statements on the Releases page are now making more sense to me, but
>> I would like to confirm that Nautilus is the right choice at this time?
>> > Bluestore DB size:  My nodes currently have 8 x 12TB drives (plus 4
>> empty bays) and a PCIe NVMe drive.  If I understand the suggested
>> calculation correctly, the DB size for a 12 TB Bluestore OSD would be
>> 480GB.  If my NVMe isn't big enough to provide this size, should I skip
>> provisioning the DBs on the NVMe, or should I give each OSD 1/12th of what
>> I have available?  Also, should I try to shift budget a bit to get more
>> NVMe as soon as I can, and redo the OSDs when sufficient NVMe is available?
>> > Thanks.
>> > -Dave
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
