Bastiaan,
Regarding EC pools: Our concern at 3 nodes is that 2-way replication
seems risky - if the two copies don't match, which one is corrupted?
However, 3-way replication on a 3 node cluster triples the price per
TB. Doing EC pools that are the equivalent of RAID-5 2+1 seems like
the right place to start as far as maximizing capacity is concerned,
although I do understand the potential time involved in rebuilding a 12
TB drive. Early on I'd be more concerned about a drive failure than
about a node failure.
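As a rough sketch of the capacity trade-off described above, the following compares usable space under 2-way replication, 3-way replication, and a 2+1 erasure-coded pool. The 288 TB raw figure is the cluster size quoted later in this thread; the function name is made up for illustration, and the numbers ignore BlueStore overhead and the free-space headroom a real cluster needs.

```python
def usable_tb(raw_tb, data_chunks, total_chunks):
    """Usable capacity when each object is stored as total_chunks pieces,
    of which data_chunks hold actual data (replication: data_chunks = 1)."""
    return raw_tb * data_chunks / total_chunks

RAW_TB = 288  # 3 nodes x 8 drives x 12 TB, as described in this thread

# (data chunks, total chunks) for each layout
layouts = {
    "2-way replication": (1, 2),
    "3-way replication": (1, 3),
    "EC 2+1": (2, 3),
}

for name, (k, n) in layouts.items():
    print(f"{name}: {usable_tb(RAW_TB, k, n):.0f} TB usable")
```

With 288 TB raw this works out to 144 TB, 96 TB, and 192 TB respectively, which is the gap that makes 2+1 attractive on capacity alone.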
Regarding the hardware, our nodes are single socket EPYC 7302 (16 core,
32 thread) with 128GB RAM. From what I recall reading I think the RAM,
at least, is a bit higher than recommended.
Question: Does a PG (EC or replicated) span multiple drives per node?
I haven't got to the point of understanding this part yet, so pardon the
totally naive question. I'll probably be conversant on this by Monday.
-Dave
Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)
On 1/16/2020 4:27 PM, Bastiaan Visser wrote:
Dave made a good point: WAL + DB might end up a little over 60GB, so I
would probably go with ~70GB partitions/LVs per OSD in your case
(if the NVMe drive is smart enough to spread the writes over all
available capacity, which most recent NVMe drives are). I have not yet
seen a WAL larger than, or even close to, a gigabyte.
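A quick sanity check of that sizing, as a sketch only: it assumes the 1.6TB NVMe per node mentioned elsewhere in this thread, decimal gigabytes, and one ~70GB partition per OSD. The helper name is hypothetical.

```python
def nvme_fits(nvme_gb, osd_count, partition_gb=70):
    """Return (fits, leftover_gb) for osd_count partitions of partition_gb each."""
    needed = osd_count * partition_gb
    return needed <= nvme_gb, nvme_gb - needed

# 8 drives today, 12 if the empty bays are filled
for osds in (8, 12):
    fits, leftover = nvme_fits(1600, osds)
    print(f"{osds} OSDs: fits={fits}, leftover={leftover} GB")
```

Under these assumptions both the 8-drive and the 12-drive layout fit, with several hundred GB left over in either case.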
We don't even think about EC pools on clusters with fewer than 6
nodes (spindles; full SSD is another story).
EC pools need more processing resources. We usually settle on 1 gig
per TB of storage on replicated-only clusters, but when EC pools are
involved, we add at least 50% to that. Also make sure your processors
are up for it.
Do not base your calculations on a healthy cluster -> build to fail.
How long are you willing to be in a degraded state after a node failure?
Especially when using many larger spindles, recovery time might be way
longer than you think. 12 * 12TB is 144TB of storage; on a 4+2 EC pool
you might end up with over 200TB of traffic, which on a 10Gig network
is roughly two and a half days to recover. That is IF your processors
are not a bottleneck due to EC parity calculations and all capacity is
available for recovery (which is usually not the case; there is still
production traffic that will eat up resources).
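The recovery estimate above can be reproduced with a small sketch. It takes the 200TB traffic figure as given, uses decimal units, and treats the share of the link actually available for recovery as a parameter. The 75% value is an assumption of this sketch, not a number from this thread.

```python
def recovery_days(traffic_tb, link_gbit, efficiency=1.0):
    """Days needed to move traffic_tb terabytes over a link of link_gbit
    gigabits per second, of which the fraction `efficiency` is usable."""
    bits = traffic_tb * 1e12 * 8
    seconds = bits / (link_gbit * 1e9 * efficiency)
    return seconds / 86400

print(f"line rate:      {recovery_days(200, 10):.2f} days")
print(f"75% efficiency: {recovery_days(200, 10, 0.75):.2f} days")
```

At full line rate 200TB takes a little under two days; at 75% of line rate it is close to two and a half, and any production traffic or CPU bottleneck pushes it out further.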
On Thu, 16 Jan 2020 at 21:30, <dhils...@performair.com> wrote:
Dave;
I don't like reading inline responses, so...
I have zero experience with EC pools, so I won't pretend to give
advice in that area.
I would think that small NVMe for DB would be better than nothing,
but I don't know.
Once I got the hang of building clusters, it was relatively easy
to wipe a cluster out and rebuild it. Perhaps you could take some
time, and benchmark different configurations?
Thank you,
Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com <http://www.PerformAir.com>
-----Original Message-----
From: Dave Hall [mailto:kdh...@binghamton.edu]
Sent: Thursday, January 16, 2020 1:04 PM
To: Dominic Hilsbos; ceph-users@lists.ceph.com
Subject: Re: [External Email] RE: [ceph-users] Beginner questions
Dominic,
We ended up with a 1.6TB PCIe NVMe in each node. For 8 drives this
worked out to a DB size of something like 163GB per OSD. Allowing for
expansion to 12 drives brings it down to 124GB. So maybe just put the
WALs on NVMe and leave the DBs on the platters?
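One way to read those per-OSD numbers is against the 7GB, 66GB, and 630GB targets suggested in the message quoted further down. The sketch below simply picks the largest target that fits in a given per-OSD share; the function name is hypothetical and the targets are taken from this thread, not from Ceph documentation.

```python
TARGETS_GB = (7, 66, 630)  # sizes suggested in the quoted message below

def largest_target(share_gb, targets=TARGETS_GB):
    """Largest target size that fits in share_gb, or None if none fits."""
    fitting = [t for t in targets if t <= share_gb]
    return max(fitting) if fitting else None

# Per-OSD shares quoted above for 8 and 12 drives
for share in (163, 124):
    print(f"{share} GB share -> {largest_target(share)} GB target")
```

Both the 8-drive and the 12-drive share land on the 66GB target, so by that reasoning filling the empty bays would not change the usable DB tier.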
Understood that we will want to move to more nodes rather than more
drives per node, but our funding is grant and donation based, so we
may end up adding drives in the short term. The long term plan is to
get to separate MON/MGR/MDS nodes and 10s of OSD nodes.
Due to our current low node count, we are considering erasure-coded
PGs rather than replicated in order to maximize usable space. Any
guidelines or suggestions on this?
Also, sorry for not replying inline. I haven't done this much in a
while - I'll figure it out.
Thanks.
-Dave
On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
> Dave;
>
> I'd like to expand on this answer, briefly...
>
> The information in the docs is wrong. There have been many
> discussions about changing it, but no good alternative has been
> suggested, thus it hasn't been changed.
>
> The 3rd party project that Ceph's BlueStore uses for its database
> (RocksDB) apparently only uses DB sizes of 3GB, 30GB, and 300GB.
> As Dave mentions below, when RocksDB executes a compact operation,
> it creates a new blob of the same target size, and writes the
> compacted data into it. This doubles the necessary space. In
> addition, BlueStore places its Write Ahead Log (WAL) into the
> fastest storage that is available to the OSD daemon, i.e. NVMe if
> available. Since this is done before the first compaction is
> requested, the WAL can force compaction onto slower storage.
>
> Thus, the numbers I've had floating around in my head for our next
> cluster are: 7GB, 66GB, and 630GB. From all the discussion I've
> seen around RocksDB, those seem like good, common sense targets.
> Pick the largest one that works for your setup.
>
> All that said... You would really want to pair a 600GB+ NVMe with
> 12TB drives, otherwise your DB is almost guaranteed to overflow
> onto the spinning drive, and affect performance.
>
> I became aware of most of this after we planned our clusters, so I
> haven't tried it, YMMV.
>
> One final note: more hosts and more spindles usually translate
> into better cluster-wide performance. I can't predict how the
> relatively low client counts you're suggesting would affect that.
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director – Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com <http://www.PerformAir.com>
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> Behalf Of Bastiaan Visser
> Sent: Thursday, January 16, 2020 10:55 AM
> To: Dave Hall
> Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Beginner questions
>
> I would definitely go for Nautilus. Quite a few optimizations
> went in after Mimic.
>
> Bluestore DB size usually ends up at either 30 or 60 GB.
> 30 GB is one of the sweet spots during normal operation. But
> during compaction, Ceph writes the new data before removing the
> old, hence the 60 GB.
> The next sweet spot is 300/600 GB. Any size between 60 and 300 GB
> will never be used.
>
> DB usage is also dependent on Ceph usage; object storage is known
> to use a lot more DB space than RBD images, for example.
>
> On Thu, 16 Jan 2020 at 17:46, Dave Hall <kdh...@binghamton.edu> wrote:
> Hello all.
> Sorry for the beginner questions...
> I am in the process of setting up a small (3 nodes, 288TB) Ceph
> cluster to store some research data. It is expected that this
> cluster will grow significantly in the next year, possibly to
> multiple petabytes and 10s of nodes. At this time I'm expecting a
> relatively small number of clients, with only one or two actively
> writing collected data - albeit at a high volume per day.
> Currently I'm deploying on Debian 9 via ceph-ansible.
> Before I put this cluster into production I have a couple of
> questions based on my experience to date:
> Luminous, Mimic, or Nautilus? I need stability for this
> deployment, so I am sticking with Debian 9 since Debian 10 is
> fairly new, and I have been hesitant to go with Nautilus. Yet
> Mimic seems to have had a hard road on Debian but for the efforts
> at Croit.
> • Statements on the Releases page are now making more sense to me,
> but I would like to confirm that Nautilus is the right choice at
> this time?
> Bluestore DB size: My nodes currently have 8 x 12TB drives (plus
> 4 empty bays) and a PCIe NVMe drive. If I understand the suggested
> calculation correctly, the DB size for a 12 TB Bluestore OSD would
> be 480GB. If my NVMe isn't big enough to provide this size, should
> I skip provisioning the DBs on the NVMe, or should I give each OSD
> 1/12th of what I have available? Also, should I try to shift
> budget a bit to get more NVMe as soon as I can, and redo the OSDs
> when sufficient NVMe is available?
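The 480GB figure for a 12 TB OSD corresponds to a DB size of 4% of the data device. The sketch below works that through per node; the 4% ratio is inferred here from those two numbers and should be checked against the documentation for the release in use. Units are decimal and the function name is made up for illustration.

```python
def db_size_gb(osd_tb, ratio=0.04):
    """DB size in decimal GB for an OSD of osd_tb terabytes."""
    return osd_tb * 1000 * ratio

per_osd = db_size_gb(12)
print(f"per OSD: {per_osd:.0f} GB")

# NVMe needed per node for the current 8 drives and for all 12 bays
for osds in (8, 12):
    print(f"{osds} OSDs: {per_osd * osds / 1000:.2f} TB of NVMe per node")
```

By this calculation a fully populated node would need well over 5 TB of NVMe, which is why the question of a smaller per-OSD share comes up at all.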
> Thanks.
> -Dave
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com