Hi,

Just to add to the previous discussion, consumer SSDs like these can
unfortunately be significantly *slower* than plain old HDDs for Ceph. This is
because Ceph always uses sync writes to guarantee that data is durable on disk
before acknowledging the write.

Unfortunately, NAND writes are intrinsically quite slow, and TLC/QLC
(triple/quad-level cell) SSDs are the slowest of them all. Enterprise SSDs
solve this with power-loss-protection capacitors, which means they can safely
acknowledge a write the second the data is in the fast RAM cache on the
device.
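
If you want to see this effect yourself, a rough sketch with fio (not an exact
reproduction of Ceph's I/O pattern, and /dev/sdX below is just a placeholder
for a scratch drive - the test overwrites data on it) is to force single-depth
4 KiB sync writes:

  # WARNING: overwrites data on the target; only point this at a scratch drive.
  # Queue depth 1, 4 KiB sync writes - roughly what Ceph's WAL/journal asks of
  # a drive. Drives without PLP often drop to a few hundred IOPS (or less)
  # here, even if the spec sheet promises 100k+.
  fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 \
      --runtime=60 --time_based --group_reporting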

Cheap consumer SSDs fall into one of two categories:

1. The drive lies: when a sync write is requested, it acknowledges the write
as soon as the data is in its volatile write cache. This gives seemingly great
performance ... until you have a power loss and your data is corrupted.
Thankfully, very few drives do this today.
2. The drive handles the sync write correctly, which means it cannot
acknowledge until the data has been flushed from the cache to the actual NAND,
which is (very) slow.


The short story is that all drives without power-loss protection should
probably be avoided: if the performance looks great, it more likely means the
drive falls in category #1 than that you have found a magical & cheap
solution.
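
There is no foolproof software test for category #1 (only an actual power-pull
test proves it), but as a quick sanity check you can at least look at - and
toggle - the drive's volatile write cache and re-run the sync-write fio test
above with it on and off (SATA drives; /dev/sdX is again a placeholder):

  # Show whether the volatile write cache is currently enabled
  hdparm -W /dev/sdX
  # Disable it and re-run the sync-write fio test; re-enable with 'hdparm -W 1'.
  # If the sync-write numbers barely change, the drive was likely honoring
  # flushes in the first place.
  hdparm -W 0 /dev/sdX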

There is unfortunately no inherent "best" SSD; it depends on your usage. For
instance, for our large data partitions we need a lot of space and high read
performance, but we don't store/update the data that frequently, so we opted
for Samsung PM883 drives that are only rated for 0.8 DWPD
(disk-writes-per-day). In contrast, for metadata drives where we have more
writes (but don't need a ton of storage), we use drives that can handle
3 DWPD, like the Samsung SM883.

Virtually all vendors have similarly tiered lines of drives, so you will need
to start by analyzing how much data you expect to write per day relative to
the total storage volume, and then pick drives rated accordingly.
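
As a back-of-the-envelope sketch (made-up numbers - plug in your own), the
endurance you need per drive is roughly the client writes per day times the
replication factor, divided by the total raw capacity:

  # Hypothetical example: 2 TB of client writes/day, 3x replication,
  # 10 OSDs of 4 TB each. Ignores write amplification (compaction, metadata),
  # so treat the result as a lower bound.
  awk 'BEGIN {
      client_tb_per_day = 2; replicas = 3; osds = 10; tb_per_osd = 4;
      dwpd = (client_tb_per_day * replicas) / (osds * tb_per_osd);
      printf "~%.2f DWPD needed per drive\n", dwpd
  }'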

If you are operating a very read/write-intensive cluster with hundreds of
operations in parallel, you will benefit a lot from higher-IOPS drives, but be
aware that the theoretical numbers on the spec sheets are typically only
achieved at very large queue depths (i.e., always having 32-64 operations in
flight in parallel).
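
If you want to see how much the spec-sheet numbers depend on queue depth, a
quick comparison with fio makes it obvious (again a sketch; the path is a
placeholder, and random reads are non-destructive):

  # Random-read IOPS at queue depth 1 vs 32 on the same device
  fio --name=qd1 --filename=/dev/sdX --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=1 --runtime=30 --time_based
  fio --name=qd32 --filename=/dev/sdX --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=32 --runtime=30 --time_based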

Since you are currently using consumer SSDs (which definitely don't have the
endurance to handle intensive I/O anyway), my guess is that you have a fairly
low-end setup, in which case good performance depends more on consistently low
latency for all operations (including to/from the network cards) than on peak
IOPS.
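
A simple way to keep an eye on that - just a sketch, interpret the numbers
against your own baseline - is to watch the per-OSD latencies Ceph already
reports:

  # Per-OSD commit/apply latency in ms; persistent outliers usually point at a
  # slow drive or a saturated network link.
  ceph osd perf
  # Re-run it every few seconds to see whether the same OSDs stay at the top:
  watch -n 5 'ceph osd perf'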

If I were to invest in new servers today, I would likely go with NVMe, mostly
because it's the future and not *that* much more expensive, but for old
servers almost any enterprise-class SSD with power-loss protection from a
major vendor should be fine - you just need to analyse whether you need
write-intensive drives or not.


Cheers,

Erik

--
Erik Lindahl <erik.lind...@gmail.com>
On 28 Dec 2022 at 08:44 +0100, hosseinz8...@yahoo.com <hosseinz8...@yahoo.com>, 
wrote:
> Thanks. I am planning to change all of my disks. But do you know which
> enterprise SSD is the best trade-off between cost & IOPS performance? Which
> model and brand? Thanks in advance.
> On Wednesday, December 28, 2022 at 08:44:34 AM GMT+3:30, Konstantin Shalygin 
> <k0...@k0ste.ru> wrote:
>
> Hi,
>
> The cache was gone and the drive's background optimization is proceeding.
> This is not an enterprise device; you should never use it with Ceph 🙂
>
>
> k
> Sent from my iPhone
>
> > On 27 Dec 2022, at 16:41, hosseinz8...@yahoo.com wrote:
> >
> > Thanks Anthony. I have a cluster with QLC SSD disks (Samsung 860 QVO). The
> > cluster has been running for 2 years. Now all OSDs return 12 IOPS when
> > running tell bench, which is very slow. But I bought new QVO disks
> > yesterday and added one as a new OSD to the cluster. For the first hour, I
> > got 100 IOPS from this new OSD. But after 1 hour, this new OSD dropped back
> > to 12 IOPS, the same as the other old OSDs. I cannot imagine what is
> > happening?!!
> >     On Tuesday, December 27, 2022 at 12:18:07 AM GMT+3:30, Anthony D'Atri 
> > <a...@dreamsnake.net> wrote:
> >
> > My understanding is that when you ask an OSD to bench (via the admin
> > socket), only that OSD executes the benchmark; there is no replication.
> > Replication is a function of PGs.
> >
> > Thus, this is a narrowly-focused tool with both unique advantages and 
> > disadvantages.
> >
> >
> >
> > > > On Dec 26, 2022, at 12:47 PM, hosseinz8...@yahoo.com wrote:
> > > >
> > > > Hi experts, I want to know: when I execute the ceph tell osd.x bench
> > > > command, is replica 3 considered in the bench or not? I mean, for
> > > > example with replica 3, when I execute the tell bench command, does
> > > > replica 1 of the bench data get written to osd.x, replica 2 to osd.y,
> > > > and replica 3 to osd.z? If so, it means I cannot benchmark just one of
> > > > my OSDs in the cluster, because the IOPS and throughput of the two
> > > > other (possibly slower) OSDs would affect the result of the tell bench
> > > > command for my target OSD. Is that true?
> > > > Thanks in advance.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
