[ceph-users] Re: Why you might want packages not containers for Ceph deployments
Not using cephadm, I would also question other things like: - If it uses docker and docker daemon fails what happens to you containers? - I assume the ceph-osd containers need linux capability sysadmin. So if you have to allow this via your OC, all your tasks have potentially access to this permission. (That is why I chose not to allow the OC access to it) - cephadm only runs with docker? > -Original Message- > From: Martin Verges > Sent: 02 June 2021 13:29 > To: Matthew Vernon > Cc: ceph-users@ceph.io > Subject: [ceph-users] Re: Why you might want packages not containers for > Ceph deployments > > Hello, > > I agree to Matthew, here at croit we work a lot with containers all day > long. No problem with that and enough knowledge to say for sure it's not > about getting used to it. > For us and our decisions here, Storage is the most valuable piece of IT > equipment in a company. If you have problems with your storage, most > likely > you have a huge pain, costs, problems, downtime, whatever. Therefore, > your > storage solution must be damn simple, you switch it on, it has to work. > > If you take a short look into Ceph documentation about how to deploy a > cephadm cluster vs croit. We strongly believe it's much easier as we > take > away all the pain from OS up to Ceph while keeping it simple behind the > scene. You still can always login to a node, kill a process, attach some > strace or whatever you like as you know it from years of linux > administration without any complexity layers like docker/podman/... It's > just friction less. In the end, what do you need? A kernel, an > initramfs, > some systemd, a bit of libs and tooling, and the Ceph packages. > > In addition, we help lot's of Ceph users on a regular basis with their > hand > made setups, but we don't really wanna touch the cephadm ones, as they > are > often harder to debug. But of course we do it anyways :). > > To have a perfect storage, strip away anything unneccessary. Avoid any > complexity, avoid anything that might affect your system. Keep it simply > stupid. > > -- > Martin Verges > Managing director > > Mobile: +49 174 9335695 > E-Mail: martin.ver...@croit.io > Chat: https://t.me/MartinVerges > > croit GmbH, Freseniusstr. 31h, 81247 Munich > CEO: Martin Verges - VAT-ID: DE310638492 > Com. register: Amtsgericht Munich HRB 231263 > > Web: https://croit.io > YouTube: https://goo.gl/PGE1Bx > > > On Wed, 2 Jun 2021 at 11:38, Matthew Vernon wrote: > > > Hi, > > > > In the discussion after the Ceph Month talks yesterday, there was a > bit > > of chat about cephadm / containers / packages. IIRC, Sage observed > that > > a common reason in the recent user survey for not using cephadm was > that > > it only worked on containerised deployments. I think he then went on > to > > say that he hadn't heard any compelling reasons why not to use > > containers, and suggested that resistance was essentially a user > > education question[0]. > > > > I'd like to suggest, briefly, that: > > > > * containerised deployments are more complex to manage, and this is > not > > simply a matter of familiarity > > * reducing the complexity of systems makes admins' lives easier > > * the trade-off of the pros and cons of containers vs packages is not > > obvious, and will depend on deployment needs > > * Ceph users will benefit from both approaches being supported into > the > > future > > > > We make extensive use of containers at Sanger, particularly for > > scientific workflows, and also for bundling some web apps (e.g. > > Grafana). 
We've also looked at a number of container runtimes (Docker, > > singularity, charliecloud). They do have advantages - it's easy to > > distribute a complex userland in a way that will run on (almost) any > > target distribution; rapid "cloud" deployment; some separation (via > > namespaces) of network/users/processes. > > > > For what I think of as a 'boring' Ceph deploy (i.e. install on a set > of > > dedicated hardware and then run for a long time), I'm not sure any of > > these benefits are particularly relevant and/or compelling - Ceph > > upstream produce Ubuntu .debs and Canonical (via their Ubuntu Cloud > > Archive) provide .debs of a couple of different Ceph releases per > Ubuntu > > LTS - meaning we can easily separate out OS upgrade from Ceph upgrade. > > And upgrading the Ceph packages _doesn't_ restart the daemons[1], > > meaning that we maintain control over restart order during an upgrade. > > And while we might briefly install packages from a PPA or similar to > > test a bugfix, we roll those (test-)cluster-wide, rather than trying > to > > run a mixed set of versions on a single cluster - and I understand > this > > single-version approach is best practice. > > > > Deployment via containers does bring complexity; some examples we've > > found at Sanger (not all Ceph-related, which we run from packages): > > > > * you now have 2 process supervision points - dockerd and systemd > > * dock
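To make the "control over restart order" point above concrete, a package-based upgrade on such a deployment usually looks roughly like the sketch below. This is a minimal outline assuming the stock systemd units and Debian/Ubuntu package names; daemon targets and ordering should be adapted to the actual cluster.

  apt-get update && apt-get install --only-upgrade ceph ceph-mon ceph-osd ceph-mgr ceph-mds radosgw
                                        # packages upgrade, running daemons keep the old binaries
  ceph osd set noout                    # avoid rebalancing while daemons are bounced
  systemctl restart ceph-mon.target     # monitors first, one host at a time
  systemctl restart ceph-mgr.target
  systemctl restart ceph-osd.target     # then the OSDs on this host; watch 'ceph -s' and wait for
                                        # active+clean before moving on to the next host
  ceph osd unset noout                  # once every host has been done
  ceph versions                         # confirm all daemons now report the new release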
[ceph-users] radosgw-admin bucket delete linear memory growth?
I am seeing huge RAM usage while my bucket delete churns over left-over multiparts. I realize there are *many*, being aborted 1000 at a time, like this:

2021-06-03 07:29:06.408 7f9b7f633240 0 abort_bucket_multiparts WARNING : aborted 254000 incomplete multipart uploads

My first run ended with radosgw-admin running out of memory, so it seems some part of this keeps data around, or forgets to free the old parts of the lists after they have been cancelled? I also don't know whether restarting means it iterates over the same ones again, or whether this log line means 254k are already processed, so that if I do need to restart it will at least have made a few hours of progress. At this point RES is 2.1g, so roughly 8 KB per "entry" if it is linear somehow. This is ceph 13.2.10.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
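For scale, 2.1 GB of RES spread over the ~254,000 aborted uploads above works out to roughly 8 KB per upload, so the growth does look linear in the number of entries. One hedged workaround (untested here; endpoint and bucket name are placeholders) is to abort the leftover multipart uploads from the S3 side first, one at a time, so the later radosgw-admin bucket delete has far less state to walk through:

  ENDPOINT=http://rgw.example.com:7480   # placeholder RGW endpoint
  BUCKET=mybucket                        # placeholder bucket name
  aws --endpoint-url "$ENDPOINT" s3api list-multipart-uploads \
      --bucket "$BUCKET" --query 'Uploads[].[Key,UploadId]' --output text |
  while read -r key upload_id; do
      aws --endpoint-url "$ENDPOINT" s3api abort-multipart-upload \
          --bucket "$BUCKET" --key "$key" --upload-id "$upload_id"
  done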
[ceph-users] Re: Why you might want packages not containers for Ceph deployments
Podman containers will not restart due to restart or failure of centralized podman daemon. Container is not synonymous to Docker. This thread reminds me systemd haters threads more and more by I guess it is fine. On Thu, Jun 3, 2021, 2:16 AM Marc wrote: > Not using cephadm, I would also question other things like: > > - If it uses docker and docker daemon fails what happens to you containers? > - I assume the ceph-osd containers need linux capability sysadmin. So if > you have to allow this via your OC, all your tasks have potentially access > to this permission. (That is why I chose not to allow the OC access to it) > - cephadm only runs with docker? > > > > > > -Original Message- > > From: Martin Verges > > Sent: 02 June 2021 13:29 > > To: Matthew Vernon > > Cc: ceph-users@ceph.io > > Subject: [ceph-users] Re: Why you might want packages not containers for > > Ceph deployments > > > > Hello, > > > > I agree to Matthew, here at croit we work a lot with containers all day > > long. No problem with that and enough knowledge to say for sure it's not > > about getting used to it. > > For us and our decisions here, Storage is the most valuable piece of IT > > equipment in a company. If you have problems with your storage, most > > likely > > you have a huge pain, costs, problems, downtime, whatever. Therefore, > > your > > storage solution must be damn simple, you switch it on, it has to work. > > > > If you take a short look into Ceph documentation about how to deploy a > > cephadm cluster vs croit. We strongly believe it's much easier as we > > take > > away all the pain from OS up to Ceph while keeping it simple behind the > > scene. You still can always login to a node, kill a process, attach some > > strace or whatever you like as you know it from years of linux > > administration without any complexity layers like docker/podman/... It's > > just friction less. In the end, what do you need? A kernel, an > > initramfs, > > some systemd, a bit of libs and tooling, and the Ceph packages. > > > > In addition, we help lot's of Ceph users on a regular basis with their > > hand > > made setups, but we don't really wanna touch the cephadm ones, as they > > are > > often harder to debug. But of course we do it anyways :). > > > > To have a perfect storage, strip away anything unneccessary. Avoid any > > complexity, avoid anything that might affect your system. Keep it simply > > stupid. > > > > -- > > Martin Verges > > Managing director > > > > Mobile: +49 174 9335695 > > E-Mail: martin.ver...@croit.io > > Chat: https://t.me/MartinVerges > > > > croit GmbH, Freseniusstr. 31h, 81247 Munich > > CEO: Martin Verges - VAT-ID: DE310638492 > > Com. register: Amtsgericht Munich HRB 231263 > > > > Web: https://croit.io > > YouTube: https://goo.gl/PGE1Bx > > > > > > On Wed, 2 Jun 2021 at 11:38, Matthew Vernon wrote: > > > > > Hi, > > > > > > In the discussion after the Ceph Month talks yesterday, there was a > > bit > > > of chat about cephadm / containers / packages. IIRC, Sage observed > > that > > > a common reason in the recent user survey for not using cephadm was > > that > > > it only worked on containerised deployments. I think he then went on > > to > > > say that he hadn't heard any compelling reasons why not to use > > > containers, and suggested that resistance was essentially a user > > > education question[0]. 
> > > > > > I'd like to suggest, briefly, that: > > > > > > * containerised deployments are more complex to manage, and this is > > not > > > simply a matter of familiarity > > > * reducing the complexity of systems makes admins' lives easier > > > * the trade-off of the pros and cons of containers vs packages is not > > > obvious, and will depend on deployment needs > > > * Ceph users will benefit from both approaches being supported into > > the > > > future > > > > > > We make extensive use of containers at Sanger, particularly for > > > scientific workflows, and also for bundling some web apps (e.g. > > > Grafana). We've also looked at a number of container runtimes (Docker, > > > singularity, charliecloud). They do have advantages - it's easy to > > > distribute a complex userland in a way that will run on (almost) any > > > target distribution; rapid "cloud" deployment; some separation (via > > > namespaces) of network/users/processes. > > > > > > For what I think of as a 'boring' Ceph deploy (i.e. install on a set > > of > > > dedicated hardware and then run for a long time), I'm not sure any of > > > these benefits are particularly relevant and/or compelling - Ceph > > > upstream produce Ubuntu .debs and Canonical (via their Ubuntu Cloud > > > Archive) provide .debs of a couple of different Ceph releases per > > Ubuntu > > > LTS - meaning we can easily separate out OS upgrade from Ceph upgrade. > > > And upgrading the Ceph packages _doesn't_ restart the daemons[1], > > > meaning that we maintain control over restart order during an upgrade. > > > And while we might briefly install
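For what it's worth on the "docker daemon fails" question earlier in the thread: a cephadm/podman deployment ends up with one systemd unit per daemon, keyed by the cluster fsid, rather than one central container daemon, so that failure mode is Docker-specific. A quick look on a cephadm host (the fsid and OSD id below are placeholders):

  cephadm ls                                      # daemons cephadm manages on this host
  systemctl list-units 'ceph-*@*.service'         # one unit per daemon: ceph-<fsid>@<type>.<id>
  systemctl status 'ceph-<fsid>@osd.12.service'   # placeholder fsid and OSD id
  systemctl restart 'ceph-<fsid>@osd.12.service'  # a single daemon can be bounced on its own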
[ceph-users] Re: Can we deprecate FileStore in Quincy?
Hi folks,

I'm fine with dropping Filestore in the R release!
Only one thing to add: please add a warning to all versions we can upgrade from to the R release, so not only Quincy but also Pacific!

Thanks,
Ansgar

Neha Ojha wrote on Tue., 1 June 2021, 21:24:
> Hello everyone,
>
> Given that BlueStore has been the default and more widely used
> objectstore since quite some time, we would like to understand whether
> we can consider deprecating FileStore in our next release, Quincy and
> remove it in the R release. There is also a proposal [0] to add a
> health warning to report FileStore OSDs.
>
> We discussed this topic in the Ceph Month session today [1] and there
> were no objections from anybody on the call. I wanted to reach out to
> the list to check if there are any concerns about this or any users
> who will be impacted by this decision.
>
> Thanks,
> Neha
>
> [0] https://github.com/ceph/ceph/pull/39440
> [1] https://pad.ceph.com/p/ceph-month-june-2021
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
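For anyone wondering whether they would be hit by the proposed warning, the objectstore type is already exposed in the OSD metadata; a quick check (jq assumed to be installed) could be:

  ceph osd count-metadata osd_objectstore          # e.g. { "bluestore": 120, "filestore": 4 }
  ceph osd metadata | jq -r '.[] | "\(.id) \(.osd_objectstore)"' | grep -w filestore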
[ceph-users] Re: Why you might want packages not containers for Ceph deployments
Dear Sasha, and everyone else as well,

Sasha Litvak writes:
> Podman containers will not restart due to restart or failure of centralized
> podman daemon. Container is not synonymous to Docker. This thread reminds
> me systemd haters threads more and more by I guess it is fine.

Calling people "haters" who have reservations or critique is simply wrong, harmful and disrespectful. I ask you to retract the above statement immediately, for the sake of the Ceph and open source community.

Nico

--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Can we deprecate FileStore in Quincy?
On 2-6-2021 21:56, Neha Ojha wrote:
> On Wed, Jun 2, 2021 at 12:31 PM Willem Jan Withagen wrote:
>> On 1-6-2021 21:24, Neha Ojha wrote:
>>> Hello everyone,
>>>
>>> Given that BlueStore has been the default and more widely used
>>> objectstore since quite some time, we would like to understand whether
>>> we can consider deprecating FileStore in our next release, Quincy and
>>> remove it in the R release. There is also a proposal [0] to add a
>>> health warning to report FileStore OSDs.
>>>
>>> We discussed this topic in the Ceph Month session today [1] and there
>>> were no objections from anybody on the call. I wanted to reach out to
>>> the list to check if there are any concerns about this or any users
>>> who will be impacted by this decision.
>>
>> That means that I really need to get going on finishing the bluestore
>> stuff in the FreeBSD port. Getting things in has been slow, real slow,
>> also due to me being busy with company stuff.
>> So when is the expected "kill filestore" date?
>
> Currently, the proposal is to remove it in the R release (one release
> after deprecation), which should be in March 2023. Would that give you
> enough time to migrate?

If I can't get it in by then, that would be odd. In concept I already had it working, but getting the code in is sometimes hard because the goalposts keep sliding at quite some pace, and I do not always have time to keep up due to work requirements.

I'll probably modify/conditionalise the deprecation remark for FreeBSD, so as not to scare the few testing users that are there.

--WjW
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
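For Linux users making the same move ahead of the deprecation, the per-OSD FileStore-to-BlueStore route from the Ceph docs boils down to destroying and re-creating each OSD and letting recovery refill it; a condensed sketch (OSD id and device are examples):

  ID=42
  DEV=/dev/sdq
  ceph osd out $ID
  while ! ceph osd safe-to-destroy $ID; do sleep 60; done   # wait for the PGs to drain
  systemctl stop ceph-osd@$ID
  ceph osd destroy $ID --yes-i-really-mean-it
  ceph-volume lvm zap $DEV --destroy
  ceph-volume lvm create --bluestore --data $DEV --osd-id $ID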
[ceph-users] Re: Why you might want packages not containers for Ceph deployments
My Cephadm deployment on RHEL8 created a service for each container, complete with restarts. And on the host, the processes run under the 'ceph' user account. The biggest issue I had with running as containers is that the unit.run script generated runs podman -rm ... with the -rm, the logs are removed when there is an issue, so it took extra effort to figure out what the issue I was having on one machine was. (Although that turned out to be a bad memory chip, which would have manifested itself anyway). As a manager of a team who develops microservices in containers, I have a mixed attitude towards them - the ability to know the version of Java, and support libraries deployed for a given container, and isolate updates one image at a time can be a bonus, but with my client's CI/CD pipeline and how they want our containers to be built, simple tasks like upgrading a package version, upgrading the version of Java or even a simple replacement of a certificate has become significantly more difficult, because we need to rebuild all of the containers and go through the QA processes rather than update a cert. For my usage, (At home running Ceph on older hardware I converted to servers) I don't want to have to care about ceph dependencies and also isolate things from other things running on the server, so a container infrastructure works well, but I can see where packages can be much better in a well maintained server infrastructure. -Original Message- From: Sasha Litvak Sent: Thursday, June 3, 2021 3:57 AM To: ceph-users Subject: [ceph-users] Re: Why you might want packages not containers for Ceph deployments Podman containers will not restart due to restart or failure of centralized podman daemon. Container is not synonymous to Docker. This thread reminds me systemd haters threads more and more by I guess it is fine. On Thu, Jun 3, 2021, 2:16 AM Marc wrote: > Not using cephadm, I would also question other things like: > > - If it uses docker and docker daemon fails what happens to you containers? > - I assume the ceph-osd containers need linux capability sysadmin. So > if you have to allow this via your OC, all your tasks have potentially > access to this permission. (That is why I chose not to allow the OC > access to it) > - cephadm only runs with docker? > > > > > > -Original Message- > > From: Martin Verges > > Sent: 02 June 2021 13:29 > > To: Matthew Vernon > > Cc: ceph-users@ceph.io > > Subject: [ceph-users] Re: Why you might want packages not containers > > for Ceph deployments > > > > Hello, > > > > I agree to Matthew, here at croit we work a lot with containers all > > day long. No problem with that and enough knowledge to say for sure > > it's not about getting used to it. > > For us and our decisions here, Storage is the most valuable piece of > > IT equipment in a company. If you have problems with your storage, > > most likely you have a huge pain, costs, problems, downtime, > > whatever. Therefore, your storage solution must be damn simple, you > > switch it on, it has to work. > > > > If you take a short look into Ceph documentation about how to deploy > > a cephadm cluster vs croit. We strongly believe it's much easier as > > we take away all the pain from OS up to Ceph while keeping it simple > > behind the scene. You still can always login to a node, kill a > > process, attach some strace or whatever you like as you know it from > > years of linux administration without any complexity layers like > > docker/podman/... It's just friction less. In the end, what do you > > need? 
A kernel, an initramfs, some systemd, a bit of libs and > > tooling, and the Ceph packages. > > > > In addition, we help lot's of Ceph users on a regular basis with > > their hand made setups, but we don't really wanna touch the cephadm > > ones, as they are often harder to debug. But of course we do it > > anyways :). > > > > To have a perfect storage, strip away anything unneccessary. Avoid > > any complexity, avoid anything that might affect your system. Keep > > it simply stupid. > > > > -- > > Martin Verges > > Managing director > > > > Mobile: +49 174 9335695 > > E-Mail: martin.ver...@croit.io > > Chat: https://t.me/MartinVerges > > > > croit GmbH, Freseniusstr. 31h, 81247 Munich > > CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht > > Munich HRB 231263 > > > > Web: https://croit.io > > YouTube: https://goo.gl/PGE1Bx > > > > > > On Wed, 2 Jun 2021 at 11:38, Matthew Vernon wrote: > > > > > Hi, > > > > > > In the discussion after the Ceph Month talks yesterday, there was > > > a > > bit > > > of chat about cephadm / containers / packages. IIRC, Sage observed > > that > > > a common reason in the recent user survey for not using cephadm > > > was > > that > > > it only worked on containerised deployments. I think he then went > > > on > > to > > > say that he hadn't heard any compelling reasons why not to use > > > conta
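On the "podman run --rm removes the logs" problem mentioned above: the daemon output still ends up in the journal under the per-daemon unit, and cephadm clusters can also be switched back to classic file logging. A sketch, with placeholder daemon and fsid names:

  cephadm logs --name osd.3                        # wrapper around journalctl for that daemon
  journalctl -e -u 'ceph-<fsid>@osd.3.service'     # the same logs, straight from the unit
  ceph config set global log_to_file true          # or write traditional files under /var/log/ceph
  ceph config set global mon_cluster_log_to_file true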
[ceph-users] Re: Can we deprecate FileStore in Quincy?
On Thu, Jun 3, 2021 at 2:34 AM Ansgar Jazdzewski wrote: > > Hi folks, > > I'm fine with dropping Filestore in the R release! > Only one thing to add is: please add a warning to all versions we can upgrade > from to the R release son not only Quincy but also pacific! Sure! - Neha > > Thanks, > Ansgar > > Neha Ojha schrieb am Di., 1. Juni 2021, 21:24: >> >> Hello everyone, >> >> Given that BlueStore has been the default and more widely used >> objectstore since quite some time, we would like to understand whether >> we can consider deprecating FileStore in our next release, Quincy and >> remove it in the R release. There is also a proposal [0] to add a >> health warning to report FileStore OSDs. >> >> We discussed this topic in the Ceph Month session today [1] and there >> were no objections from anybody on the call. I wanted to reach out to >> the list to check if there are any concerns about this or any users >> who will be impacted by this decision. >> >> Thanks, >> Neha >> >> [0] https://github.com/ceph/ceph/pull/39440 >> [1] https://pad.ceph.com/p/ceph-month-june-2021 >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: SAS vs SATA for OSD
I suspect the behavior of the controller and the behavior of the drive firmware will end up mattering more than SAS vs SATA. As always, it's best if you can test it first before committing to buying a pile of them. Historically I have seen SATA drives that have performed well as far as HDDs go, though.

Mark

On 6/3/21 4:25 PM, Dave Hall wrote:
> Hello,
>
> We're planning another batch of OSD nodes for our cluster. Our prior nodes
> have been 8 x 12TB SAS drives plus 500GB NVMe per HDD. Due to market
> circumstances and the shortage of drives those 12TB SAS drives are in short
> supply.
>
> Our integrator has offered an option of 8 x 14TB SATA drives (still
> Enterprise). For Ceph, will the switch to SATA carry a performance
> difference that I should be concerned about?
>
> Thanks.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
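On "test it first": a short fio run against a blank candidate drive is usually enough to spot a model with weak firmware before buying a pile of them. This is destructive on the target device, and the device name is only an example:

  # single-threaded sync 4 KB random writes approximate the worst case an OSD puts on a disk
  fio --name=4k-sync-write --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based
  # plus a large sequential pass for raw throughput
  fio --name=seq-read --filename=/dev/sdX --direct=1 --rw=read --bs=4M --runtime=60 --time_based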
[ceph-users] Re: SAS vs SATA for OSD
Dave- These are just general observations of how SATA drives operate in storage clusters. It has been a while since I have run a storage cluster with SATA drives, but in the past I did notice that SATA drives would drop off the controllers pretty frequently. Depending on many factors, it may just be a brief outage where the drive wasn't available but recoverable, sometimes it meant going into the controller and rescanning for drives before they could be added back to the system, the worst was one chassis that would mark the drives as failed after the drive dropped off a certain number of times and the vendor could not correct the issue with a firmware update and had to replace the storage chassis. Regards, -Jamie On Thu, Jun 3, 2021 at 5:26 PM Dave Hall wrote: > Hello, > > We're planning another batch of OSD nodes for our cluster. Our prior nodes > have been 8 x 12TB SAS drives plus 500GB NVMe per HDD. Due to market > circumstances and the shortage of drives those 12TB SAS drives are in short > supply. > > Our integrator has offered an option of 8 x 14TB SATA drives (still > Enterprise). For Ceph, will the switch to SATA carry a performance > difference that I should be concerned about? > > Thanks. > > -Dave > > -- > Dave Hall > Binghamton University > kdh...@binghamton.edu > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > > -- Jamie Fargen Senior Consultant jfar...@redhat.com 813-817-4430 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
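The drop-offs described above usually leave traces that are cheap to watch for; something along these lines (device name is an example) can help tell a flaky link or cable apart from a genuinely failing drive:

  dmesg -T | grep -iE 'link (down|reset)|hard resetting|I/O error'   # kernel-side link resets
  smartctl -x /dev/sdX | grep -iE 'udma_crc|error count'             # cable/port errors vs media errors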
[ceph-users] Ceph Ansible fails on check if monitor initial keyring already exists
I am running the ceph-ansible script to install ceph version Stable-6.0 (Pacific). When running the sample yml file that was supplied by the github repo, it runs fine up until the "ceph-mon : check if monitor initial keyring already exists" step. There it will hang for 30-40 minutes before failing.

From my understanding, ceph-ansible should be creating this keyring and using it for communication between monitors, so does anyone know why the playbook would have a hard time with this step?

Thanks in advance!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
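Hard to say without logs, but two things are usually worth checking while that task sits there: whether the first monitor ever actually forms and writes its keyring, and whether the mon ports are reachable between the hosts. A rough checklist; paths assume the default cluster name "ceph", and the playbook/inventory names are the sample ones:

  ansible-playbook -vvv -i hosts site.yml            # see exactly which host/command the task is stuck on
  # on the first monitor host:
  ls -l /var/lib/ceph/mon/ceph-$(hostname -s)/keyring
  ceph --connect-timeout 10 -n mon. \
       -k /var/lib/ceph/mon/ceph-$(hostname -s)/keyring -s   # does the mon answer at all?
  ss -tlnp | grep -E '3300|6789'                     # mon ports listening? also check firewalls between hosts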
[ceph-users] Re: SAS vs SATA for OSD
Agreed. I think oh …. maybe 15-20 years ago there was often a wider difference between SAS and SATA drives, but with modern queuing etc. my sense is that there is less of an advantage. Seek and rotational latency I suspect dwarf interface differences wrt performance. The HBA may be a bigger bottleneck (and way more trouble). 500 GB NVMe seems like a lot per HDD, are you using that as WAL+DB with RGW, or as dmcache or something? Depending on your constraints, QLC flash might be more competitive than you think ;) — aad > I suspect the behavior of the controller and the behavior of the drive > firmware will end up mattering more than SAS vs SATA. As always it's best if > you can test it first before committing to buying a pile of them. > Historically I have seen SATA drives that have performed well as far as HDDs > go though. > > > Mark > > On 6/3/21 4:25 PM, Dave Hall wrote: >> Hello, >> >> We're planning another batch of OSD nodes for our cluster. Our prior nodes >> have been 8 x 12TB SAS drives plus 500GB NVMe per HDD. Due to market >> circumstances and the shortage of drives those 12TB SAS drives are in short >> supply. >> >> Our integrator has offered an option of 8 x 14TB SATA drives (still >> Enterprise). For Ceph, will the switch to SATA carry a performance >> difference that I should be concerned about? >> >> Thanks. >> >> -Dave >> >> -- >> Dave Hall >> Binghamton University >> kdh...@binghamton.edu >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: SAS vs SATA for OSD - WAL+DB sizing.
FWIW, those guidelines try to be sort of a one-size-fits-all recommendation that may not apply to your situation. Typically RBD has pretty low metadata overhead so you can get away with smaller DB partitions. 4% should easily be enough. If you are running heavy RGW write workloads with small objects, you will almost certainly use more than 4% for metadata (I've seen worst case up to 50%, but that was before column family sharding which should help to some extent). Having said that, bluestore will roll the higher rocksdb levels over to the slow device and keep the wall, L0, and other lower LSM levels on the fast device. It's not necessarily the end of the world if you end up with some of the more rarely used metadata on the HDD but having it on flash certain is nice. Mark On 6/3/21 5:18 PM, Dave Hall wrote: Anthony, I had recently found a reference in the Ceph docs that indicated something like 40GB per TB for WAL+DB space. For a 12TB HDD that comes out to 480GB. If this is no longer the guideline I'd be glad to save a couple dollars. -Dave -- Dave Hall Binghamton University kdh...@binghamton.edu On Thu, Jun 3, 2021 at 6:10 PM Anthony D'Atri wrote: Agreed. I think oh …. maybe 15-20 years ago there was often a wider difference between SAS and SATA drives, but with modern queuing etc. my sense is that there is less of an advantage. Seek and rotational latency I suspect dwarf interface differences wrt performance. The HBA may be a bigger bottleneck (and way more trouble). 500 GB NVMe seems like a lot per HDD, are you using that as WAL+DB with RGW, or as dmcache or something? Depending on your constraints, QLC flash might be more competitive than you think ;) — aad I suspect the behavior of the controller and the behavior of the drive firmware will end up mattering more than SAS vs SATA. As always it's best if you can test it first before committing to buying a pile of them. Historically I have seen SATA drives that have performed well as far as HDDs go though. Mark On 6/3/21 4:25 PM, Dave Hall wrote: Hello, We're planning another batch of OSD nodes for our cluster. Our prior nodes have been 8 x 12TB SAS drives plus 500GB NVMe per HDD. Due to market circumstances and the shortage of drives those 12TB SAS drives are in short supply. Our integrator has offered an option of 8 x 14TB SATA drives (still Enterprise). For Ceph, will the switch to SATA carry a performance difference that I should be concerned about? Thanks. -Dave -- Dave Hall Binghamton University kdh...@binghamton.edu ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
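Since BlueStore keeps the WAL and the lower RocksDB levels on the fast device and only rolls the higher levels over, it can be worth checking per OSD how much DB data has actually landed on the slow device. A quick look (run on the OSD host; jq assumed to be installed):

  ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_total_bytes, slow_used_bytes}'
  # non-zero slow_used_bytes is DB metadata living on the HDD, i.e. what the
  # BLUEFS_SPILLOVER health warning reports
  ceph health detail | grep -i spillover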
[ceph-users] Re: SAS vs SATA for OSD - WAL+DB sizing.
In releases before … Pacific I think, there are certain discrete capacities that DB will actually utilize: the sum of RocksDB levels. Lots of discussion in the archives. AIUI in those releases, with a 500 GB BlueStore WAL+DB device, you’ll with default settings only actually use ~~300 GB most of the time, though the extra might accelerate compaction. With Pacific I believe code was merged that shards OSD RocksDB to make better use of arbitrary partition / devices sizes. With older releases one can (or so I’ve read) game this a bit by carefully adjusting rocksdb.max-bytes-for-level-base; ISTR that Karan did that for his impressive 10 Billion Object exercise. I’ve seen threads on the list over the past couple of years that seemed to show spillover despite the DB device not being fully utilized; I hope that’s since been addressed. My understanding is that with column sharding, compaction only takes place on a fraction of the DB at any one time, so the transient space used for it (and thus prone to spillover) should be lessened. I may of course be out of my Vulcan mind, but HTH. — aad > On Jun 3, 2021, at 5:29 PM, Dave Hall wrote: > > Mark, > > We are running a mix of RGW, RDB, and CephFS. Our CephFS is pretty big, > but we're moving a lot of it to RGW. What prompted me to go looking for a > guideline was a high frequency of Spillover warnings as our cluster filled > up past the 50% mark. That was with 14.2.9, I think. I understand that > some things have changed since, but I think I'd like to have the > flexibility and performance of a generous WAL+DB - the cluster is used to > store research data, and the usage pattern is tending to change as the > research evolves. No telling what our mix will be a year from now. > > -Dave > > -- > Dave Hall > Binghamton University > kdh...@binghamton.edu > 607-760-2328 (Cell) > 607-777-4641 (Office) > > > On Thu, Jun 3, 2021 at 7:39 PM Mark Nelson wrote: > >> FWIW, those guidelines try to be sort of a one-size-fits-all >> recommendation that may not apply to your situation. Typically RBD has >> pretty low metadata overhead so you can get away with smaller DB >> partitions. 4% should easily be enough. If you are running heavy RGW >> write workloads with small objects, you will almost certainly use more >> than 4% for metadata (I've seen worst case up to 50%, but that was >> before column family sharding which should help to some extent). Having >> said that, bluestore will roll the higher rocksdb levels over to the >> slow device and keep the wall, L0, and other lower LSM levels on the >> fast device. It's not necessarily the end of the world if you end up >> with some of the more rarely used metadata on the HDD but having it on >> flash certain is nice. >> >> >> Mark >> >> >> On 6/3/21 5:18 PM, Dave Hall wrote: >>> Anthony, >>> >>> I had recently found a reference in the Ceph docs that indicated >> something >>> like 40GB per TB for WAL+DB space. For a 12TB HDD that comes out to >>> 480GB. If this is no longer the guideline I'd be glad to save a couple >>> dollars. >>> >>> -Dave >>> >>> -- >>> Dave Hall >>> Binghamton University >>> kdh...@binghamton.edu >>> >>> On Thu, Jun 3, 2021 at 6:10 PM Anthony D'Atri >>> wrote: >>> Agreed. I think oh …. maybe 15-20 years ago there was often a wider difference between SAS and SATA drives, but with modern queuing etc. my sense is that there is less of an advantage. Seek and rotational >> latency I suspect dwarf interface differences wrt performance. The HBA may be a bigger bottleneck (and way more trouble). 
500 GB NVMe seems like a lot per HDD, are you using that as WAL+DB with RGW, or as dmcache or something? Depending on your constraints, QLC flash might be more competitive than you think ;) — aad > I suspect the behavior of the controller and the behavior of the drive firmware will end up mattering more than SAS vs SATA. As always it's >> best if you can test it first before committing to buying a pile of them. Historically I have seen SATA drives that have performed well as far as HDDs go though. > > Mark > > On 6/3/21 4:25 PM, Dave Hall wrote: >> Hello, >> >> We're planning another batch of OSD nodes for our cluster. Our prior nodes >> have been 8 x 12TB SAS drives plus 500GB NVMe per HDD. Due to market >> circumstances and the shortage of drives those 12TB SAS drives are in short >> supply. >> >> Our integrator has offered an option of 8 x 14TB SATA drives (still >> Enterprise). For Ceph, will the switch to SATA carry a performance >> difference that I should be concerned about? >> >> Thanks. >> >> -Dave >> >> -- >> Dave Hall >> Binghamton University >> kdh...@binghamton.edu >> ___ >> ceph-users mailing list --
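The "sum of RocksDB levels" arithmetic referred to above, for reference; these figures assume the default RocksDB settings in pre-Pacific releases and are an approximation rather than a hard rule:

  # max_bytes_for_level_base = 256 MB, max_bytes_for_level_multiplier = 10
  # L1               ~ 0.26 GB
  # L1+L2            ~ 0.26 + 2.6        ~   3 GB   ->  "3 GB class" DB device
  # L1+L2+L3         ~ 0.26 + 2.6 + 26   ~  29 GB   -> "30 GB class"
  # L1+L2+L3+L4      ~ 29 + 256          ~ 285 GB   -> "300 GB class"
  # L5 would add ~2.5 TB, so a 500 GB partition behaves much like a 300 GB one
  # until Pacific's RocksDB column-family sharding changes the picture.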
[ceph-users] OSD Won't Start - LVM IOCTL Error - Read-only
Hello,

I had an OSD drop out a couple days ago. This is 14.2.16, Bluestore, HDD + NVMe, non-container. The HDD sort of went away. I powered down the node, reseated the drive, and it came back. However, the OSD won't start.

Systemctl --failed shows that the lvm2 pvscan failed, preventing the OSD unit from starting. Running the pvscan activate command manually with verbose gave 'device-mapper: reload ioctl on (253:7) failed: Read-only file system'. I have been looking at this for a while, but I can't figure out what is read-only that is causing the problem.

The full output of the pvscan is:

# pvscan --cache --activate ay --verbose '8:48'
  pvscan devices on command line.
  activation/auto_activation_volume_list configuration setting not defined: All logical volumes will be auto-activated.
  Activating logical volume ceph-block-b1fea172-71a4-463e-a3e3-8cdcc1bc7b79/osd-block-425faf92-449e-4b57-98f2-a90a7f60e2a4.
  activation/volume_list configuration setting not defined: Checking only host tags for ceph-block-b1fea172-71a4-463e-a3e3-8cdcc1bc7b79/osd-block-425faf92-449e-4b57-98f2-a90a7f60e2a4.
  Creating ceph--block--b1fea172--71a4--463e--a3e3--8cdcc1bc7b79-osd--block--425faf92--449e--4b57--98f2--a90a7f60e2a4
  Loading table for ceph--block--b1fea172--71a4--463e--a3e3--8cdcc1bc7b79-osd--block--425faf92--449e--4b57--98f2--a90a7f60e2a4 (253:7).
  device-mapper: reload ioctl on (253:7) failed: Read-only file system
  Removing ceph--block--b1fea172--71a4--463e--a3e3--8cdcc1bc7b79-osd--block--425faf92--449e--4b57--98f2--a90a7f60e2a4 (253:7)
  Activated 0 logical volumes in volume group ceph-block-b1fea172-71a4-463e-a3e3-8cdcc1bc7b79.
  0 logical volume(s) in volume group "ceph-block-b1fea172-71a4-463e-a3e3-8cdcc1bc7b79" now active
  ceph-block-b1fea172-71a4-463e-a3e3-8cdcc1bc7b79: autoactivation failed.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
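A few checks that usually narrow down where a device-mapper "Read-only file system" error comes from in a situation like this. The 8:48 argument to pvscan corresponds to /dev/sdd; the dm/LV names below are the ones from the output above, and the removal step is only appropriate if a stale mapping turns out to be the culprit:

  blockdev --getro /dev/sdd                  # 1 means the kernel has the whole disk write-protected
  cat /sys/block/sdd/ro
  dmesg -T | grep -iE 'sdd|write.protect|medium error'
  dmsetup info -c | grep b1fea172            # any stale mapping left over from before the reseat?
  dmsetup remove ceph--block--b1fea172--71a4--463e--a3e3--8cdcc1bc7b79-osd--block--425faf92--449e--4b57--98f2--a90a7f60e2a4
  lvchange -ay ceph-block-b1fea172-71a4-463e-a3e3-8cdcc1bc7b79/osd-block-425faf92-449e-4b57-98f2-a90a7f60e2a4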
[ceph-users] SAS vs SATA for OSD
Hello, We're planning another batch of OSD nodes for our cluster. Our prior nodes have been 8 x 12TB SAS drives plus 500GB NVMe per HDD. Due to market circumstances and the shortage of drives those 12TB SAS drives are in short supply. Our integrator has offered an option of 8 x 14TB SATA drives (still Enterprise). For Ceph, will the switch to SATA carry a performance difference that I should be concerned about? Thanks. -Dave -- Dave Hall Binghamton University kdh...@binghamton.edu ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: SAS vs SATA for OSD - WAL+DB sizing.
Anthony, I had recently found a reference in the Ceph docs that indicated something like 40GB per TB for WAL+DB space. For a 12TB HDD that comes out to 480GB. If this is no longer the guideline I'd be glad to save a couple dollars. -Dave -- Dave Hall Binghamton University kdh...@binghamton.edu On Thu, Jun 3, 2021 at 6:10 PM Anthony D'Atri wrote: > > Agreed. I think oh …. maybe 15-20 years ago there was often a wider > difference between SAS and SATA drives, but with modern queuing etc. my > sense is that there is less of an advantage. Seek and rotational latency I > suspect dwarf interface differences wrt performance. The HBA may be a > bigger bottleneck (and way more trouble). > > 500 GB NVMe seems like a lot per HDD, are you using that as WAL+DB with > RGW, or as dmcache or something? > > Depending on your constraints, QLC flash might be more competitive than > you think ;) > > — aad > > > > I suspect the behavior of the controller and the behavior of the drive > firmware will end up mattering more than SAS vs SATA. As always it's best > if you can test it first before committing to buying a pile of them. > Historically I have seen SATA drives that have performed well as far as > HDDs go though. > > > > > > Mark > > > > On 6/3/21 4:25 PM, Dave Hall wrote: > >> Hello, > >> > >> We're planning another batch of OSD nodes for our cluster. Our prior > nodes > >> have been 8 x 12TB SAS drives plus 500GB NVMe per HDD. Due to market > >> circumstances and the shortage of drives those 12TB SAS drives are in > short > >> supply. > >> > >> Our integrator has offered an option of 8 x 14TB SATA drives (still > >> Enterprise). For Ceph, will the switch to SATA carry a performance > >> difference that I should be concerned about? > >> > >> Thanks. > >> > >> -Dave > >> > >> -- > >> Dave Hall > >> Binghamton University > >> kdh...@binghamton.edu > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: SAS vs SATA for OSD - WAL+DB sizing.
Mark, We are running a mix of RGW, RDB, and CephFS. Our CephFS is pretty big, but we're moving a lot of it to RGW. What prompted me to go looking for a guideline was a high frequency of Spillover warnings as our cluster filled up past the 50% mark. That was with 14.2.9, I think. I understand that some things have changed since, but I think I'd like to have the flexibility and performance of a generous WAL+DB - the cluster is used to store research data, and the usage pattern is tending to change as the research evolves. No telling what our mix will be a year from now. -Dave -- Dave Hall Binghamton University kdh...@binghamton.edu 607-760-2328 (Cell) 607-777-4641 (Office) On Thu, Jun 3, 2021 at 7:39 PM Mark Nelson wrote: > FWIW, those guidelines try to be sort of a one-size-fits-all > recommendation that may not apply to your situation. Typically RBD has > pretty low metadata overhead so you can get away with smaller DB > partitions. 4% should easily be enough. If you are running heavy RGW > write workloads with small objects, you will almost certainly use more > than 4% for metadata (I've seen worst case up to 50%, but that was > before column family sharding which should help to some extent). Having > said that, bluestore will roll the higher rocksdb levels over to the > slow device and keep the wall, L0, and other lower LSM levels on the > fast device. It's not necessarily the end of the world if you end up > with some of the more rarely used metadata on the HDD but having it on > flash certain is nice. > > > Mark > > > On 6/3/21 5:18 PM, Dave Hall wrote: > > Anthony, > > > > I had recently found a reference in the Ceph docs that indicated > something > > like 40GB per TB for WAL+DB space. For a 12TB HDD that comes out to > > 480GB. If this is no longer the guideline I'd be glad to save a couple > > dollars. > > > > -Dave > > > > -- > > Dave Hall > > Binghamton University > > kdh...@binghamton.edu > > > > On Thu, Jun 3, 2021 at 6:10 PM Anthony D'Atri > > wrote: > > > >> Agreed. I think oh …. maybe 15-20 years ago there was often a wider > >> difference between SAS and SATA drives, but with modern queuing etc. my > >> sense is that there is less of an advantage. Seek and rotational > latency I > >> suspect dwarf interface differences wrt performance. The HBA may be a > >> bigger bottleneck (and way more trouble). > >> > >> 500 GB NVMe seems like a lot per HDD, are you using that as WAL+DB with > >> RGW, or as dmcache or something? > >> > >> Depending on your constraints, QLC flash might be more competitive than > >> you think ;) > >> > >> — aad > >> > >> > >>> I suspect the behavior of the controller and the behavior of the drive > >> firmware will end up mattering more than SAS vs SATA. As always it's > best > >> if you can test it first before committing to buying a pile of them. > >> Historically I have seen SATA drives that have performed well as far as > >> HDDs go though. > >>> > >>> Mark > >>> > >>> On 6/3/21 4:25 PM, Dave Hall wrote: > Hello, > > We're planning another batch of OSD nodes for our cluster. Our prior > >> nodes > have been 8 x 12TB SAS drives plus 500GB NVMe per HDD. Due to market > circumstances and the shortage of drives those 12TB SAS drives are in > >> short > supply. > > Our integrator has offered an option of 8 x 14TB SATA drives (still > Enterprise). For Ceph, will the switch to SATA carry a performance > difference that I should be concerned about? > > Thanks. 
> > -Dave > > -- > Dave Hall > Binghamton University > kdh...@binghamton.edu > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > > >>> ___ > >>> ceph-users mailing list -- ceph-users@ceph.io > >>> To unsubscribe send an email to ceph-users-le...@ceph.io > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io