https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2040483
https://bugs.launchpad.net/ubuntu/+source/containerd-app/+bug/2065423
I wonder if you're running into fallout from the above bug. I believe a fix
should be rolling out soon, according to those bugs. We ran into a multitude of
seemingly
I'm not sure, but that's going to break a lot of people's Pacific
specifications when they upgrade. We heavily utilize this functionality, and
use different device class names for a lot of good reasons. This seems like a
regression to me.
David
On Thu, Oct 3, 2024, at 16:20, Eugen Block wrote:
CLT discussion on Sep 09, 2024
19.2.0 release:
* Cherry picked patch: https://github.com/ceph/ceph/pull/59492
* Approvals requested for re-runs
CentOS Stream/distribution discussions ongoing
* Significant implications in infrastructure for building/testing requiring
ongoing discussions/work to d
Not at all, you're doing the right thing. That's exactly how I would do things
if I were setting out to deploy Ceph on bare metal today. Pick a very stable
underlying distribution and run Ceph in containers. That's exactly what I'm
doing on a massive scale, and it's been one of the best decisions
What operating system/distribution are you running? What hardware?
David
On Tue, Aug 6, 2024, at 02:20, Nicola Mori wrote:
> I think I found the problem. Setting the cephadm log level to debug and
> then watching the logs during the upgrade:
>
> ceph config set mgr mgr/cephadm/log_to_cluster_level debug
>
>> On Apr 24, 2024, at 15:37, David Orman wrote:
>>
>> Did you ever figure out what was happening here?
>>
>> David
>>
>> On Mon, May 29, 2023, at 07:16, Hector Martin wrote:
>>> On 29/05/2023 20.55, Anthony D'Atri wrote:
>>>
Did you ever figure out what was happening here?
David
On Mon, May 29, 2023, at 07:16, Hector Martin wrote:
> On 29/05/2023 20.55, Anthony D'Atri wrote:
>> Check the uptime for the OSDs in question
>
> I restarted all my OSDs within the past 10 days or so. Maybe OSD
> restarts are somehow breaking
I would suggest considering EC vs. replication for index data, and
the latency implications. There's more than just the nvme vs. rotational
discussion to entertain, especially if using the more widely spread EC modes
like 8+3. It would be worth testing for your particular workload.
That tracker's last update indicates it's slated for inclusion.
On Thu, Feb 1, 2024, at 10:47, Zakhar Kirpichenko wrote:
> Hi,
>
> Please consider not leaving this behind:
> https://github.com/ceph/ceph/pull/55109
>
> It's a serious bug, which potentially affects a whole node stability if
> the
Hi,
Just looking back through PyO3 issues, it would appear this functionality was
never supported:
https://github.com/PyO3/pyo3/issues/3451
https://github.com/PyO3/pyo3/issues/576
It just appears that attempting to use this functionality (which does not
work/exist) wasn't successfully prevented previously.
The "right" way to do this is to not run your metrics system on the cluster you
want to monitor. Use the provided metrics via the exporter and ingest them
using your own system (ours is Mimir/Loki/Grafana + related alerting), so if
you have failures of nodes/etc you still have access to, at a minimum,
Happy 2024!
Today's CLT meeting covered the following:
1. 2024 brings a focus on performance of Crimson (some information here:
https://docs.ceph.com/en/reef/dev/crimson/crimson/ )
   1. Status is available here: https://github.com/ceph/ceph.io/pull/635
   2. There will be a new Crimson perform
timeouts will likely
happen, so the impact will be non-zero, but it also won't be catastrophic.
David
On Fri, Nov 17, 2023, at 10:09, David Orman wrote:
> Use BGP/ECMP with something like exabgp on the haproxy servers.
>
> David
>
> On Fri, Nov 17, 2023, at 04:09, Boris Behrens wrote:
Use BGP/ECMP with something like exabgp on the haproxy servers.
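As a very rough sketch of the exabgp side (the peer address, ASNs, and the
announced VIP below are made-up placeholders, and health-checking of the local
haproxy is left out), each haproxy node announces the shared service address
and the upstream routers ECMP across the announcers:

neighbor 192.0.2.1 {              # upstream router
    router-id 192.0.2.10;
    local-address 192.0.2.10;     # this haproxy node
    local-as 65010;
    peer-as 65000;

    static {
        # shared RGW service address, announced from every haproxy node
        route 203.0.113.80/32 next-hop self;
    }
}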
David
On Fri, Nov 17, 2023, at 04:09, Boris Behrens wrote:
> Hi,
> I am looking for some experience on how people make their RGW public.
>
> Currently we use the following:
> 3 IP addresses that get distributed via keepalived between the
I would suggest updating: https://tracker.ceph.com/issues/59580
We did notice it with 16.2.13, as well, after upgrading from .10, so likely
in-between those two releases.
David
On Fri, Sep 8, 2023, at 04:00, Loïc Tortay wrote:
> On 07/09/2023 21:33, Mark Nelson wrote:
>> Hi Rok,
>>
>> We're st
Hi,
I do not believe this is actively being worked on, but there is a tracker open;
if you can submit an update, it may help attract attention/develop a fix:
https://tracker.ceph.com/issues/59580
David
On Fri, Sep 8, 2023, at 03:29, Chris Palmer wrote:
> I first posted this on 17 April but did
https://github.com/ceph/ceph/pull/48070 may be relevant.
I think this may have gone out in 16.2.11. I would tend to agree; personally,
this feels quite noisy at default logging levels for production clusters.
David
On Thu, Aug 31, 2023, at 11:17, Zakhar Kirpichenko wrote:
> This is happening to
I'm hoping to see at least one more, if not more than that, but I have no
crystal ball. I definitely support this idea, and strongly suggest it's given
some thought. There have been a lot of delays/missed releases due to all of the
lab issues, and it's significantly impacted the release cadence
Someone who's got data regarding this should file a bug report; it sounds like
a quick fix for defaults if this holds true.
On Sat, May 20, 2023, at 00:59, Hector Martin wrote:
> On 17/05/2023 03.07, 胡 玮文 wrote:
>> Hi Sake,
>>
>> We are experiencing the same. I set “osd_mclock_cost_per_byte_usec
You may want to consider disabling deep scrubs and scrubs while attempting to
complete a backfill operation.
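A minimal sketch of doing that with the cluster-wide flags (remember to unset
them once the backfill finishes):

ceph osd set noscrub
ceph osd set nodeep-scrub
# ...after backfill completes...
ceph osd unset noscrub
ceph osd unset nodeep-scrub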
On Tue, Apr 18, 2023, at 01:46, Eugen Block wrote:
> I didn't mean you should split your PGs now, that won't help because
> there is already backfilling going on. I would revert the pg_n
I've seen what appears to be the same post on Reddit previously, and attempted
to assist. My suspicion is that a "stop" command was passed to ceph orch upgrade
in an attempt to stop it, but with the --image flag preceding it, setting the
image to "stop". I asked the user to do an actual upgrade stop,
If it's a test cluster, you could try:
root@ceph01:/# radosgw-admin bucket check -h |grep -A1 check-objects
--check-objects bucket check: rebuilds bucket index according to
actual objects state
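If the check reports inconsistencies, my understanding is that the same
subcommand can also attempt the rebuild; the bucket name below is a
placeholder, and I'd test this on a non-production cluster first:

radosgw-admin bucket check --bucket=my-bucket --check-objects --fix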
On Wed, Feb 22, 2023, at 02:22, Robert Sander wrote:
> On 21
MB, 10 MiB) copied, 0.0675647 s, 155 MB/s
>>> --> Zapping successful for:
>>>
>>>
>>> root@ceph-a2-01:/# ceph orch device ls
>>>
>>> ceph-a1-06 /dev/sdm hdd TOSHIBA_X_X 16.0T 21m ago *locked*
>>>
>>>
>>> It shows l
ph-Dashboard.
>
>
> pgs: 3236 active+clean
>
>
> This is the new disk shown as locked (because unzapped at the moment).
>
> # ceph orch device ls
>
> ceph-a1-06 /dev/sdm hdd TOSHIBA_X_X 16.0T 9m ago
> locked
>
>
> Best
What does "ceph orch osd rm status" show before you try the zap? Is your
cluster still backfilling to the other OSDs for the PGs that were on the failed
disk?
David
On Fri, Jan 27, 2023, at 03:25, mailing-lists wrote:
> Dear Ceph-Users,
>
> i am struggling to replace a disk. My ceph-cluster is
I think this would be valuable to have easily accessible during runtime,
perhaps submit a report (and patch if possible)?
David
On Fri, Jan 13, 2023, at 08:14, Robert Sander wrote:
> Hi,
>
> Am 13.01.23 um 14:35 schrieb Konstantin Shalygin:
>
> > ceph-kvstore-tool bluestore-kv /var/lib/ceph/os
or
> everything, but there must be numerous Ceph sites with hundreds of OSD nodes,
> so I'm a bit surprised this isn't more automated...
>
> Cheers,
>
> Erik
>
> --
> Erik Lindahl
> On 10 Jan 2023 at 00:09 +0100, Anthony D'Atri wrote:
data losses, but for us we figured
> it's worth replacing a few outlier drives to sleep better.
>
> Cheers,
>
> Erik
>
> --
> Erik Lindahl
> On 9 Jan 2023 at 23:06 +0100, David Orman wrote:
> > "dmesg" on all the linux hosts and look for
"dmesg" on all the linux hosts and look for signs of failing drives. Look at
smart data, your HBAs/disk controllers, OOB management logs, and so forth. If
you're seeing scrub errors, it's probably a bad disk backing an OSD or OSDs.
Is there a common OSD in the PGs you've run the repairs on?
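For example, a quick first pass over each host might look something like this
(device names are placeholders; adjust for NVMe vs. SATA/SAS):

# kernel-level I/O errors
dmesg -T | grep -iE 'i/o error|medium error|blk_update_request'
# SMART health summary and full attribute/error log for a suspect drive
smartctl -H /dev/sdx
smartctl -a /dev/sdx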
On
Today's CLT meeting had the following topics of discussion:
* Docs questions
* crushtool options could use additional documentation
* This is being addressed
* sticky header on documentation pages obscuring titles when anchor links
are used
* There will be a follow-up email solic
This was a short meeting, and in summary:
* Testing of upgrades for 17.2.4 in Gibba commenced and slowness during
upgrade has been investigated.
* Workaround available; not a release blocker
Yes. Rotational drives can generally do 100-200 IOPS (some outliers, of
course). Do you have all forms of caching disabled on your storage
controllers/disks?
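To illustrate checking the drive-level write cache (the device name is a
placeholder; SAS drives may need sdparm instead):

# SATA: show whether the volatile write cache is enabled
hdparm -W /dev/sdx
# via smartmontools, works for SATA and many SAS drives
smartctl -g wcache /dev/sdx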
On Tue, Sep 6, 2022 at 11:32 AM Vladimir Brik <
vladimir.b...@icecube.wisc.edu> wrote:
> Setting osd_mclock_force_run_benchmark_on_init to true
https://github.com/ceph/ceph/pull/46480 - you can see the backports/dates
there.
Perhaps it isn't in the version you're running?
On Thu, Aug 4, 2022 at 7:51 AM Kenneth Waegeman
wrote:
> Hi all,
>
> I’m trying to deploy this spec:
>
> spec:
> data_devices:
> model: Dell Ent NVMe AGN MU U.2
Apologies, backport link should be: https://github.com/ceph/ceph/pull/46845
On Fri, Jul 15, 2022 at 9:14 PM David Orman wrote:
> I think you may have hit the same bug we encountered. Cory submitted a
> fix, see if it fits what you've encountered:
>
> https://github.com/cep
I think you may have hit the same bug we encountered. Cory submitted a fix,
see if it fits what you've encountered:
https://github.com/ceph/ceph/pull/46727 (backport to Pacific here:
https://github.com/ceph/ceph/pull/46877 )
https://tracker.ceph.com/issues/54172
On Fri, Jul 15, 2022 at 8:52 AM We
Is this something that makes sense to do the 'quick' fix on for the next
pacific release to minimize impact to users until the improved iteration
can be implemented?
On Tue, Jul 12, 2022 at 6:16 AM Igor Fedotov wrote:
> Hi Dan,
>
> I can confirm this is a regression introduced by
> https://githu
Here are the main topics of discussion during the CLT meeting today:
- make-check/API tests
- Ignoring the doc/ directory would skip an expensive git checkout
operation and save time
- Stale PRs
- Currently an issue with stalebot which is being investigated
- Cephalocon
Hi Robert,
We had the same question and ended up creating a PR for this:
https://github.com/ceph/ceph/pull/46480 - there are backports, as well, so
I'd expect it will be in the next release or two.
David
On Mon, Jun 27, 2022 at 8:07 AM Robert Reihs wrote:
> Hi,
> We are setting up a test clust
Are you thinking it might be a permutation of:
https://tracker.ceph.com/issues/53729 ? There are some posts in it to check
for the issue; #53 and #65 had a few potential ways to check.
On Fri, Jun 10, 2022 at 5:32 AM Marius Leustean
wrote:
> Did you check the mempools?
>
> ceph daemon osd.X dump_mempools
I agree with this: just because you can doesn't mean you should. It will
likely be significantly less painful to upgrade the infrastructure to
support doing this the more-correct way, vs. trying to layer swift on top
of cephfs. I say this having a lot of personal experience with Swift at
extremely
Is your client using the DeleteObjects call to delete 1000 per request?:
https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html
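As an illustration with the AWS CLI (the bucket name, endpoint, and keys are
placeholders), a client can batch up to 1000 keys per DeleteObjects request:

# objects.json lists the keys to delete in a single request
cat > objects.json <<'EOF'
{"Objects": [{"Key": "obj-0001"}, {"Key": "obj-0002"}], "Quiet": true}
EOF
aws s3api delete-objects --endpoint-url https://rgw.example.com \
  --bucket my-bucket --delete file://objects.json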
On Fri, Jun 3, 2022 at 9:35 AM J-P Methot
wrote:
> Read/writes are super fast. It's only deletes that are incredibly slow,
> both through the s3 api and
In your example, you can log in to the server in question with the OSD, and
run "ceph-volume lvm zap --osd-id <id> --destroy" and it will purge the
DB/WAL LV. You don't need to reapply your OSD spec; it will detect the
available space on the NVMe and redeploy that OSD.
On Wed, May 25, 2022 at 3:37 PM Ed
What was the largest cluster that you upgraded that didn't exhibit the new
> issue in 16.2.8 ? Thanks.
>
> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>
>
> On Tue, May 17, 2022 at 10:24 AM
We had an issue with our original fix in 45963 which was resolved in
https://github.com/ceph/ceph/pull/46096. It includes the fix as well as
handling for upgraded clusters. This is in the 16.2.8 release. I'm not sure
if it will resolve your problem (or help mitigate it) but it would be worth
trying
Hi,
I don't have any book suggestions, but in my experience, the best way to
learn is to set up a cluster and start intentionally breaking things, and
see how you can fix them. Perform upgrades, add load, etc.
I do suggest starting with Pacific (the upcoming 16.2.8 release would
likely be a good
https://tracker.ceph.com/issues/51429 with
https://github.com/ceph/ceph/pull/45088 for Octopus.
We're also working on: https://tracker.ceph.com/issues/55324 which is
somewhat related in a sense.
On Thu, Apr 21, 2022 at 11:19 AM Guillaume Nobiron
wrote:
> Yes, all the buckets in the reshard list
Is this a versioned bucket?
On Thu, Apr 21, 2022 at 9:51 AM Guillaume Nobiron
wrote:
> Hello,
>
> I have an issue on my ceph cluster (octopus 15.2.16) with several buckets
> raising a LARGE_OMAP_OBJECTS warning.
> I found the buckets in the resharding list but ceph fails to reshard them.
>
> The
We're definitely dealing with something that sounds similar, but hard to
state definitively without more detail. Do you have object lock/versioned
buckets in use (especially if one started being used around the time of the
slowdown)? Was this cluster always 16.2.7?
What is your pool configuration
Hi Gilles,
Did you ever figure this out? Also, your rados ls output indicates that the
prod cluster has fewer objects in the index pool than the backup cluster,
or am I misreading this?
David
On Wed, Dec 1, 2021 at 4:32 AM Gilles Mocellin <
gilles.mocel...@nuagelibre.org> wrote:
> Hello,
>
> We
We use it without major issues, at this point. There are still flaws, but
there are flaws in almost any deployment and management system, and this is
not unique to cephadm. I agree with the general sentiment that you need to
have some knowledge about containers, however. I don't think that's
necess
> > include that in the quincy release - and if not, we'll backport it to
> > quincy in an early point release
> >
> > can SSE-S3 with PutBucketEncryption satisfy your use case?
> >
> > [1]
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSide
Is RGW encryption for all objects at rest still testing only, and if not,
which version is it considered stable in?:
https://docs.ceph.com/en/latest/radosgw/encryption/#automatic-encryption-for-testing-only
David
What version of Ceph are you using? Newer versions deploy a dashboard and
prometheus module, which has some of this built in. It's a great start to
seeing what can be done using Prometheus and the built in exporter. Once
you learn this, if you decide you want something more robust, you can do an
ex
If performance isn't as big a concern, most servers have firmware settings
that enable more aggressive power saving, at the cost of added
latency/reduced cpu power/etc. HPE would be accessible/configurable via
HP's ILO, Dells with DRAC, etc. They'd want to test and see how much of an
impact it made
What are you trying to do that won't work? If you need resources from
outside the container, it doesn't sound like something you should need to
be entering a shell inside the container to accomplish.
On Fri, Jan 7, 2022 at 1:49 PM François RONVAUX
wrote:
> Thanks for the answer.
>
> I would want
What does iostat show for the drive in question? What you're seeing is the
cluster rebalancing initially; then, at the end, it's probably that single
drive being filled. I'd expect 25-100 MB/s to be the fill rate of the newly
added drive with backfills per OSD set to 2 or so (much more than that
doesn't
We've been testing RC1 since release on our 504 OSD / 21 host test cluster
with split db/wal, and have experienced no issues on upgrade or in operation
so far.
On Mon, Nov 29, 2021 at 11:23 AM Yuri Weinstein wrote:
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/53
> .72899 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 1 up
>
> Zach
>
> On 2021-12-01 5:20 PM, David Orman wrote:
>
> What's "ceph osd df" show?
>
> On Wed, Dec 1, 2021 at 2:20 PM Zach Heise (SSCC)
> wrote:
>
>> I want
What's "ceph osd df" show?
On Wed, Dec 1, 2021 at 2:20 PM Zach Heise (SSCC) wrote:
> I wanted to swap out an existing OSD, preserve the number, and then remove
> the HDD that had it (osd.14 in this case) and give the ID of 14 to a new
> SSD that would be taking its place in the same node. First
I suggest continuing with manual PG sizing for now. With 16.2.6 we have
seen the autoscaler scale up the device health metrics pool to 16000+ PGs on
brand new clusters, which we know is incorrect. It's on our company backlog
to investigate, but far down the backlog. It's bitten us enough times in
the past
The balancer does a pretty good job. It's the PG autoscaler that has bitten
us frequently enough that we always ensure it is disabled for all pools.
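For reference, a sketch of how to keep it off (the pool name is a placeholder):

# disable the autoscaler on an existing pool
ceph osd pool set my-pool pg_autoscale_mode off
# make 'off' the default for newly created pools
ceph config set global osd_pool_default_pg_autoscale_mode off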
David
On Mon, Nov 1, 2021 at 2:08 PM Alexander Closs wrote:
> I can add another 2 positive datapoints for the balancer, my personal and
> work clu
still looking for a more smooth way to do that.
>
> Luis Domingues
>
> ‐‐‐ Original Message ‐‐‐
>
> On Monday, October 4th, 2021 at 10:01 PM, David Orman <
> orma...@corenode.com> wrote:
>
> > We have an older cluster which has been iterated on many times.
If there's intent to use this for performance comparisons between releases,
I would propose that you include rotational drive(s), as well. It will be
quite some time before everyone is running pure NVME/SSD clusters with the
storage costs associated with that type of workload, and this should be
re
We have an older cluster which has been iterated on many times. It's
always been cephadm deployed, but I am certain the OSD specification
used has changed over time. I believe at some point, it may have been
'rm'd.
So here's our current state:
root@ceph02:/# ceph orch ls osd --export
service_type
It appears when an updated container for 16.2.6 (there was a remoto
version included with a bug in the first release) was pushed, the old
one was removed from quay. We had to update our 16.2.6 clusters to the
'new' 16.2.6 version, and just did the typical upgrade with the image
specified. This shou
We scrape all mgr endpoints since we use external Prometheus clusters,
as well. The query results will have {instance=activemgrhost}. The
dashboards in upstream don't have multiple cluster support, so we have
to modify them to work with our deployments since we have multiple
ceph clusters being polled.
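For reference, a minimal external scrape job along those lines might look like
the following; the hostnames and the default mgr prometheus module port (9283)
are assumptions for illustration:

scrape_configs:
  - job_name: 'ceph-mgr'
    # scrape every mgr; only the active one serves metrics, so results
    # carry instance=<active mgr host> as described above
    static_configs:
      - targets: ['ceph-mgr-01:9283', 'ceph-mgr-02:9283', 'ceph-mgr-03:9283']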
With recent releases, 'ceph config' is probably a better option; do
keep in mind this sets things cluster-wide. If you're just wanting to
target specific daemons, then tell may be better for your use case.
# get current value
ceph config get osd osd_max_backfills
# set a new value of 2, for example
ceph config set osd osd_max_backfills 2
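And for targeting a single daemon at runtime rather than cluster-wide config,
something along these lines (the OSD id is a placeholder):

# override on one daemon only, at runtime
ceph tell osd.12 config set osd_max_backfills 2
# or the older injectargs form
ceph tell osd.12 injectargs '--osd-max-backfills 2'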
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-4b2736a28c
^^ if people want to test and provide feedback for a potential merge
to EPEL8 stable.
David
On Wed, Sep 22, 2021 at 11:43 AM David Orman wrote:
>
> I'm wondering if this was installed using pip/pypi before, and now
I'm wondering if this was installed using pip/pypi before, and now
switched to using EPEL? That would explain it - 1.2.1 may never have
been pushed to EPEL.
David
On Wed, Sep 22, 2021 at 11:26 AM David Orman wrote:
>
> We'd worked on pushing a change to fix
> https://trac
cy bug, as it impacts any deployments
with medium to large counts of OSDs or split db/wal devices, like many
modern deployments.
https://koji.fedoraproject.org/koji/packageinfo?packageID=18747
https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/p/
Same question here, for clarity, was this on upgrading to 16.2.6 from
16.2.5? Or upgrading
from some other release?
On Mon, Sep 20, 2021 at 8:57 AM Sean wrote:
>
> I also ran into this with v16. In my case, trying to run a repair totally
> exhausted the RAM on the box, and was unable to complete
For clarity, was this on upgrading to 16.2.6 from 16.2.5? Or upgrading
from some other release?
On Mon, Sep 20, 2021 at 8:33 AM Paul Mezzanini wrote:
>
> I got the exact same error on one of my OSDs when upgrading to 16. I
> used it as an exercise in trying to fix a corrupt rocksdb. I spent a few
--
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> -------
>
> -Original Message-
> From: David Orman
> Sent: Tuesday, September 14, 2021 8:55 PM
> To: Eugen Block
> Cc: ceph-users
> Subject: [ceph-u
Keep in mind performance, as well. Once you start getting into higher
'k' values with EC, you've got a lot more drives involved that need to
return completions for operations, and on rotational drives this
becomes especially painful. We use 8+3 for a lot of our purposes, as
it's a good balance of e
No problem, and it looks like they will. Glad it worked out for you!
David
On Thu, Sep 9, 2021 at 9:31 AM mabi wrote:
>
> Thank you Eugen. Indeed the answer went to Spam :(
>
> So thanks to David for his workaround, it worked like a charm. Hopefully
> these patches can make it into the next pac
Exactly, we minimize the blast radius/data destruction by allocating
more devices for DB/WAL of smaller size than less of larger size. We
encountered this same issue on an earlier iteration of our hardware
design. With rotational drives and NVMEs, we are now aiming for a 6:1
ratio based on our CRUS
undeploy, then re-add the label, and it will
redeploy.
On Wed, Sep 8, 2021 at 7:03 AM David Orman wrote:
>
> This sounds a lot like: https://tracker.ceph.com/issues/51027 which is
> fixed in https://github.com/ceph/ceph/pull/42690
>
> David
>
> On Tue, Sep 7, 2021 a
This sounds a lot like: https://tracker.ceph.com/issues/51027 which is
fixed in https://github.com/ceph/ceph/pull/42690
David
On Tue, Sep 7, 2021 at 7:31 AM mabi wrote:
>
> Hello
>
> I have a test ceph octopus 16.2.5 cluster with cephadm out of 7 nodes on
> Ubuntu 20.04 LTS bare metal. I just u
It may be this:
https://tracker.ceph.com/issues/50526
https://github.com/alfredodeza/remoto/issues/62
Which we resolved with: https://github.com/alfredodeza/remoto/pull/63
What version of ceph are you running, and is it impacted by the above?
David
On Thu, Sep 2, 2021 at 9:53 AM fcid wrote:
>
>
> Without success. Also tried without the "filter_logic: AND" in the yaml file
> and the result was the same.
>
> Best regards,
> Eric
>
>
> -Original Message-
> From: David Orman [mailto:orma...@corenode.com]
> Sent: 27 August 2021 14:56
> To:
This was a bug in some versions of ceph, which has been fixed:
https://tracker.ceph.com/issues/49014
https://github.com/ceph/ceph/pull/39083
You'll want to upgrade Ceph to resolve this behavior, or you can use
size or something else to filter if that is not possible.
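As a rough illustration of filtering by size instead of model (the service id,
host pattern, and size range are placeholders), the spec could look something
like:

service_type: osd
service_id: osd_spec_by_size
placement:
  host_pattern: '*'
spec:
  data_devices:
    size: '10T:'      # only use data devices of 10 TB and larger
  db_devices:
    rotational: 0     # place DB/WAL on non-rotational devices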
David
On Thu, Aug 19, 2021
>
> - Am 9. Aug 2021 um 18:15 schrieb David Orman orma...@corenode.com:
>
> > Hi,
> >
> > We are seeing very similar behavior on 16.2.5, and also have noticed
> > that an undeploy/deploy cycle fixes things. Before we go rummaging
> > through the source code
Just adding our feedback - this is affecting us as well. We reboot
periodically to test durability of the clusters we run, and this is
fairly impactful. I could see power loss/other scenarios in which this
could end quite poorly for those with less than perfect redundancy in
DCs across multiple rac
Hi,
We are seeing very similar behavior on 16.2.5, and also have noticed
that an undeploy/deploy cycle fixes things. Before we go rummaging
through the source code trying to determine the root cause, has
anybody else figured this out? It seems odd that a repeatable issue
(I've seen other mailing l
https://tracker.ceph.com/issues/50526
https://github.com/alfredodeza/remoto/issues/62
If you're brave (YMMV, test first non-prod), we pushed an image with
the issue we encountered fixed as per above here:
https://hub.docker.com/repository/docker/ormandj/ceph/tags?page=1 that
you can use to install
Hi Peter,
We fixed this bug: https://tracker.ceph.com/issues/47738 recently
here:
https://github.com/ceph/ceph/commit/b4316d257e928b3789b818054927c2e98bb3c0d6
which should hopefully be in the next release(s).
David
On Thu, Jun 17, 2021 at 12:13 PM Peter Childs wrote:
>
> Found the issue in the
make it clear.
On Tue, Jun 1, 2021 at 2:30 AM David Orman wrote:
>
> I do not believe it was in 16.2.4. I will build another patched version of
> the image tomorrow based on that version. I do agree, I feel this breaks new
> deploys as well as existing, and hope a point release will
us since we began using it in
> luminous/mimic, but situations such as this are hard to look past. It's
> really unfortunate as our existing production clusters have been rock solid
> thus far, but this does shake one's confidence, and I would wager that I'm
> not al
on reboot the disks disappear, not stop working but not
>> detected by Linux, which makes me think I'm hitting some kernel limit.
>>
>> At this point I'm going to cut my loses and give up and use the small
>> slightly more powerful 30x drive systems I have (with 256g
You may be running into the same issue we ran into (make sure to read
the first issue, there's a few mingled in there), for which we
submitted a patch:
https://tracker.ceph.com/issues/50526
https://github.com/alfredodeza/remoto/issues/62
If you're brave (YMMV, test first non-prod), we pushed an image
We've found that after doing the osd rm, you can use: "ceph-volume lvm
zap --osd-id 178 --destroy" on the server with that OSD as per:
https://docs.ceph.com/en/latest/ceph-volume/lvm/zap/#removing-devices
and it will clean things up so they work as expected.
On Tue, May 25, 2021 at 6:51 AM Kai Sti
We've created a PR to fix the root cause of this issue:
https://github.com/alfredodeza/remoto/pull/63
Thank you,
David
On Mon, May 10, 2021 at 7:29 PM David Orman wrote:
>
> Hi Sage,
>
> We've got 2.0.27 installed. I restarted all the manager pods, just in
> case, and
> the problem. What version are you using? The
> kubic repos currently have 2.0.27. See
> https://build.opensuse.org/project/show/devel:kubic:libcontainers:stable
>
> We'll make sure the next release has the verbosity workaround!
>
> sage
>
> On Mon, May 10, 2021 at 5:4
WAL w/ 12 OSDs per NVME), even when new OSDs are not
being deployed, as it still tries to apply the OSD specification.
On Mon, May 10, 2021 at 4:03 PM David Orman wrote:
>
> Hi,
>
> We are seeing the mgr attempt to apply our OSD spec on the various
> hosts, then block. When we inv
Hi,
We are seeing the mgr attempt to apply our OSD spec on the various
hosts, then block. When we investigate, we see the mgr has executed
cephadm calls like so, which are blocking:
root 1522444 0.0 0.0 102740 23216 ? S 17:32 0:00
\_ /usr/bin/python3
/var/lib/ceph/X/cep
6.2.x. We
are using 16.2.3.
Thanks,
David
On Fri, May 7, 2021 at 9:06 AM David Orman wrote:
>
> Hi,
>
> I'm not attempting to remove the OSDs, but instead the
> service/placement specification. I want the OSDs/data to persist.
> --force did not work on the service, as noted
ption.
David
On Fri, May 7, 2021 at 4:21 PM Matt Benjamin wrote:
>
> Hi David,
>
> I think the solution is most likely the ops log. It is called for
> every op, and has the transaction id.
>
> Matt
>
> On Fri, May 7, 2021 at 4:58 PM David Orman wrote:
> >
can do
> that (and more) in "pacific" using lua scripting on the RGW:
> https://docs.ceph.com/en/pacific/radosgw/lua-scripting/
>
> Yuval
>
> On Thu, Apr 1, 2021 at 7:11 PM David Orman wrote:
>>
>> Hi,
>>
>> Is there any way to log the x-amz-request-id
r that everything was fine again. This is a Ceph 15.2.11 cluster on
> Ubuntu 20.04 and podman.
>
> Hope that helps.
>
> ‐‐‐ Original Message ‐‐‐
> On Friday, May 7, 2021 1:24 AM, David Orman wrote:
>
> > Has anybody run into a 'stuck' OSD service specification?
Has anybody run into a 'stuck' OSD service specification? I've tried
to delete it, but it's stuck in 'deleting' state, and has been for
quite some time (even prior to upgrade, on 15.2.x). This is on 16.2.3:
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
osd.osd_spec5