hi there,
we are seeing osds occasionally getting kicked out of our cluster after
having been marked down by other osds. most of the time, the affected
osd rejoins the cluster after ~5 minutes, but sometimes this takes
much longer. during that time, the osd seems to run just fine.
this happ
Hi,
I have the following Ceph Mimic setup:
- a bunch of old servers with 3-4 SATA drives each (74 OSDs in total)
- index/leveldb is stored on each OSD (so no SSD drives, just SATA)
- the current usage is:
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED
    542 TiB    105 TiB
Hi Andreas,
I made exactly the same observation in another scenario. I added some OSDs
while other OSDs were down.
This is expected.
The crush map drives an a-priori algorithm that computes the location of
objects without contacting a central server. Hence, *any* change of a crush
map while an OSD is
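As an aside, you can watch this client-side computation in action; the pool
and object names below are just placeholders:

# ask the cluster to run CRUSH for a given object and print the resulting
# PG and OSD set (computed from the osdmap/crushmap, not by asking the OSDs)
$ ceph osd map rbd my-object
# output looks something like:
# osdmap eNNN pool 'rbd' (1) object 'my-object' -> pg 1.xxxxxxxx (1.x) -> up ([3,7,12], p3) acting ([3,7,12], p3)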
Hi Thoralf,
given the following indication from your logs:
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211
7fb25cc80700 0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn
slow operation observed for _collection_list, latency = 96.337s, lat =
96s cid =2.0s2_head start
On Tue, May 19, 2020 at 2:06 PM Igor Fedotov wrote:
> Hi Thoralf,
>
> given the following indication from your logs:
>
> May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211
> 7fb25cc80700 0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn
> slow operation observed for _collection_list, latency = 96.337s, lat =
> 96s cid =2.0s2_head start
hi igor, hi paul -
thank you for your answers.
On 5/19/20 2:05 PM, Igor Fedotov wrote:
> I presume that your OSDs suffer from slow RocksDB access,
> collection_listing operation is the culprit in this case - listing 30 items
> takes 96 seconds to complete.
> From my experience such issues tend to ha
On Tue, May 19, 2020 at 3:11 PM thoralf schulze
wrote:
>
> On 5/19/20 2:13 PM, Paul Emmerich wrote:
> > 3) if necessary add more OSDs; a common problem is having very
> > few dedicated OSDs for the index pool; running the index on
> > all OSDs (and having a fast DB device for every disk) is
> > bet
Thoralf,
from your perf counter's dump:
"db_total_bytes": 15032377344,
"db_used_bytes": 411033600,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 94737203200,
"slow_used_bytes": 10714480640,
slow_used_bytes is non-zero hence you have a spillover.
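For anyone who wants to check their own OSDs for this, a quick (untested)
sketch; substitute your OSD id and run the daemon command on the host where
that OSD lives:

# on Nautilus and later, spillover also shows up as a health warning
$ ceph health detail | grep -i spillover
# per-OSD view: slow_used_bytes > 0 in the bluefs counters means DB data
# has spilled over to the slow device
$ ceph daemon osd.293 perf dump | grep -E '"(db|wal|slow)_(total|used)_bytes"'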
hi igor -
On 5/19/20 3:23 PM, Igor Fedotov wrote:
> slow_used_bytes is non-zero hence you have a spillover.
you are absolutely right, we do have spillovers on a large number of
osds. ceph tell osd.* compact is running right now.
> Additionally your DB volume size selection isn't perfect. For op
Hello everyone,
I'd like to set up a multisite ceph cluster.
Are there any sample setups that you can recommend studying?
I want to achieve fault tolerance but also I want to avoid split brain
scenarios.
I'm not that familiar with systems like ceph, so I would consider myself a
beginner.
Thanks
The updated images have not been pushed to Dockerhub yet. I ran into the
same problem yesterday trying to update. Hopefully updated images will be
pushed on release (at the same time as the tarball release/prior to
announcement) moving forward in order to avoid this issue.
See here for latest tags
In the docs (https://docs.ceph.com/docs/master/radosgw/multisite/), in the
section "Requirements and Assumptions", there is this warning:
"Running a single Ceph storage cluster is NOT recommended unless you have low
latency WAN connections."
What exactly does "single Ceph storage cluster" mean?
On Tue, May 12, 2020 at 6:03 AM Wido den Hollander wrote:
> And to add to this: No, a newly created RBD image will never have 'left
> over' bits and bytes from a previous RBD image.
>
> I had to explain this multiple times to people who were used to old
> (i)SCSI setups where partitions could ha
It is my understanding that it refers to running a single, normal ceph
cluster with its component hosts connected over WAN. This would
require OSDs to connect to other OSDs and mons over WAN for nearly
every operation, and is not likely to perform acceptably.
It is possible to run a ceph cluster over a WAN if you have a reliable
enough WAN with sites close enough for low-ish latency. The OSiRIS
project is architected that way with Ceph services spread evenly
across three university sites in Michigan. There's more information
and contact on their website
Hi Frank,
My understanding was that once a cluster is in a degraded state (an OSD
is down), ceph stores all changed cluster maps until the cluster is
healthy again, precisely so that missing objects can be found. If
there is a real disaster of some kind, and many OSDs go up and down at
vari
I have been running Ceph over a gigabit WAN for a few months now and have been
happy with it. Mine is set up with Strongswan tunnels and dynamic routing with
BIRD (although I would have used transport mode and iBGP in hindsight). I
generally have a 300-500 kbps flow with 5 ms latency.
What I spec
On Tue, May 19, 2020 at 10:34 AM Benjeman Meekhof wrote:
>
> It is possible to run a ceph cluster over a WAN if you have a reliable
> enough WAN with sites close enough for low-ish latency. The OSiRIS
> project is architected that way with Ceph services spread evenly
> across three university sites in Michigan.
Greg,
My name's Zac and I'm the docs guy for the Ceph Foundation. I have a
long-term plan to create a document that collects error codes and failure
cases, but I am only one man and it will be a few months before I can begin
on it.
Zac Dover
Ceph Docs Guy
On Wed, May 20, 2020 at 4:32 AM Gregory
Great, thanks already.
I will study the publications of the project :)
Zac, can you confirm that this assumption is true?
What does tiebreaker monitor mean? What exactly is its purpose?
You need a third monitor in order to form a quorum if one of the two
sites goes down. With only two sites, there is no safe way for them to
decide who is down.
On Tue, May 19, 2020 at 3:11 PM CodingSpiderFox
wrote:
>
> What does tiebreaker monitor mean? What exactly is its purpose?
>
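To make the layout concrete, a minimal sketch of a monitor placement with a
tiebreaker at a small third site (names and addresses are made up):

# ceph.conf fragment: two "data" sites plus a third site that only hosts the
# tiebreaker mon; any two of the three mons still form a majority (2 of 3)
# if one site is lost
[global]
mon initial members = site-a-mon, site-b-mon, site-c-tiebreaker
mon host = 10.0.1.10, 10.0.2.10, 10.0.3.10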
Hi Andreas,
the cluster map and the crush map are not the same thing. If you change the
crush map while the cluster is in a degraded state, you basically modify the
history of cluster maps explicitly and have to live with the consequences
(keeping history under crush map changes is limited to up+in
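Untested, but if you do have to touch the crush map, it is worth keeping a
copy of the current one first, along these lines:

# export the current (binary) crush map and decompile it to text for backup/review
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
# after editing crushmap.txt, recompile and - ideally only while the cluster
# is healthy - inject it back
$ crushtool -c crushmap.txt -o crushmap.new
$ ceph osd setcrushmap -i crushmap.new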
Hello Everyone,
I have installed both Prometheus and Grafana on one of my manager nodes (Ubuntu
18.04), and have configured both according to the documentation. The Grafana
dashboards are visible when visiting http://mon1:3000, but no data appears on
them. Python errors are shown for the
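(In case it helps anyone hitting the same symptom: the dashboards stay empty
whenever Prometheus cannot scrape the mgr module. A rough, untested checklist:)

# make sure the mgr prometheus module is enabled
$ ceph mgr module ls | grep -i prometheus
$ ceph mgr module enable prometheus
# the module listens on port 9283 by default; run this on the active mgr host
$ curl -s http://localhost:9283/metrics | head
# finally, check in the Prometheus web UI that this target is scraped and "UP"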
Hello Zac,
I have some further questions on that page:
Right before the section "Delete Default Zone Group and Zone" there is another
warning that says:
"The following steps assume a multi-site configuration using newly installed
systems that aren’t storing data yet. DO NOT DELETE the default
Hi,
I was browsing the dashboard today. Then suddenly it stopped working and I got
502 errors. I checked via root login and saw that ceph health is down to WARN.
I can access all RBD devices and CephFS; they work. All OSDs in server-1 are up.
health: HEALTH_WARN
1 hosts fail cepha
Hi again,
One more update:
I connected to server-2 and ran ceph -s there. I got:
Error initializing cluster client: ObjectNotFound('RADOS object not found
(error calling conf_read_file)')
Today I created an RBD pool and created 2 RBD images in this pool. Could this
be the reason for all dashboard
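(Side note on the ObjectNotFound / conf_read_file error: that usually just
means the ceph CLI on server-2 cannot find a config file. Roughly, and with
default paths that may differ on your setup:)

# the client needs a config file and a keyring to reach the cluster
$ ls -l /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring
# if they are missing, copy them from a working node, e.g.:
$ scp server-1:/etc/ceph/ceph.conf /etc/ceph/
$ scp server-1:/etc/ceph/ceph.client.admin.keyring /etc/ceph/
$ ceph -s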
Hi,
I'm using Nautilus and I'm using the whole cluster mainly for a single
bucket in RadosGW.
There is a lot of data in this bucket (petabyte scale) and I don't want to
waste all of my SSDs on it.
Is there any way to automatically set some aging threshold for this data and
e.g. move any data older than
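(One possible approach, not a definitive answer: since Nautilus, RGW supports
S3 lifecycle transitions between storage classes, provided the target storage
class has been defined in your zonegroup placement. Bucket name, class name
and the 90-day threshold below are placeholders.)

# lifecycle rule that moves objects older than 90 days to a colder storage class
$ cat > lifecycle.xml <<'EOF'
<LifecycleConfiguration>
  <Rule>
    <ID>age-out-to-cold</ID>
    <Prefix></Prefix>
    <Status>Enabled</Status>
    <Transition>
      <Days>90</Days>
      <StorageClass>COLD</StorageClass>
    </Transition>
  </Rule>
</LifecycleConfiguration>
EOF
# apply it to the bucket with any S3 client, e.g. s3cmd
$ s3cmd setlifecycle lifecycle.xml s3://my-big-bucket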
This happens (unfortunately) frequently to me. Look for the active mgr
(ceph -s), and go restart the mgr service there (systemctl list-units | grep
mgr, then systemctl restart NAMEOFSERVICE). This normally resolves that
error for me. You can look at the journalctl output and you'll likely see
errors
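For reference, the sequence looks roughly like this (unit names follow the
usual ceph-mgr@<short hostname> convention; adjust if yours differ):

# find the currently active mgr
$ ceph -s | grep mgr
# on that host, find and restart the mgr unit
$ systemctl list-units | grep ceph-mgr
$ sudo systemctl restart ceph-mgr@$(hostname -s)
# watch the log while it comes back up
$ journalctl -u ceph-mgr@$(hostname -s) -f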
Here's what I learned about PG maps from my investigation of the code.
First, they don't seem to be involved in deciding what needs reconstruction
when a dead OSD is revived. There is a version number stored with the PGs
that is probably used for that.
It looks like nothing but statistics - the
Hi,
take a look at 'ceph osd df' (maybe share the output) to see which
OSD(s) are full; they determine when a pool becomes full.
Did you delete lots of objects from that pool recently? It can take
some time until the space is actually freed.
Quoting "Szabo, Istvan (Agoda)":
Hi,
Please add 'ceph osd df' output, not 'ceph df'.
Quoting "Szabo, Istvan (Agoda)":
Hello,
No, I haven't deleted anything; this warning has been around for quite a long time.
ceph health detail
HEALTH_WARN 1 pool(s) full
POOL_FULL 1 pool(s) full
pool 'k8s' is full (no quota)
ceph df
GLOBAL:
SIZE AVAIL
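(Untested suggestion: since the warning says "no quota", the pool is most
likely marked full because one or more of its OSDs are (nearly) full rather
than because of a quota. Worth double-checking:)

# show any quota set on the pool (should be empty given the "no quota" message)
$ ceph osd pool get-quota k8s
# per-pool usage, including objects and raw space
$ ceph df detail
# per-OSD fill level - a single nearly-full OSD is enough to mark a pool full
$ ceph osd df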
Hi Eugen,
Thanks for your reply.
The problem is that all rbd images were removed from the rbd pool days ago,
i.e. both commands below return empty output:
$ rbd ls rbd
$ rados -p rbd listomapkeys rbd_directory
But rados df below still shows ~430K objects. Are there any other methods I
can use to dig out those ghost object
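A hedged suggestion for finding out what those objects actually are (the
trash check only applies if the images were moved to trash rather than
removed outright):

# sample the object names still present in the pool - the prefixes usually
# tell you what they belong to (rbd_data.*, rbd_header.*, journal_*, ...)
$ rados -p rbd ls | head -n 50
# images that were trashed / deferred-deleted don't show up in 'rbd ls'
$ rbd trash ls rbd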