[ceph-users] Cephfs Kernel client not working properly without ceph cluster IP
Hi,

I have mounted my CephFS (Ceph Octopus) through the kernel client on Debian. I get the following error in "dmesg" whenever I try to read any file from the mount:

    [ 236.429897] libceph: osd1 10.100.4.1:6891 socket closed (con state CONNECTING)

I use a public network (10.100.3.1) and a cluster network (10.100.4.1) in my Ceph cluster. I would expect the public network to be enough to mount the share and work on it, but in my case the client only works properly if I also give it an address on the cluster network.

Does anyone have experience with this? I mailed the ceph-users group earlier as well but did not get any response, so I am sending again in case my mail did not go through.

regards
Amudhan
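For reference, a minimal sketch of the split being described and of a kernel mount that only needs the public network; the subnet masks, secret file path and mount point are assumptions, not details from Amudhan's mail:

    # /etc/ceph/ceph.conf on the cluster nodes (sketch)
    [global]
        public_network  = 10.100.3.0/24   # monitors, MDS and the OSDs' client-facing addresses
        cluster_network = 10.100.4.0/24   # OSD-to-OSD replication and heartbeats only

    # Kernel client mount: only monitor addresses on the public network are given.
    # The client then talks to the OSDs at the public addresses published in the OSD map.
    mount -t ceph 10.100.3.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret

If "ceph osd dump" shows an OSD advertising a 10.100.4.x address as its public address, the client will try to connect there (as in the dmesg line above), so the public/cluster split on that OSD host is worth re-checking.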
[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
> On 8 Nov 2020, at 11:30, Tony Liu wrote:
>
> Is it FileStore or BlueStore? With this SSD-HDD solution, is the journal
> or WAL/DB on SSD or HDD? My understanding is that there is no
> benefit to putting the journal or WAL/DB on SSD with such a solution. It will
> also eliminate the single point of failure when having all WAL/DB
> on one SSD. Just want to confirm.

We are building a new cluster, so BlueStore. I think putting WAL/DB on SSD is more about performance. How is this related to eliminating a single point of failure? I am going to deploy WAL/DB on SSD for my HDD OSDs, and of course just use a single device for the SSD OSDs.

> Another thought is to have separate pools, like an all-SSD pool and
> an all-HDD pool. Each pool will be used for a different purpose. For example,
> image, backup and object can be in the all-HDD pool and VM volumes can be in
> the all-SSD pool.

Yes, I think the same.

> Thanks!
> Tony
>
>> -----Original Message-----
>> From: 胡 玮文
>> Sent: Monday, October 26, 2020 9:20 AM
>> To: Frank Schilder
>> Cc: Anthony D'Atri ; ceph-users@ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD
>> replicated pool
>>
>>> On 26 Oct 2020, at 15:43, Frank Schilder wrote:
>>>
>>>> I’ve never seen anything that implies that lead OSDs within an acting
>>>> set are a function of CRUSH rule ordering.
>>>
>>> This is actually a good question. I believed that I had seen/heard
>>> that somewhere, but I might be wrong.
>>>
>>> Looking at the definition of a PG, it states that a PG is an ordered
>>> set of OSD (IDs) and the first up OSD will be the primary. In other
>>> words, it seems that the lowest OSD ID is decisive. If the SSDs were
>>> deployed before the HDDs, they have the smallest IDs and, hence, will be
>>> preferred as primary OSDs.
>>
>> I don’t think this is correct. From my experiments, using the previously
>> mentioned CRUSH rule, no matter what the IDs of the SSD OSDs are, the
>> primary OSDs are always SSDs.
>>
>> I also had a look at the code. If I understand it correctly:
>>
>> * If the default primary affinity is not changed, the logic about
>>   primary affinity is skipped, and the primary is the first OSD
>>   returned by the CRUSH algorithm [1].
>> * The order of OSDs returned by CRUSH still matters if you changed the
>>   primary affinity. The affinity is the probability that a test
>>   succeeds. The first OSD is tested first and therefore has a higher
>>   probability of becoming primary. [2]
>> * If any OSD has primary affinity = 1.0, the test always succeeds,
>>   and no OSD after it will ever be primary.
>> * Suppose CRUSH returned 3 OSDs, each with primary affinity set to
>>   0.5. Then the 2nd OSD has probability 0.25 of being primary and the
>>   3rd probability 0.125; otherwise the 1st will be primary.
>> * If no test succeeds (suppose all OSDs have affinity 0), the 1st OSD
>>   will be primary as a fallback.
>>
>> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
>> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
>>
>> So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient
>> for them to be the primaries in my case.
>>
>> Do you think I should contribute these to the documentation?
>>
>>> This, however, is not a sustainable situation. Any addition of OSDs
>>> will mess this up and the distribution scheme will fail in the future. A
>>> way out seems to be:
>>>
>>> - subdivide your HDD storage using device classes:
>>>   * define a device class for HDDs with primary affinity=0, for example,
>>>     pick 5 HDDs and change their device class to hdd_np (for no primary)
>>>   * set the primary affinity of these HDD OSDs to 0
>>>   * modify your crush rule to use "step take default class hdd_np"
>>>   * this will create a pool with primaries on SSD and balanced storage
>>>     distribution between SSD and HDD
>>>   * all-HDD pools deployed as usual on class hdd
>>>   * when increasing capacity, one needs to take care of adding disks to
>>>     the hdd_np class and set their primary affinity to 0
>>>   * somewhat increased admin effort, but fully working solution
>>>
>>> Best regards,
>>> =
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
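For readers who want to try the scheme Frank outlines, a sketch of the commands and CRUSH rule involved; the OSD number, rule name/id and the hdd_np class name are placeholders rather than anything taken verbatim from this thread:

    # Move a chosen HDD OSD into the "no primary" class and clear its affinity
    ceph osd crush rm-device-class osd.12
    ceph osd crush set-device-class hdd_np osd.12
    ceph osd primary-affinity osd.12 0

    # CRUSH rule (edited into the decompiled map with crushtool): first replica
    # from SSD hosts, remaining replicas from hdd_np hosts.
    rule ssd_primary {
        id 5
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd_np
        step chooseleaf firstn -1 type host
        step emit
    }

Because the SSD host is emitted first, it ends up first in the PG's OSD list and, with default affinities, becomes the primary; setting affinity 0 on the hdd_np OSDs just keeps that guarantee if affinities are changed elsewhere.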
[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
Sorry for the confusion; what I meant to say is that having all WAL/DBs on one SSD will result in a single point of failure: if that SSD goes down, all OSDs depending on it will also stop working. What I'd like to confirm is that there is no benefit to putting the WAL/DB on SSD when there is either a cache tier or such a primary-SSD-with-HDD replication scheme, and that distributing the WAL/DB onto each HDD eliminates that single point of failure. So in your case, with SSD as the primary OSD, do you put the WAL/DB for the secondary HDDs on an SSD, or just keep it on each HDD?

Thanks!
Tony

> -----Original Message-----
> From: 胡 玮文
> Sent: Sunday, November 8, 2020 5:47 AM
> To: Tony Liu
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: The feasibility of mixed SSD and HDD
> replicated pool
>
>> On 8 Nov 2020, at 11:30, Tony Liu wrote:
>>
>> Is it FileStore or BlueStore? With this SSD-HDD solution, is the journal
>> or WAL/DB on SSD or HDD? My understanding is that there is no benefit
>> to putting the journal or WAL/DB on SSD with such a solution. It will also
>> eliminate the single point of failure when having all WAL/DB on one
>> SSD. Just want to confirm.
>
> We are building a new cluster, so BlueStore. I think putting WAL/DB on SSD
> is more about performance. How is this related to eliminating a single
> point of failure? I am going to deploy WAL/DB on SSD for my HDD OSDs, and
> of course just use a single device for the SSD OSDs.
>
>> Another thought is to have separate pools, like an all-SSD pool and
>> an all-HDD pool. Each pool will be used for a different purpose. For
>> example, image, backup and object can be in the all-HDD pool and VM
>> volumes can be in the all-SSD pool.
>
> Yes, I think the same.
>
>> Thanks!
>> Tony
>>> -----Original Message-----
>>> From: 胡 玮文
>>> Sent: Monday, October 26, 2020 9:20 AM
>>> To: Frank Schilder
>>> Cc: Anthony D'Atri ; ceph-users@ceph.io
>>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD
>>> replicated pool
>>>
>>>> On 26 Oct 2020, at 15:43, Frank Schilder wrote:
>>>>
>>>>> I’ve never seen anything that implies that lead OSDs within an acting
>>>>> set are a function of CRUSH rule ordering.
>>>>
>>>> This is actually a good question. I believed that I had seen/heard
>>>> that somewhere, but I might be wrong.
>>>>
>>>> Looking at the definition of a PG, it states that a PG is an ordered
>>>> set of OSD (IDs) and the first up OSD will be the primary. In other
>>>> words, it seems that the lowest OSD ID is decisive. If the SSDs were
>>>> deployed before the HDDs, they have the smallest IDs and, hence, will
>>>> be preferred as primary OSDs.
>>>
>>> I don’t think this is correct. From my experiments, using the previously
>>> mentioned CRUSH rule, no matter what the IDs of the SSD OSDs are, the
>>> primary OSDs are always SSDs.
>>>
>>> I also had a look at the code. If I understand it correctly:
>>>
>>> * If the default primary affinity is not changed, the logic about
>>>   primary affinity is skipped, and the primary is the first OSD
>>>   returned by the CRUSH algorithm [1].
>>> * The order of OSDs returned by CRUSH still matters if you changed
>>>   the primary affinity. The affinity is the probability that a test
>>>   succeeds. The first OSD is tested first and therefore has a higher
>>>   probability of becoming primary. [2]
>>> * If any OSD has primary affinity = 1.0, the test always succeeds,
>>>   and no OSD after it will ever be primary.
>>> * Suppose CRUSH returned 3 OSDs, each with primary affinity set to
>>>   0.5. Then the 2nd OSD has probability 0.25 of being primary and the
>>>   3rd probability 0.125; otherwise the 1st will be primary.
>>> * If no test succeeds (suppose all OSDs have affinity 0), the 1st OSD
>>>   will be primary as a fallback.
>>>
>>> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
>>> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
>>>
>>> So, setting the primary affinity of all SSD OSDs to 1.0 should be
>>> sufficient for them to be the primaries in my case.
>>>
>>> Do you think I should contribute these to the documentation?
>>>
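As a concrete illustration of the trade-off discussed above (not a recommendation from the thread, and the device names are placeholders): letting ceph-volume carve DB/WAL slices for several HDD OSDs out of one SSD makes that SSD a shared failure domain, while keeping DB/WAL on each HDD avoids it at the cost of performance.

    # Three HDD OSDs sharing one SSD/NVMe for their BlueStore DB+WAL:
    # if /dev/nvme0n1 dies, the OSDs on sdb, sdc and sdd all go down with it.
    ceph-volume lvm batch --bluestore /dev/sdb /dev/sdc /dev/sdd \
        --db-devices /dev/nvme0n1

    # No shared device: DB/WAL stay on the HDD itself.
    ceph-volume lvm create --bluestore --data /dev/sdb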
[ceph-users] Re: pg xyz is stuck undersized for long time
Hi Frank,

You said only one OSD is down, but the ceph status shows more than 20 OSDs down.

Regards,
Amudhan

On Sun 8 Nov, 2020, 12:13 AM Frank Schilder, wrote:

> Hi all,
>
> I moved the crush location of 8 OSDs and rebalancing went on happily
> (misplaced objects only). Today, osd.1 crashed, restarted and rejoined the
> cluster. However, it seems not to re-join some PGs it was a member of. I
> now have undersized PGs for no real reason I can see:
>
> PG_DEGRADED Degraded data redundancy: 52173/2268789087 objects degraded
> (0.002%), 2 pgs degraded, 7 pgs undersized
>     pg 11.52 is stuck undersized for 663.929664, current state
>     active+undersized+remapped+backfilling, last acting
>     [237,60,2147483647,74,233,232,292,86]
>
> The up and acting sets are:
>
>     "up":     [237, 2, 74, 289, 233, 232, 292, 86],
>     "acting": [237, 60, 2147483647, 74, 233, 232, 292, 86],
>
> How can I get the PG to complete peering and osd.1 to join? I have an
> unreasonable number of degraded objects where the missing part is on this
> OSD.
>
> For completeness, here is the cluster status:
>
> # ceph status
>   cluster:
>     id:     ...
>     health: HEALTH_ERR
>             noout,norebalance flag(s) set
>             1 large omap objects
>             35815902/2268938858 objects misplaced (1.579%)
>             Degraded data redundancy: 46122/2268938858 objects degraded
>             (0.002%), 2 pgs degraded, 7 pgs undersized
>             Degraded data redundancy (low space): 28 pgs backfill_toofull
>
>   services:
>     mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
>     mgr: ceph-01(active), standbys: ceph-03, ceph-02
>     mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
>     osd: 299 osds: 275 up, 275 in; 301 remapped pgs
>          flags noout,norebalance
>
>   data:
>     pools:   11 pools, 3215 pgs
>     objects: 268.8 M objects, 675 TiB
>     usage:   854 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     46122/2268938858 objects degraded (0.002%)
>              35815902/2268938858 objects misplaced (1.579%)
>              2907 active+clean
>              219  active+remapped+backfill_wait
>              47   active+remapped+backfilling
>              28   active+remapped+backfill_wait+backfill_toofull
>              6    active+clean+scrubbing+deep
>              5    active+undersized+remapped+backfilling
>              2    active+undersized+degraded+remapped+backfilling
>              1    active+clean+scrubbing
>
>   io:
>     client:   13 MiB/s rd, 196 MiB/s wr, 2.82 kop/s rd, 1.81 kop/s wr
>     recovery: 57 MiB/s, 14 objects/s
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
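Some generic commands for digging into a PG stuck like this (the PG id and OSD number are the ones from Frank's output; the rest is a sketch, not advice given in the thread). The 2147483647 in the acting set is CRUSH's "none" placeholder, i.e. no OSD currently fills that slot:

    # List stuck PGs and inspect the peering/recovery state of one of them
    ceph pg dump_stuck undersized
    ceph pg 11.52 query        # look at "recovery_state" / "peering_blocked_by"

    # Confirm osd.1 is up/in and see where CRUSH wants to place the PG
    ceph osd find 1
    ceph pg map 11.52

    # Briefly marking the OSD down forces it to re-peer with its PGs;
    # it rejoins immediately and no data is touched.
    ceph osd down osd.1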
[ceph-users] Re: [Suspicious newsletter] Re: Multisite sync not working - permission denied
Hi,

Update to 15.2.5. We had the same issue; the release notes don't mention anything regarding multisite, but once we updated, everything started to work with 15.2.5.

Best regards

From: Michael Breen
Sent: Friday, November 6, 2020 10:40 PM
To: ceph-users@ceph.io
Subject: [Suspicious newsletter] [ceph-users] Re: Multisite sync not working - permission denied

Continuing my fascinating conversation with myself:

The output of radosgw-admin sync status indicates that only the metadata is a problem, i.e., the data itself is syncing, and I have confirmed that. There is no S3 access to the secondary, zone-b, so I could not check replication that way, but having created a bucket on the primary, on the secondary I did

    rados -p zone-b.rgw.buckets.data ls

and saw the bucket had been replicated.

My current suspicion is that the user problem is an effect rather than a cause of the metadata sync problem. I have also discovered a setting debug_rgw_sync which increases the debug level only for the sync code, but found nothing interesting; the additional output seemed all to relate to data rather than metadata.

On Fri, 6 Nov 2020 at 11:47, Michael Breen <michael.br...@vikingenterprise.com> wrote:

I forgot to mention earlier attempted debugging: I believe this is not because the keys are wrong, but because it is looking for a user that is not seen on the secondary:

    debug 2020-11-03T16:37:47.330+ 7f32e9859700  5 req 60 0.00386s :post_period error reading user info, uid=ACCESS can't authenticate
    debug 2020-11-03T16:37:47.330+ 7f32e9859700 20 req 60 0.00386s :post_period rgw::auth::s3::LocalEngine denied with reason=-2028
    debug 2020-11-03T16:37:47.330+ 7f32e9859700 20 req 60 0.00386s :post_period rgw::auth::s3::AWSAuthStrategy denied with reason=-2028
    debug 2020-11-03T16:37:47.330+ 7f32e9859700  5 req 60 0.00386s :post_period Failed the auth strategy, reason=-2028
    debug 2020-11-03T16:37:47.330+ 7f32e9859700 10 failed to authorize request

src/rgw/rgw_common.h:
    #define ERR_INVALID_ACCESS_KEY 2028

./src/rgw/rgw_rest_s3.cc:
    if (rgw_get_user_info_by_access_key(ctl->user, access_key_id, user_info) < 0) {
      ldpp_dout(dpp, 5) << "error reading user info, uid=" << access_key_id
                        << " can't authenticate" << dendl;

On Fri, 6 Nov 2020 at 11:38, Michael Breen <michael.br...@vikingenterprise.com> wrote:

Hi,

radosgw-admin -v
ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)

Multisite sync was something I had working with a previous cluster and an earlier Ceph version, but it doesn't work now, and I can't understand why. If anyone with an idea of a possible cause could give me a clue I would be grateful. I have clusters set up using Rook, but as far as I can tell, that's not a factor.
On the primary cluster, I have this:

radosgw-admin zonegroup get --rgw-zonegroup zonegroup-a
{
    "id": "b115d74a-2d5f-4127-b621-0223f1e96c71",
    "name": "zonegroup-a",
    "api_name": "zonegroup-a",
    "is_master": "true",
    "endpoints": [
        "http://192.168.30.8:80"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "024687e0-1461-4f45-9149-9e571791c2b3",
    "zones": [
        {
            "id": "024687e0-1461-4f45-9149-9e571791c2b3",
            "name": "zone-a",
            "endpoints": [
                "http://192.168.30.8:80"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 11,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        },
        {
            "id": "6ba0ee26-0155-48f9-b057-2803336f0d66",
            "name": "zone-b",
            "endpoints": [
                "http://192.168.30.108:80"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 11,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": [],
            "storage_classes": [
                "STANDARD"
            ]
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "8c38fa05-c19d-4e30-bc98-e2bc84eccb68",
    "sync_policy": {
        "groups": []
    }
}

It's identical on the secondary (that's after a realm pull, an update of the zone-b endpoints, and a period commit), which I double-checked by piping the output to md5sum on both sides.

The system user created on the primary is

radosgw-admin user info --uid realm-a-system-user
{
    ...
    "keys": [
        {
            "user": "re
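For anyone chasing the same metadata-sync symptom before upgrading, a generic checklist of the usual radosgw-admin checks; the zone and user names follow the naming used above, everything else is an assumption rather than something taken from these mails:

    # On the secondary (zone-b): overall and metadata-specific sync state
    radosgw-admin sync status --rgw-zone=zone-b
    radosgw-admin metadata sync status --rgw-zone=zone-b

    # The system user's access/secret keys must match the system_key
    # configured in both zones
    radosgw-admin user info --uid=realm-a-system-user
    radosgw-admin zone get --rgw-zone=zone-b      # compare the "system_key" section

    # After fixing keys or endpoints, commit the period again and restart the RGWs
    radosgw-admin period update --commit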