[ceph-users] Re: radosgw lost config during upgrade 14.2.16 -> 14.2.21
Hello,

I believe you are hitting https://tracker.ceph.com/issues/50249. I've also
ended up configuring my rgw instances directly via /etc/ceph/ceph.conf for
the time being.

Hope this helps.

Arnaud

On Fri, 14 May 2021 at 22:04, Jan Kasprzak wrote:
>
> Hello,
>
> I have just upgraded my cluster from 14.2.16 to 14.2.21, and after the
> upgrade, radosgw was listening on the default port 7480 instead of the
> SSL port it used before the upgrade. It might be that I mishandled
> "ceph config assimilate-conf" previously, or forgot to restart radosgw
> after the assimilate-conf, or something. What is the correct way to
> store the radosgw configuration in ceph config?
>
> I have the following (which I think worked previously, but I might be
> wrong, e.g. forgot to restart radosgw or something):
>
> # ceph config dump
> [...]
> client.rgw.  basic  rgw_frontends  beast ssl_port=
>     ssl_certificate=/etc/pki/tls/certs/.crt+bundle
>     ssl_private_key=/etc/pki/tls/private/.key  *
>
> However, after rgw startup, there was the following in
> /var/log/ceph/ceph-client.rgw..log:
>
> 2021-05-14 21:38:35.075 7f6ffd621900  1 mgrc service_daemon_register
> rgw. metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph
> version 14.2.21 (5ef401921d7a88aea18ec7558f7f9374ebd8f5a6) nautilus
> (stable),ceph_version_short=14.2.21,cpu=AMD
> ...,distro=centos,distro_description=CentOS Linux 7
> (Core),distro_version=7,frontend_config#0=beast
> port=7480,frontend_type#0=beast,hostname=,kernel_description=#1 SMP
> ...,kernel_version=...,mem_swap_kb=...,mem_total_kb=...,num_handles=1,os=Linux,pid=20451,zone_id=...,zone_name=default,zonegroup_id=...,zonegroup_name=default}
>
> (Note the port=7480 and no SSL.)
>
> After adding the following to /etc/ceph/ceph.conf on the host where rgw
> is running, it started to use the correct SSL port again:
>
> [client.rgw.]
> rgw_frontends = beast ssl_port=
>     ssl_certificate=/etc/pki/tls/certs/.crt+bundle
>     ssl_private_key=/etc/pki/tls/private/.key
>
> How can I configure this using "ceph config"?
>
> Thanks,
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak |
> | http://www.fi.muni.cz/~kas/  GPG: 4096R/A45477D5 |
> We all agree on the necessity of compromise. We just can't agree on
> when it's necessary to compromise. --Larry Wall

--
Arnaud Lefebvre
Clever Cloud
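For readers landing here later, the two approaches discussed above look
roughly like the following. This is only a sketch: the gateway name "gw1",
port 443, the certificate paths and the systemd unit name are placeholders,
not values from the thread, and the centralized variant is exactly the path
that the tracker issue above reports as unreliable on the affected Nautilus
releases.

    # Centralized config store (what "ceph config" is meant to do):
    ceph config set client.rgw.gw1 rgw_frontends \
        "beast ssl_port=443 ssl_certificate=/etc/pki/tls/certs/rgw.crt+bundle ssl_private_key=/etc/pki/tls/private/rgw.key"
    ceph config dump | grep rgw_frontends
    systemctl restart ceph-radosgw@rgw.gw1   # unit name depends on the deployment

    # Per-host workaround in /etc/ceph/ceph.conf, as used by both posters:
    [client.rgw.gw1]
    rgw_frontends = beast ssl_port=443 ssl_certificate=/etc/pki/tls/certs/rgw.crt+bundle ssl_private_key=/etc/pki/tls/private/rgw.key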
[ceph-users] Re: Upgrade tips from Luminous to Nautilus?
On Fri, May 14, 2021 at 09:12:07PM +0200, Mark Schouten wrote:
> It seems (documentation was no longer available, so it took some
> searching) that I needed to run "ceph mds deactivate $fs:$rank" for
> every MDS I wanted to deactivate.

OK, so that helped for one of the MDSes. Trying to deactivate another MDS, it
started to release inos and dns entries until it was almost done. When it had
about 50 left, a client started to complain and was blacklisted until I
restarted the deactivated MDS. So no joy yet: I haven't managed to get down
to a single active MDS. Any ideas on how to achieve that are appreciated.

Thanks!

--
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl
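For anyone following along after the upgrade: on Nautilus the explicit
"ceph mds deactivate" command was dropped in favour of simply lowering
max_mds, after which the surplus ranks are stopped automatically. A minimal
sketch, assuming the filesystem is called "cephfs" (placeholder name):

    # Ask for a single active rank; the extra ranks move to "stopping"
    # and are torn down one by one.
    ceph fs set cephfs max_mds 1

    # Watch the ranks drain before proceeding.
    ceph fs status cephfs
    ceph status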
[ceph-users] Re: after upgrade from 16.2.3 to 16.2.4 and after adding a few HDDs, OSDs started to fail 1 by 1
Hi,

Today I had a very similar case: 2 NVMe OSDs got down and out. I had a
freshly installed 16.2.1 version. Before the failure the disks were under
some load, ~1.5k read IOPS + ~600 write IOPS. When they failed, nothing
helped. After every attempt to restart them I found log messages containing:

    bluefs _allocate unable to allocate

The bluestore_allocator was the default, i.e. hybrid. I changed it to bitmap,
just like in the issue mentioned by Neha Ojha (thanks), and the OSDs got in
and up. Now the disks are OK, but they are under very little load. Thus, I am
not certain whether the bitmap allocator is stable.

Kind regards,
--
Bartosz Lis

On 5/15/2021 01:10:57 CEST Igor Fedotov wrote:
> This looks similar to #50656 indeed.
>
> Hopefully will fix that next week.
>
> Thanks,
> Igor
>
> On 5/14/2021 9:09 PM, Neha Ojha wrote:
> > On Fri, May 14, 2021 at 10:47 AM Andrius Jurkus wrote:
> >> Hello, I will try to keep it sad and short :) :( (PS: sorry if this is
> >> a duplicate, I tried to post it from the web as well.)
> >>
> >> Today I upgraded from 16.2.3 to 16.2.4 and added a few hosts and OSDs.
> >> After data migration for a few hours, 1 SSD failed, then another and
> >> another, 1 by 1. Now I have the cluster in pause and 5 failed SSDs.
> >> The same host has both SSDs and HDDs, but only the SSDs are failing,
> >> so I think this has to be a rebalancing/backfilling bug or something,
> >> and probably not an upgrade bug.
> >>
> >> The cluster has been in pause for 4 hours and no more OSDs are failing.
> >>
> >> Full trace:
> >> https://pastebin.com/UxbfFYpb
> >
> > This looks very similar to https://tracker.ceph.com/issues/50656.
> > Adding Igor for more ideas.
[---]
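For reference, the allocator change Bartosz describes is a one-line config
switch that takes effect on OSD restart. A minimal sketch; osd.12 is a
placeholder id, and whether to apply it to a single OSD or to the whole osd
section is a judgment call:

    # Switch the BlueStore allocator from the default (hybrid) to bitmap,
    # either for all OSDs or for a single affected one:
    ceph config set osd bluestore_allocator bitmap
    ceph config set osd.12 bluestore_allocator bitmap

    # The allocator is chosen at startup, so restart the affected OSD(s):
    systemctl restart ceph-osd@12

    # Confirm what is configured:
    ceph config dump | grep bluestore_allocator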
[ceph-users] Re: CRUSH rule for EC 6+2 on 6-node cluster
Actually, neither of our solutions works very well. Frequently the same OSD
was chosen for multiple chunks:

8.72  9751 0 00 408955125760 0 1302 active+clean 2h 224790'12801 225410:49810
      [13,1,14,11,18,2,19,13]p13  [13,1,14,11,18,2,19,13]p13
      2021-05-11T22:41:11.332885+ 2021-05-11T22:41:11.332885+
8.7f  9695 0 00 406616801280 0 2184 active+clean 5h 224790'12850 225409:57529
      [8,17,4,1,14,0,19,8]p8  [8,17,4,1,14,0,19,8]p8
      2021-05-11T22:41:11.332885+ 2021-05-11T22:41:11.332885+

I'm now considering using device classes and assigning the OSDs to either
hdd1 or hdd2 (see the sketch after the quoted thread below)... Unless someone
has another idea?

Thanks,
Bryan

> On May 14, 2021, at 12:35 PM, Bryan Stillwell wrote:
>
> This works better than my solution. It allows the cluster to put more PGs
> on the systems with more space on them:
>
> # for pg in $(ceph pg ls-by-pool cephfs_data_ec62 -f json | jq -r '.pg_stats[].pgid'); do
> >   echo $pg
> >   for osd in $(ceph pg map $pg -f json | jq -r '.up[]'); do
> >     ceph osd find $osd | jq -r '.host'
> >   done | sort | uniq -c | sort -n -k1
> > done
> 8.0
>    1 excalibur
>    1 mandalaybay
>    2 aladdin
>    2 harrahs
>    2 paris
> 8.1
>    1 aladdin
>    1 excalibur
>    1 harrahs
>    1 mirage
>    2 mandalaybay
>    2 paris
> 8.2
>    1 aladdin
>    1 mandalaybay
>    2 harrahs
>    2 mirage
>    2 paris
> ...
>
> Thanks!
> Bryan
>
>> On May 13, 2021, at 2:58 AM, Ján Senko wrote:
>>
>> Would something like this work?
>>
>> step take default
>> step choose indep 4 type host
>> step chooseleaf indep 1 type osd
>> step emit
>> step take default
>> step choose indep 0 type host
>> step chooseleaf indep 1 type osd
>> step emit
>>
>> J.
>>
>> ‐‐‐ Original Message ‐‐‐
>>
>> On Wednesday, May 12th, 2021 at 17:58, Bryan Stillwell wrote:
>>
>>> I'm trying to figure out a CRUSH rule that will spread data out across
>>> my cluster as much as possible, but not more than 2 chunks per host.
>>>
>>> If I use the default rule with an osd failure domain like this:
>>>
>>> step take default
>>> step choose indep 0 type osd
>>> step emit
>>>
>>> I get clustering of 3-4 chunks on some of the hosts:
>>>
>>> for pg in $(ceph pg ls-by-pool cephfs_data_ec62 -f json | jq -r '.pg_stats[].pgid'); do
>>>   echo $pg
>>>   for osd in $(ceph pg map $pg -f json | jq -r '.up[]'); do
>>>     ceph osd find $osd | jq -r '.host'
>>>   done | sort | uniq -c | sort -n -k1
>>> done
>>>
>>> 8.0
>>>    1 harrahs
>>>    3 paris
>>>    4 aladdin
>>> 8.1
>>>    1 aladdin
>>>    1 excalibur
>>>    2 mandalaybay
>>>    4 paris
>>> 8.2
>>>    1 harrahs
>>>    2 aladdin
>>>    2 mirage
>>>    3 paris
>>> ...
>>>
>>> However, if I change the rule to use:
>>>
>>> step take default
>>> step choose indep 0 type host
>>> step chooseleaf indep 2 type osd
>>> step emit
>>>
>>> I get the data spread across 4 hosts with 2 chunks per host:
>>>
>>> for pg in $(ceph pg ls-by-pool cephfs_data_ec62 -f json | jq -r '.pg_stats[].pgid'); do
>>>   echo $pg
>>>   for osd in $(ceph pg map $pg -f json | jq -r '.up[]'); do
>>>     ceph osd find $osd | jq -r '.host'
>>>   done | sort | uniq -c | sort -n -k1
>>> done
>>>
>>> 8.0
>>>    2 aladdin
>>>    2 harrahs
>>>    2 mandalaybay
>>>    2 paris
>>> 8.1
>>>    2 aladdin
>>>    2 harrahs
>>>    2 mandalaybay
>>>    2 paris
>>> 8.2
>>>    2 harrahs
>>>    2 mandalaybay
>>>    2 mirage
>>>    2 paris
>>> ...
>>>
>>> Is it possible to get the data to spread out over more hosts? I plan on
>>> expanding the cluster in the near future and would like to see more
>>> hosts get 1 chunk instead of 2.
>>>
>>> Also, before you recommend adding two more hosts and switching to a
>>> host-based failure domain, the cluster is on a variety of hardware with
>>> between 2-6 drives per host and drives that are 4TB-12TB in size (it's
>>> part of my home lab).
>>>
>>> Thanks,
>>> Bryan
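A rough sketch of the device-class idea mentioned at the top of this thread,
purely illustrative: the class names hdd1/hdd2, the OSD ids and the rule name
are hypothetical, and the split would have to keep hosts and capacity roughly
balanced between the two classes.

    # Re-tag the OSDs into two artificial classes (example ids only):
    ceph osd crush rm-device-class osd.0 osd.1
    ceph osd crush set-device-class hdd1 osd.0
    ceph osd crush set-device-class hdd2 osd.1

    # Then, in the decompiled CRUSH map, a rule along these lines picks
    # 4 chunks from each class, at most one per host within a class
    # (so at most 2 chunks per host overall for the 6+2 pool):
    rule cephfs_ec62_split {
            id 9
            type erasure
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            step take default class hdd1
            step chooseleaf indep 4 type host
            step emit
            step take default class hdd2
            step chooseleaf indep 4 type host
            step emit
    }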
[ceph-users] After a huge amount of snapshot deletes, many snaptrim+snaptrim_wait PGs
Hi,

The user deleted 20-30 snapshots and clones from the cluster, and it seems to
have slowed down the whole system. I've set the snaptrim parameters as low as
possible and set buffered I/O to true so the user at least gets some speed,
but I can see the object removal from the cluster is still happening: it
started at 45 million objects and is now at 19 million. What I don't
understand is that many OSDs are getting more full :( And the snaptrim is
super slow: I have 195 PGs in snaptrim_wait and 36 in snaptrim, but only
about 1 completes every 5 hours :/

What can I do? One of the OSDs was at 62%; now it is at 75% after 2 days and
still growing. Should I set the snap options back, or something else?

The cluster has 3 servers, running Luminous 12.2.8.

Some paste: https://jpst.it/2vw4H

Thank you
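For context, the snaptrim parameters referred to above are presumably the
usual OSD throttles; a sketch of how they can be adjusted at runtime on
Luminous (the values are illustrative only, and injectargs changes are lost
on OSD restart unless also written to ceph.conf):

    # Throttle snap trimming to reduce client impact (slower trimming):
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 2 --osd_max_trimming_pgs 1 --osd_snap_trim_priority 1'

    # ...or speed trimming back up once the cluster has headroom:
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1 --osd_max_trimming_pgs 2'

    # Rough progress check: how many PGs are still in snaptrim/snaptrim_wait
    ceph pg dump pgs_brief 2>/dev/null | grep -c snaptrim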