[ceph-users] Periodically activating / peering on OSD add
Hi,

why do I see activating followed by peering during OSD add (refill)? I did not change pg(p)_num.

Is this normal? From my other clusters, I don't think that happened...

Kevin
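For reference, the PG state transitions described above can be observed live from any admin node while the new OSD backfills. This is a minimal sketch using standard ceph CLI calls; the 5-second poll interval is an arbitrary choice:

    # Stream cluster events and PG state changes while the new OSD backfills
    ceph -w

    # Or poll a summary of PG states every few seconds
    watch -n 5 'ceph pg stat'

    # List PGs currently stuck inactive (e.g. peering / activating)
    ceph pg dump_stuck inactive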
Re: [ceph-users] Periodically activating / peering on OSD add
PS: It's luminous 12.2.5!

Best regards,
Kevin Olbrich.
[ceph-users] 12.2.6 CRC errors
Hello Ceph users!

Note to users: don't install new servers on Friday the 13th!

We added a new Ceph node on Friday and it received the latest 12.2.6 update. I started to see CRC errors and investigated hardware issues. I have since found that the errors are caused by the 12.2.6 release. About 80 TB has been copied onto this server.

I have set noout, noscrub and nodeep-scrub and repaired the affected PGs (ceph pg repair). This has cleared the errors.

*** No idea if this is a good way to fix the issue. From the bug report the issue is in the deep scrub, so I suppose stopping it will limit the damage. ***

Can anyone tell me what to do? A downgrade doesn't seem like it will fix the issue. Maybe remove this node, rebuild it with 12.2.5 and resync the data? Wait a few days for 12.2.7?

Kind regards,
Glen Baars
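The flag-and-repair sequence described above corresponds roughly to the commands below; this is a sketch, and the PG id is a placeholder that has to come from ceph health detail on the affected cluster:

    # Stop rebalancing and scrubbing while the broken release is installed
    ceph osd set noout
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # Find the PGs that health reports as inconsistent
    ceph health detail | grep inconsistent

    # Repair one affected PG (2.1ab is a placeholder PG id)
    ceph pg repair 2.1ab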
Re: [ceph-users] 12.2.6 CRC errors
Hi Glen,

about 16 hours ago there was a notice on this list with the subject "IMPORTANT: broken luminous 12.2.6 release in repo, do not upgrade" from Sage Weil (main developer of Ceph). Quote from this notice:

"tl;dr: Please avoid the 12.2.6 packages that are currently present on download.ceph.com. We will have a 12.2.7 published ASAP (probably Monday). If you do not use bluestore or erasure-coded pools, none of the issues affect you.

Details: We built 12.2.6 and pushed it to the repos Wednesday, but as that was happening realized there was a potentially dangerous regression in 12.2.5 [1] that an upgrade might exacerbate. While we sorted that issue out, several people noticed the updated version in the repo and upgraded. That turned up two other regressions [2][3]. We have fixes for those, but are working on an additional fix to make the damage from [3] be transparently repaired."

Regards,

Uwe
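Since the notice says only bluestore and erasure-coded pools are affected, it may help to check whether a given cluster falls into that category. A minimal sketch with standard CLI calls, run from any node with an admin keyring:

    # Which Ceph version is every daemon actually running? (Luminous and later)
    ceph versions

    # Are any pools erasure-coded?
    ceph osd pool ls detail | grep erasure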
Re: [ceph-users] 12.2.6 CRC errors
Thanks Uwe, I saw that on the website.

Any idea if what I have done is correct? Do I now just wait?

Sent from my Cyanogen phone
Re: [ceph-users] 12.2.6 CRC errors
On Sat, 14 Jul 2018, Glen Baars wrote:
> Can anyone tell me what to do? A downgrade doesn't seem like it will fix the issue. Maybe remove this node, rebuild it with 12.2.5 and resync the data? Wait a few days for 12.2.7?

I would sit tight for now. I'm working on the right fix and hope to have something to test shortly, and possibly a release by tomorrow.

The remaining danger is that, for objects with bad full-object digests, a read of the entire object will throw an EIO. It's up to you whether you want to try to quiesce workloads to avoid that (to prevent corruption at higher layers) or avoid a service degradation/outage. :( Unfortunately I don't have super precise guidance as far as how likely that is.

Are you using bluestore only, or is it a mix of bluestore and filestore?

sage
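Sage's question about bluestore versus filestore can be answered from the OSD metadata; a minimal sketch (the OSD id 0 is a placeholder):

    # Show the objectstore backend of a single OSD
    ceph osd metadata 0 | grep osd_objectstore

    # Or dump all OSDs' metadata and summarise the backends in use
    ceph osd metadata | grep '"osd_objectstore"' | sort | uniq -c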
[ceph-users] OSD fails to start after power failure
Hey folks,

I have a Luminous 12.2.6 cluster which suffered a power failure recently. On recovery, one of my OSDs is continually crashing and restarting, with the error below:

    9ae00 con 0
    -3> 2018-07-15 09:50:58.313242 7f131c5a9700 10 monclient: tick
    -2> 2018-07-15 09:50:58.313277 7f131c5a9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2018-07-15 09:50:28.313274)
    -1> 2018-07-15 09:50:58.313320 7f131c5a9700 10 log_client log_queue is 8 last_log 10 sent 0 num 8 unsent 10 sending 10
     0> 2018-07-15 09:50:58.320255 7f131c5a9700 -1 /build/ceph-12.2.6/src/common/LogClient.cc: In function 'Message* LogClient::_get_mon_log_message()' thread 7f131c5a9700 time 2018-07-15 09:50:58.313336
    /build/ceph-12.2.6/src/common/LogClient.cc: 294: FAILED assert(num_unsent <= log_queue.size())

I've found a few recent references to this "FAILED assert" message (assuming that's the cause of the problem), such as https://bugzilla.redhat.com/show_bug.cgi?id=1599718 and http://tracker.ceph.com/issues/18209, with the most recent occurrence being 3 days ago (http://tracker.ceph.com/issues/18209#note-12).

Is there any resolution to this issue, or anything I can attempt to recover?

Thanks!

D
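If this does match the tracker issue, the full backtrace and the exact daemon version are usually the first things worth collecting before asking upstream. A sketch, assuming the default log location on a Debian/Ubuntu install; osd.N is a placeholder for the crashing OSD's id:

    # Record the exact version of the crashing daemon (run on the affected host)
    ceph-osd --version

    # Extract the assert and the backtrace that follows it from the OSD log
    grep -A 40 'FAILED assert(num_unsent' /var/log/ceph/ceph-osd.N.log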