[ceph-users] Periodically activating / peering on OSD add

2018-07-14 Thread Kevin Olbrich
Hi,

Why do I see activating followed by peering during OSD add (refill)?
I did not change pg(p)_num.

Is this normal? I don't recall this happening on my other clusters...

Kevin
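
For reference, the state transitions can be watched while the new OSD refills;
a rough sketch, assuming the standard luminous CLI:

  # follow cluster events and PG state changes as the OSD backfills
  ceph -w

  # or list only the PGs that are currently peering or activating
  ceph pg dump pgs_brief 2>/dev/null | egrep 'peering|activating'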
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Periodically activating / peering on OSD add

2018-07-14 Thread Kevin Olbrich
PS: It's luminous 12.2.5!


Kind regards / best regards,
Kevin Olbrich.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 12.2.6 CRC errors

2018-07-14 Thread Glen Baars
Hello Ceph users!

Note to users, don't install new servers on Friday the 13th!

We added a new Ceph node on Friday and it received the latest 12.2.6
update. I started to see CRC errors and investigated hardware issues, but have
since found that they are caused by the 12.2.6 release. About 80 TB has been
copied onto this server.

I have set noout, noscrub and nodeep-scrub and repaired the affected PGs
(ceph pg repair). This has cleared the errors.
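
For anyone repeating those steps, they map to roughly the following commands
(a sketch only; the PG id is a placeholder taken from the health output):

  # stop scrubbing and keep OSDs from being marked out
  ceph osd set noout
  ceph osd set noscrub
  ceph osd set nodeep-scrub

  # find the PGs reported inconsistent, then repair each one
  ceph health detail | grep inconsistent
  ceph pg repair <pgid>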

*** No idea if this is a good way to fix the issue. From the bug report, the
issue is in deep scrub, so I suppose stopping it will limit the damage. ***

Can anyone tell me what to do? A downgrade doesn't seem like it would fix the
issue. Maybe remove this node, rebuild it with 12.2.5 and resync the data? Or
wait a few days for 12.2.7?

Kind regards,
Glen Baars
This e-mail is intended solely for the benefit of the addressee(s) and any 
other named recipient. It is confidential and may contain legally privileged or 
confidential information. If you are not the recipient, any use, distribution, 
disclosure or copying of this e-mail is prohibited. The confidentiality and 
legal privilege attached to this communication is not waived or lost by reason 
of the mistaken transmission or delivery to you. If you have received this 
e-mail in error, please notify us immediately.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.6 CRC errors

2018-07-14 Thread Uwe Sauter

Hi Glen,

About 16 hours ago there was a notice on this list with the subject "IMPORTANT: broken luminous 12.2.6 release in repo, do
not upgrade" from Sage Weil (the main developer of Ceph).


Quote from this notice:

"tl;dr:  Please avoid the 12.2.6 packages that are currently present on
download.ceph.com.  We will have a 12.2.7 published ASAP (probably
Monday).

If you do not use bluestore or erasure-coded pools, none of the issues
affect you.


Details:

We built 12.2.6 and pushed it to the repos Wednesday, but as that was
happening realized there was a potentially dangerous regression in
12.2.5[1] that an upgrade might exacerbate.  While we sorted that issue
out, several people noticed the updated version in the repo and
upgraded.  That turned up two other regressions[2][3].  We have fixes for
those, but are working on an additional fix to make the damage from [3]
be transparently repaired."



Regards,

Uwe
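
For anyone who wants to keep further nodes from picking up 12.2.6 before 12.2.7
is published, a rough sketch (assuming Debian/Ubuntu packages; adapt for
yum-based systems):

  # check which release each running daemon actually reports
  ceph versions

  # hold the installed ceph packages so a routine upgrade does not pull in 12.2.6
  apt-mark hold ceph ceph-base ceph-common ceph-mon ceph-osd ceph-mgr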





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.6 CRC errors

2018-07-14 Thread Glen Baars
Thanks Uwe,

I saw that on the website.

Any idea if what I have done is correct? Do I now just wait?

Sent from my Cyanogen phone

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.6 CRC errors

2018-07-14 Thread Sage Weil
On Sat, 14 Jul 2018, Glen Baars wrote:
> Hello Ceph users!
> 
> Note to users, don't install new servers on Friday the 13th!
> 
> We added a new Ceph node on Friday and it received the latest 12.2.6 
> update. I started to see CRC errors and investigated hardware issues, but 
> have since found that they are caused by the 12.2.6 release. About 80 TB 
> has been copied onto this server.
> 
> I have set noout, noscrub and nodeep-scrub and repaired the affected PGs 
> (ceph pg repair). This has cleared the errors.
> 
> *** No idea if this is a good way to fix the issue. From the bug report, 
> the issue is in deep scrub, so I suppose stopping it will limit the 
> damage. ***
> 
> Can anyone tell me what to do? A downgrade doesn't seem like it would fix 
> the issue. Maybe remove this node, rebuild it with 12.2.5 and resync the 
> data? Or wait a few days for 12.2.7?

I would sit tight for now.  I'm working on the right fix and hope to 
have something to test shortly, and possibly a release by tomorrow.

The remaining danger is that, for objects with bad full-object 
digests, a read of the entire object will throw an EIO.  It's up 
to you whether you want to try to quiesce workloads to avoid that (to 
prevent corruption at higher layers) or to avoid a service 
degradation/outage.  :(  Unfortunately I don't have very precise guidance 
as to how likely that is.
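
One blunt way to quiesce everything, if a short planned stop is preferable to
risking bad reads, is to pause all client I/O cluster-wide (a sketch; this
affects every client until unpaused):

  # stop all reads and writes until the fix is in place, then resume
  ceph osd pause
  ceph osd unpause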

Are you using bluestore only, or is it a mix of bluestore and filestore?

sage
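
For readers wanting to check their own cluster, the object store backend for
each OSD is exposed in its metadata; a rough sketch, assuming the luminous CLI:

  # count bluestore vs. filestore across all OSDs
  ceph osd metadata | grep '"osd_objectstore"' | sort | uniq -c

  # and see whether any erasure-coded pools are in use
  ceph osd pool ls detail | grep erasure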


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD fails to start after power failure

2018-07-14 Thread David Young

Hey folks,

I have a Luminous 12.2.6 cluster which suffered a power failure 
recently. On recovery, one of my OSDs is continually crashing and 
restarting, with the error below:



9ae00 con 0
    -3> 2018-07-15 09:50:58.313242 7f131c5a9700 10 monclient: tick
    -2> 2018-07-15 09:50:58.313277 7f131c5a9700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 2018-07-15 
09:50:28.313274)
    -1> 2018-07-15 09:50:58.313320 7f131c5a9700 10 log_client  
log_queue is 8 last_log 10 sent 0 num 8 unsent 10 sending 10
 0> 2018-07-15 09:50:58.320255 7f131c5a9700 -1 
/build/ceph-12.2.6/src/common/LogClient.cc: In function 'Message* 
LogClient::_get_mon_log_message()' thread 7f131c5a9700 time 2018-07-15 
09:50:58.313336
/build/ceph-12.2.6/src/common/LogClient.cc: 294: FAILED 
assert(num_unsent <= log_queue.size())




I've found a few recent references to this "FAILED assert" message 
(assuming that's the cause of the problem), such as 
https://bugzilla.redhat.com/show_bug.cgi?id=1599718 and 
http://tracker.ceph.com/issues/18209, with the most recent occurrence 
being 3 days ago (http://tracker.ceph.com/issues/18209#note-12).


Is there any resolution to this issue, or anything I can attempt to recover?

Thanks!
D
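
To gather more detail for the tracker, the failing OSD can be run in the
foreground with higher log levels; a rough sketch (the OSD id is a placeholder):

  # stop the systemd unit, then run the OSD by hand with verbose logging
  systemctl stop ceph-osd@<id>
  ceph-osd -f -i <id> --setuser ceph --setgroup ceph \
      --debug_osd 20 --debug_ms 1 2>&1 | tee /tmp/ceph-osd.<id>.debug.log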


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD fails to start after power failure (with FAILED assert(num_unsent <= log_queue.size()) error)

2018-07-14 Thread David Young

Hey folks,

Sorry, posting this from a second account, since for some reason my 
primary account doesn't seem to be able to post to the list...


I have a Luminous 12.2.6 cluster which suffered a power failure 
recently. On recovery, one of my OSDs is continually crashing and 
restarting, with the error below:



9ae00 con 0
    -3> 2018-07-15 09:50:58.313242 7f131c5a9700 10 monclient: tick
    -2> 2018-07-15 09:50:58.313277 7f131c5a9700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 2018-07-15 
09:50:28.313274)
    -1> 2018-07-15 09:50:58.313320 7f131c5a9700 10 log_client  
log_queue is 8 last_log 10 sent 0 num 8 unsent 10 sending 10
 0> 2018-07-15 09:50:58.320255 7f131c5a9700 -1 
/build/ceph-12.2.6/src/common/LogClient.cc: In function 'Message* 
LogClient::_get_mon_log_message()' thread 7f131c5a9700 time 2018-07-15 
09:50:58.313336
/build/ceph-12.2.6/src/common/LogClient.cc: 294: FAILED 
assert(num_unsent <= log_queue.size())




I've found a few recent references to this "FAILED assert" message 
(assuming that's the cause of the problem), such as 
https://bugzilla.redhat.com/show_bug.cgi?id=1599718 and 
http://tracker.ceph.com/issues/18209, with the most recent occurrence 
being 3 days ago (http://tracker.ceph.com/issues/18209#note-12).


Is there any resolution to this issue, or anything I can attempt to recover?

Thanks!
D


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com