[ceph-users] Re: what-does-nosuchkey-error-mean-while-subscribing-for-notification-in-ceph

2021-04-16 Thread David Caro

What does notif.xml have in it?

Looking at the docs you linked, I'd say that it does not find the `S3Key` in
that XML for whatever reason.
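
For reference, the create-notification request in those docs expects something
along these lines -- a minimal sketch only; the id, topic ARN, event and filter
values are placeholders, and the `Filter`/`S3Key` part is optional:

<NotificationConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <TopicConfiguration>
    <Id>id1</Id>
    <Topic>arn:aws:sns:<region>::<topic-name></Topic>
    <Event>s3:ObjectCreated:*</Event>
    <Filter>
      <S3Key>
        <FilterRule><Name>prefix</Name><Value>images/</Value></FilterRule>
      </S3Key>
    </Filter>
  </TopicConfiguration>
</NotificationConfiguration>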

On 04/16 06:54, Szabo, Istvan (Agoda) wrote:
> Hi,
> 
> 
> I am trying to follow this URL
> https://docs.ceph.com/en/latest/radosgw/s3/bucketops/#create-notification
> 
> to create a notification on my bucket that publishes to a topic.
> 
> My curl:
> 
> curl -v -H 'Date: Fri, 16 Apr 2021 05:21:14 +' -H 'Authorization: AWS 
> accessid:secretkey' -L -H 'content-type: text/xml' -H 'Content-MD5: 
> pBRX39Oo7aAUYbilIYMoAw==' -T notif.xml http://ceph:8080/vig-test?notification
> 
> and it returns me this error
> 
> <Error>
>   <Code>NoSuchKey</Code>
>   <BucketName>vig-test</BucketName>
>   <RequestId>tx0016ac570-0060791ecb-1c7e96b-hkg</RequestId>
>   <HostId>1c7e96b-hkg-data</HostId>
> </Error>
> 
> Does anybody know what this error means in Ceph? How can I proceed?
> 
> 
> Thank you
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
David Caro
SRE - Cloud Services
Wikimedia Foundation 
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: s3 requires twice the space it should use

2021-04-16 Thread Boris Behrens
Could this also be failed multipart uploads?
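
A quick way to check that is to list leftover multipart uploads per bucket --
a sketch with the aws CLI; endpoint, bucket, key and upload id are placeholders:

aws --endpoint-url http://<rgw-host>:<port> s3api list-multipart-uploads --bucket <bucket>

# abort a leftover upload so its parts get cleaned up
aws --endpoint-url http://<rgw-host>:<port> s3api abort-multipart-upload \
    --bucket <bucket> --key <key> --upload-id <upload-id>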

On Thu, 15 Apr 2021 at 18:23, Boris Behrens wrote:

> Cheers,
>
> [root@s3db1 ~]#  ceph daemon osd.23 perf dump | grep numpg
> "numpg": 187,
> "numpg_primary": 64,
> "numpg_replica": 121,
> "numpg_stray": 2,
> "numpg_removing": 0,
>
>
> On Thu, 15 Apr 2021 at 18:18, 胡 玮文 wrote:
>
>> Hi Boris,
>>
>> Could you check something like
>>
>> ceph daemon osd.23 perf dump | grep numpg
>>
>> to see if there are some stray or removing PG?
>>
>> Weiwen Hu
>>
>> > On 15 Apr 2021, at 22:53, Boris Behrens wrote:
>> >
>> > Ah you are right.
>> > [root@s3db1 ~]# ceph daemon osd.23 config get
>> bluestore_min_alloc_size_hdd
>> > {
>> >"bluestore_min_alloc_size_hdd": "65536"
>> > }
>> > But I also checked how many objects our s3 holds, and the numbers just
>> > do not add up.
>> > There are only 26509200 objects, which would result in around 1 TB of
>> > "waste" even if every object were empty.
>> >
>> > I think the problem began when I updated the PG count from 1024 to 2048.
>> > Could there be an issue where the data is written twice?
>> >
>> >
>> >> On Thu, 15 Apr 2021 at 16:48, Amit Ghadge <amitg@gmail.com> wrote:
>> >>
>> >> Verify those two parameter values, bluestore_min_alloc_size_hdd &
>> >> bluestore_min_alloc_size_ssd. If you are using HDD disks, then
>> >> bluestore_min_alloc_size_hdd is the one that applies.
>> >>
>> >>> On Thu, Apr 15, 2021 at 8:06 PM Boris Behrens  wrote:
>> >>>
>> >>> So, I need to live with it? Does a value of zero mean the default is used?
>> >>> [root@s3db1 ~]# ceph daemon osd.23 config get
>> bluestore_min_alloc_size
>> >>> {
>> >>>"bluestore_min_alloc_size": "0"
>> >>> }
>> >>>
>> >>> I also checked the fragmentation on the bluestore OSDs and it is
>> around
>> >>> 0.80 - 0.89 on most OSDs. yikes.
>> >>> [root@s3db1 ~]# ceph daemon osd.23 bluestore allocator score block
>> >>> {
>> >>>"fragmentation_rating": 0.85906054329923576
>> >>> }
>> >>>
>> >>> The problem I currently have is that I can barely keep up with adding
>> >>> OSD disks.
>> >>>
>> >>> On Thu, 15 Apr 2021 at 16:18, Amit Ghadge <amitg@gmail.com> wrote:
>> >>>
>>  size_kb_actual is the actual bucket object size, but at the OSD level
>>  bluestore_min_alloc_size defaults to 64KB for HDD and 16KB for SSD
>> 
>> 
>> 
>> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/administration_guide/osd-bluestore
>> 
>>  -AmitG
>> 
>>  On Thu, Apr 15, 2021 at 7:29 PM Boris Behrens  wrote:
>> 
>> > Hi,
>> >
>> > maybe it is just a problem in my understanding, but it looks like our
>> > s3 requires twice the space it should use.
>> >
>> > I ran "radosgw-admin bucket stats", added up all the "size_kb_actual"
>> > values, and divided to get TB (/1024/1024/1024).
>> > The resulting space is 135.1636733 TB. When I triple it because of
>> > replication I end up with around 405 TB, which is nearly half the space
>> > of what ceph df tells me.
>> >
>> > Hope someone can help me.
>> >
>> > ceph df shows
>> > RAW STORAGE:
>> >     CLASS    SIZE        AVAIL      USED      RAW USED    %RAW USED
>> >     hdd      1009 TiB    189 TiB    820 TiB   820 TiB     81.26
>> >     TOTAL    1009 TiB    189 TiB    820 TiB   820 TiB     81.26
>> >
>> > POOLS:
>> >     POOL                         ID   PGS   STORED    OBJECTS   USED      %USED   MAX AVAIL
>> >     rbd                          0    64    0 B       0         0 B       0       18 TiB
>> >     .rgw.root                    1    64    99 KiB    119       99 KiB    0       18 TiB
>> >     eu-central-1.rgw.control     2    64    0 B       8         0 B       0       18 TiB
>> >     eu-central-1.rgw.data.root   3    64    1.0 MiB   3.15k     1.0 MiB   0       18 TiB
>> >     eu-central-1.rgw.gc          4    64    71 MiB    32        71 MiB    0       18 TiB
>> >     eu-central-1.rgw.log         5    64    267 MiB   564       267 MiB   0       18 TiB
>> >     eu-central-1.rgw.users.uid   6    64    2.8 MiB   6.91k     2.8 MiB   0       18 TiB
>> >     eu-central-1.rgw.users.keys  7    64    263 KiB   6.73k     263 KiB   0       18 TiB
>> >     eu-c

[ceph-users] Re: what-does-nosuchkey-error-mean-while-subscribing-for-notification-in-ceph

2021-04-16 Thread Yuval Lifshitz
the "Filter" tag is optional in the XML, so I don't think this is the issue.
Note that the bucket and topic have to exist when you create the
notification.

Can you try creating the notification using the AWS CLI tool instead of
CURL?
You can see examples here:
https://github.com/ceph/ceph/tree/master/examples/boto3
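
For example, with the AWS CLI -- a rough sketch; the endpoint is taken from your
curl command, the event list is just an example, and the topic has to exist
already:

aws --endpoint-url http://ceph:8080 s3api put-bucket-notification-configuration \
    --bucket vig-test \
    --notification-configuration '{"TopicConfigurations": [
        {"Id": "id1",
         "TopicArn": "arn:aws:sns:data::testcephevent",
         "Events": ["s3:ObjectCreated:*"]}]}'

If that works, the problem is likely in the hand-rolled XML or the request
signing rather than in the topic or bucket themselves.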

On Fri, Apr 16, 2021 at 12:12 PM Szabo, Istvan (Agoda) <
istvan.sz...@agoda.com> wrote:

> This one:
>
> <NotificationConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
>   <TopicConfiguration>
>     <Id>id1</Id>
>     <Topic>arn:aws:sns:data::testcephevent</Topic>
>   </TopicConfiguration>
> </NotificationConfiguration>
>
>
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> On 2021. Apr 16., at 14:58, David Caro  wrote:
>
> 
> What does notif.xml have in it?
>
> Looking at the docs you linked, I'd say that it does not find the `S3Key`
> in that XML for whatever reason.
>
> On 04/16 06:54, Szabo, Istvan (Agoda) wrote:
> Hi,
>
>
> I am trying to follow this URL
> https://docs.ceph.com/en/latest/radosgw/s3/bucketops/#create-notification
>
> to create a notification on my bucket that publishes to a topic.
>
> My curl:
>
> curl -v -H 'Date: Fri, 16 Apr 2021 05:21:14 +' -H 'Authorization: AWS
> accessid:secretkey' -L -H 'content-type: text/xml' -H 'Content-MD5:
> pBRX39Oo7aAUYbilIYMoAw==' -T notif.xml
> http://ceph:8080/vig-test?notification
>
> and it returns me this error
>
> <Error>
>   <Code>NoSuchKey</Code>
>   <BucketName>vig-test</BucketName>
>   <RequestId>tx0016ac570-0060791ecb-1c7e96b-hkg</RequestId>
>   <HostId>1c7e96b-hkg-data</HostId>
> </Error>
>
>
> Does anybody know what this error means in Ceph? How can I proceed?
>
>
> Thank you
>
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation 
> PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
> sum of all knowledge. That's our commitment."
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] what-does-nosuchkey-error-mean-while-subscribing-for-notification-in-ceph

2021-04-16 Thread Szabo, Istvan (Agoda)
Hi,


I am trying to follow this URL
https://docs.ceph.com/en/latest/radosgw/s3/bucketops/#create-notification

to create a notification on my bucket that publishes to a topic.

My curl:

curl -v -H 'Date: Fri, 16 Apr 2021 05:21:14 +' -H 'Authorization: AWS 
accessid:secretkey' -L -H 'content-type: text/xml' -H 'Content-MD5: 
pBRX39Oo7aAUYbilIYMoAw==' -T notif.xml http://ceph:8080/vig-test?notification

and it returns me this error

<Error>
  <Code>NoSuchKey</Code>
  <BucketName>vig-test</BucketName>
  <RequestId>tx0016ac570-0060791ecb-1c7e96b-hkg</RequestId>
  <HostId>1c7e96b-hkg-data</HostId>
</Error>


Does anybody know what this error means in Ceph? How can I proceed?


Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph/ceph-grafana docker image for arm64 missing

2021-04-16 Thread mabi
Hello,

I want to deploy a new ceph Octopus cluster using cephadm on arm64 architecture 
but unfortunately the ceph/ceph-grafana docker image for arm64 is missing.

Is this mailing list the right place to report this, or where should I report it?
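
In case it is useful to anyone hitting the same thing: cephadm can be pointed at
a different grafana image, so a multi-arch or self-built arm64 image can be
substituted until an official one exists. A sketch, with the image name as a
placeholder:

ceph config set mgr mgr/cephadm/container_image_grafana <arm64-capable-grafana-image>
# then redeploy the grafana daemon so it picks up the new image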

Best regards,
Mabi



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Can't get one OSD (out of 14) to start

2021-04-16 Thread Mark Johnson
Really not sure where to go with this one.  Firstly, a description of my 
cluster.  Yes, I know there are a lot of "not ideals" here but this is what I 
inherited.

The cluster is running Jewel and has two storage/mon nodes and an additional 
mon only node, with a pool size of 2.  Today, we had some power issues in the
data centre and we very ungracefully lost both storage servers at the same 
time.  Node 1 came back online before node 2 but I could see there were a few 
OSDs that were down.  When node 2 came back, I started trying to get OSDs up.  
Each node has 14 OSDs and I managed to get all OSDs up and in on node 2, but 
one of the OSDs on node 1 keeps starting and crashing and just won't stay up.  
I'm not finding the OSD log output to be much use.  Current health status looks 
like this:

# ceph health
HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs down; 26 
pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5 requests are 
blocked > 32 sec
# ceph status
cluster e2391bbf-15e0-405f-af12-943610cb4909
 health HEALTH_ERR
26 pgs are stuck inactive for more than 300 seconds
26 pgs down
26 pgs peering
26 pgs stuck inactive
26 pgs stuck unclean
5 requests are blocked > 32 sec

Any clues as to what I should be looking for or what sort of action I should be 
taking to troubleshoot this?  Unfortunately, I'm a complete novice with Ceph.

Here's a snippet from the OSD log that means little to me...

--- begin dump of recent events ---
 0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) 
**
 in thread 7f2e23921ac0 thread_name:ceph-osd

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f2e24330c2a]
 2: (()+0xf5d0) [0x7f2e21ee95d0]
 3: (gsignal()+0x37) [0x7f2e2049f207]
 4: (abort()+0x148) [0x7f2e204a08f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x267) [0x7f2e2442fd47]
 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) 
[0x7f2e2417bc7c]
 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) 
[0x7f2e240c8dce]
 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
 9: (OSD::init()+0x27d) [0x7f2e23d5828d]
 10: (main()+0x2c18) [0x7f2e23c71088]
 11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
 12: (()+0x3c8847) [0x7f2e23d07847]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.

Thanks in advance,
Mark

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: what-does-nosuchkey-error-mean-while-subscribing-for-notification-in-ceph

2021-04-16 Thread Szabo, Istvan (Agoda)
This one:

<NotificationConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <TopicConfiguration>
    <Id>id1</Id>
    <Topic>arn:aws:sns:data::testcephevent</Topic>
  </TopicConfiguration>
</NotificationConfiguration>



Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Apr 16., at 14:58, David Caro  wrote:


What does notif.xml have in it?

Looking at the docs you linked, I'd say that it does not find the `S3Key` in
that XML for whatever reason.

On 04/16 06:54, Szabo, Istvan (Agoda) wrote:
Hi,


I am trying to follow this URL
https://docs.ceph.com/en/latest/radosgw/s3/bucketops/#create-notification

to create a notification on my bucket that publishes to a topic.

My curl:

curl -v -H 'Date: Fri, 16 Apr 2021 05:21:14 +' -H 'Authorization: AWS 
accessid:secretkey' -L -H 'content-type: text/xml' -H 'Content-MD5: 
pBRX39Oo7aAUYbilIYMoAw==' -T notif.xml http://ceph:8080/vig-test?notification

and it returns me this error

<Error>
  <Code>NoSuchKey</Code>
  <BucketName>vig-test</BucketName>
  <RequestId>tx0016ac570-0060791ecb-1c7e96b-hkg</RequestId>
  <HostId>1c7e96b-hkg-data</HostId>
</Error>


Does anybody know what this error means in Ceph? How can I proceed?


Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
David Caro
SRE - Cloud Services
Wikimedia Foundation 
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can't get one OSD (out of 14) to start

2021-04-16 Thread Mark Johnson
I ran an fsck on the problem OSD and found and repaired a couple of errors.  
Remounted and started the OSD but it crashed again shortly after as before.  So 
(and possibly from bad advice) I figured I'd mark the OSD lost and let it write
out the pgs to other OSDs which it's in the process of backfilling.  However, 
I'm seeing 1 down+incomplete and 3 incomplete and I'm expecting that these 
won't recover.

So, would love to know what my options are here when all the backfilling has 
finished (or stalled).  Losing data or even entire PGs isn't a big problem as 
this cluster is really just a replica of our main cluster so we can restore 
lost objects manually from there.  Is there a way I can clear 
out/repair/whatever these pgs so I can get a healthy cluster again?

Yes, I know this would have probably been easier with an additional storage 
server and a pool size of 3.  But that's not going to help me right now.



-Original Message-
From: Mark Johnson <ma...@iovox.com>
To: ceph-users@ceph.io
Subject: [ceph-users] Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 12:43:33 +


Really not sure where to go with this one.  Firstly, a description of my 
cluster.  Yes, I know there are a lot of "not ideals" here but this is what I 
inherited.


The cluster is running Jewel and has two storage/mon nodes and an additional 
mon only node, with a pool size of 2.  Today, we had some power issues in the
data centre and we very ungracefully lost both storage servers at the same 
time.  Node 1 came back online before node 2 but I could see there were a few 
OSDs that were down.  When node 2 came back, I started trying to get OSDs up.  
Each node has 14 OSDs and I managed to get all OSDs up and in on node 2, but 
one of the OSDs on node 1 keeps starting and crashing and just won't stay up.  
I'm not finding the OSD log output to be much use.  Current health status looks 
like this:


# ceph health

HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs down; 26 
pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5 requests are 
blocked > 32 sec

# ceph status

cluster e2391bbf-15e0-405f-af12-943610cb4909

 health HEALTH_ERR

26 pgs are stuck inactive for more than 300 seconds

26 pgs down

26 pgs peering

26 pgs stuck inactive

26 pgs stuck unclean

5 requests are blocked > 32 sec


Any clues as to what I should be looking for or what sort of action I should be 
taking to troubleshoot this?  Unfortunately, I'm a complete novice with Ceph.


Here's a snippet from the OSD log that means little to me...


--- begin dump of recent events ---

 0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) 
**

 in thread 7f2e23921ac0 thread_name:ceph-osd


 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)

 1: (()+0x9f1c2a) [0x7f2e24330c2a]

 2: (()+0xf5d0) [0x7f2e21ee95d0]

 3: (gsignal()+0x37) [0x7f2e2049f207]

 4: (abort()+0x148) [0x7f2e204a08f8]

 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x267) [0x7f2e2442fd47]

 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) 
[0x7f2e2417bc7c]

 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) 
[0x7f2e240c8dce]

 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]

 9: (OSD::init()+0x27d) [0x7f2e23d5828d]

 10: (main()+0x2c18) [0x7f2e23c71088]

 11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]

 12: (()+0x3c8847) [0x7f2e23d07847]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.


Thanks in advance,

Mark


___

ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Octopus - unbalanced OSDs

2021-04-16 Thread Ml Ml
Hello List,

Any ideas why my OSDs are that unbalanced?

root@ceph01:~# ceph -s
  cluster:
id: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
health: HEALTH_WARN
1 nearfull osd(s)
4 pool(s) nearfull

  services:
mon: 3 daemons, quorum ceph03,ceph01,ceph02 (age 2w)
mgr: ceph03(active, since 4M), standbys: ceph02.jwvivm
mds: backup:1 {0=backup.ceph06.hdjehi=up:active} 3 up:standby
osd: 56 osds: 56 up (since 29h), 56 in (since 3d)

  task status:
scrub status:
mds.backup.ceph06.hdjehi: idle

  data:
pools:   4 pools, 1185 pgs
objects: 24.29M objects, 44 TiB
usage:   151 TiB used, 55 TiB / 206 TiB avail
pgs: 675 active+clean
 476 active+clean+snaptrim_wait
 30  active+clean+snaptrim
 4   active+clean+scrubbing+deep

root@ceph01:~# ceph osd df tree
ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         206.79979         -  206 TiB  151 TiB  151 TiB   36 GiB  503 GiB   55 TiB  73.23  1.00    -          root default
 -2          28.89995         -   29 TiB   20 TiB   20 TiB  5.5 GiB   74 GiB  8.9 TiB  69.19  0.94    -          host ceph01
  0  hdd           2.7       1.0  2.7 TiB  1.8 TiB  1.8 TiB  590 MiB  6.9 GiB  908 GiB  66.81  0.91   44      up  osd.0
  1  hdd           2.7       1.0  2.7 TiB  1.6 TiB  1.6 TiB  411 MiB  6.5 GiB  1.1 TiB  60.43  0.83   39      up  osd.1
  4  hdd           2.7       1.0  2.7 TiB  1.8 TiB  1.8 TiB  501 MiB  6.8 GiB  898 GiB  67.15  0.92   43      up  osd.4
  8  hdd           2.7       1.0  2.7 TiB  2.0 TiB  2.0 TiB  453 MiB  7.0 GiB  700 GiB  74.39  1.02   47      up  osd.8
 11  hdd           1.7       1.0  1.7 TiB  1.3 TiB  1.3 TiB  356 MiB  5.6 GiB  433 GiB  75.39  1.03   31      up  osd.11
 12  hdd           2.7       1.0  2.7 TiB  2.1 TiB  2.1 TiB  471 MiB  7.0 GiB  591 GiB  78.40  1.07   48      up  osd.12
 14  hdd           2.7       1.0  2.7 TiB  1.6 TiB  1.6 TiB  448 MiB  6.0 GiB  1.1 TiB  59.68  0.82   38      up  osd.14
 18  hdd           2.7       1.0  2.7 TiB  1.7 TiB  1.7 TiB  515 MiB  6.2 GiB  980 GiB  64.15  0.88   41      up  osd.18
 22  hdd           1.7       1.0  1.7 TiB  1.2 TiB  1.2 TiB  360 MiB  4.2 GiB  491 GiB  72.06  0.98   29      up  osd.22
 30  hdd           1.7       1.0  1.7 TiB  1.2 TiB  1.2 TiB  366 MiB  4.7 GiB  558 GiB  68.26  0.93   28      up  osd.30
 33  hdd           1.5       1.0  1.6 TiB  1.2 TiB  1.2 TiB  406 MiB  4.9 GiB  427 GiB  74.28  1.01   29      up  osd.33
 64  hdd           3.2       1.0  3.3 TiB  2.4 TiB  2.4 TiB  736 MiB  8.6 GiB  915 GiB  73.22  1.00   60      up  osd.64
 -3          29.69995         -   30 TiB   22 TiB   22 TiB  5.4 GiB   81 GiB  7.9 TiB  73.20  1.00    -          host ceph02
  2  hdd           1.7       1.0  1.7 TiB  1.3 TiB  1.2 TiB  402 MiB  5.2 GiB  476 GiB  72.93  1.00   30      up  osd.2
  3  hdd           2.7       1.0  2.7 TiB  2.0 TiB  2.0 TiB  653 MiB  7.8 GiB  652 GiB  76.15  1.04   49      up  osd.3
  7  hdd           2.7       1.0  2.7 TiB  2.5 TiB  2.5 TiB  456 MiB  7.7 GiB  209 GiB  92.36  1.26   56      up  osd.7
  9  hdd           2.7       1.0  2.7 TiB  1.9 TiB  1.9 TiB  434 MiB  7.2 GiB  781 GiB  71.46  0.98   46      up  osd.9
 13  hdd           2.3       1.0  2.4 TiB  1.6 TiB  1.6 TiB  451 MiB  6.1 GiB  823 GiB  66.28  0.91   38      up  osd.13
 16  hdd           2.7       1.0  2.7 TiB  1.6 TiB  1.6 TiB  375 MiB  6.4 GiB  1.1 TiB  59.84  0.82   39      up  osd.16
 19  hdd           1.7       1.0  1.7 TiB  1.1 TiB  1.1 TiB  323 MiB  4.7 GiB  601 GiB  65.80  0.90   27      up  osd.19
 23  hdd           2.7       1.0  2.7 TiB  2.2 TiB  2.2 TiB  471 MiB  7.7 GiB  520 GiB  80.99  1.11   50      up  osd.23
 24  hdd           1.7       1.0  1.7 TiB  1.4 TiB  1.4 TiB  371 MiB  5.5 GiB  273 GiB  84.44  1.15   32      up  osd.24
 28  hdd           2.7       1.0  2.7 TiB  1.9 TiB  1.9 TiB  428 MiB  7.4 GiB  818 GiB  70.07  0.96   44      up  osd.28
 31  hdd           2.7       1.0  2.7 TiB  2.0 TiB  2.0 TiB  516 MiB  7.4 GiB  660 GiB  75.85  1.04   48      up  osd.31
 32  hdd           3.2       1.0  3.3 TiB  2.2 TiB  2.2 TiB  661 MiB  7.9 GiB  1.2 TiB  64.86  0.89   52      up  osd.32
 -4          26.29996         -   26 TiB   18 TiB   18 TiB  4.3 GiB   73 GiB  8.0 TiB  69.58  0.95    -          host ceph03
  5  hdd           1.7       1.0  1.7 TiB  1.2 TiB  1.2 TiB  298 MiB  5.2 GiB  541 GiB  69.21  0.95   29      up  osd.5
  6  hdd           1.7       1.0  1.7 TiB  1.0 TiB  1.0 TiB  321 MiB  4.4 GiB  697 GiB  60.34  0.82   25      up  osd.6
 10  hdd           2.7       1.0  2.7 TiB  1.9 TiB  1.9 TiB  431 MiB  7.5 GiB  796 GiB  70.89  0.97   46      up  osd.10
 15  hdd           2.7       1.0  2.7 TiB  1.9 TiB  1.9 TiB  500 MiB  6.6 GiB  805 GiB  70.5
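
For whoever picks this up, a sketch of the usual first checks for this kind of
imbalance on Octopus -- the upmap mode assumes all clients are luminous or
newer, so treat it as an example rather than a recommendation for this exact
cluster:

ceph balancer status
ceph osd set-require-min-compat-client luminous   # required before upmap mode
ceph balancer mode upmap
ceph balancer on

# or a one-off pass, dry run first:
ceph osd test-reweight-by-utilization
ceph osd reweight-by-utilization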

[ceph-users] Re: Can't get one OSD (out of 14) to start

2021-04-16 Thread Alex Gorbachev
Hi Mark,

I wonder if the following will help you:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/

There are instructions there on how to mark unfound PGs lost and delete
them.  You will regain a healthy cluster that way, and then you can adjust
replica counts etc to best practice, and restore your objects.
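
The relevant commands look roughly like this (a sketch only; substitute your own
PG ids, and note that marking objects lost is irreversible):

ceph health detail                         # lists the problem PGs
ceph pg <pgid> query                       # peering state, might_have_unfound
ceph pg <pgid> mark_unfound_lost revert    # or "delete" if no other copy exists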

Best regards,
--
Alex Gorbachev
ISS/Storcium



On Fri, Apr 16, 2021 at 10:51 AM Mark Johnson  wrote:

> I ran an fsck on the problem OSD and found and repaired a couple of
> errors.  Remounted and started the OSD but it crashed again shortly after
> as before.  So (and possibly from bad advice) I figured I'd mark the OSD
> lost and let it write out the pgs to other OSDs which it's in the process
> of backfilling.  However, I'm seeing 1 down+incomplete and 3 incomplete and
> I'm expecting that these won't recover.
>
> So, would love to know what my options are here when all the backfilling
> has finished (or stalled).  Losing data or even entire PGs isn't a big
> problem as this cluster is really just a replica of our main cluster so we
> can restore lost objects manually from there.  Is there a way I can clear
> out/repair/whatever these pgs so I can get a healthy cluster again?
>
> Yes, I know this would have probably been easier with an additional
> storage server and a pool size of 3.  But that's not going to help me right
> now.
>
>
>
> -Original Message-
> From: Mark Johnson <ma...@iovox.com>
> To: ceph-users@ceph.io
> Subject: [ceph-users] Can't get one OSD (out of 14) to start
> Date: Fri, 16 Apr 2021 12:43:33 +
>
>
> Really not sure where to go with this one.  Firstly, a description of my
> cluster.  Yes, I know there are a lot of "not ideals" here but this is what
> I inherited.
>
>
> The cluster is running Jewel and has two storage/mon nodes and an
> additional mon only node, with a pool size of 2.  Today, we had some
> power issues in the data centre and we very ungracefully lost both storage
> servers at the same time.  Node 1 came back online before node 2 but I
> could see there were a few OSDs that were down.  When node 2 came back, I
> started trying to get OSDs up.  Each node has 14 OSDs and I managed to get
> all OSDs up and in on node 2, but one of the OSDs on node 1 keeps starting
> and crashing and just won't stay up.  I'm not finding the OSD log output to
> be much use.  Current health status looks like this:
>
>
> # ceph health
>
> HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs
> down; 26 pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5
> requests are blocked > 32 sec
>
> # ceph status
>
> cluster e2391bbf-15e0-405f-af12-943610cb4909
>
>  health HEALTH_ERR
>
> 26 pgs are stuck inactive for more than 300 seconds
>
> 26 pgs down
>
> 26 pgs peering
>
> 26 pgs stuck inactive
>
> 26 pgs stuck unclean
>
> 5 requests are blocked > 32 sec
>
>
> Any clues as to what I should be looking for or what sort of action I
> should be taking to troubleshoot this?  Unfortunately, I'm a complete
> novice with Ceph.
>
>
> Here's a snippet from the OSD log that means little to me...
>
>
> --- begin dump of recent events ---
>
>  0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal
> (Aborted) **
>
>  in thread 7f2e23921ac0 thread_name:ceph-osd
>
>
>  ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
>
>  1: (()+0x9f1c2a) [0x7f2e24330c2a]
>
>  2: (()+0xf5d0) [0x7f2e21ee95d0]
>
>  3: (gsignal()+0x37) [0x7f2e2049f207]
>
>  4: (abort()+0x148) [0x7f2e204a08f8]
>
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x267) [0x7f2e2442fd47]
>
>  6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&,
> bool*)+0x90c) [0x7f2e2417bc7c]
>
>  7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee)
> [0x7f2e240c8dce]
>
>  8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
>
>  9: (OSD::init()+0x27d) [0x7f2e23d5828d]
>
>  10: (main()+0x2c18) [0x7f2e23c71088]
>
>  11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
>
>  12: (()+0x3c8847) [0x7f2e23d07847]
>
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>
> Thanks in advance,
>
> Mark
>
>
> ___
>
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm: how to create more than 1 rgw per host

2021-04-16 Thread i...@z1storage.com

Hello,

According to the documentation, there's a count-per-host key for 'ceph
orch', but it does not work for me:


:~# ceph orch apply rgw z1 sa-1 --placement='label:rgw count-per-host:2' 
--port=8000 --dry-run

Error EINVAL: Host and label are mutually exclusive

Why does it say anything about Host if I don't specify any hosts, just labels?

~# ceph orch host ls
HOST  ADDR  LABELS   STATUS
s101  s101  mon rgw
s102  s102  mgr mon rgw
s103  s103  mon rgw
s104  s104  mgr mon rgw
s105  s105  mgr mon rgw
s106  s106  mon rgw
s107  s107  mon rgw
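
For reference, the same placement can also be written as a service spec file,
which sidesteps the CLI parsing. A sketch only -- the count_per_host key and the
rgw_* field names assume a cephadm version that supports them:

cat > rgw-z1.yaml <<'EOF'
service_type: rgw
service_id: z1.sa-1
placement:
  label: rgw
  count_per_host: 2
spec:
  rgw_realm: z1
  rgw_zone: sa-1
  rgw_frontend_port: 8000
EOF
ceph orch apply -i rgw-z1.yaml --dry-run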

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can't get one OSD (out of 14) to start

2021-04-16 Thread Mark Johnson
That's the exact same page I used to mark the osd as lost.  Nothing in there
seems to reference the incomplete and down+incomplete pgs that I have, however,
so I really don't know if it helps me.  I don't really understand what my
problem is here.



-Original Message-
From: Alex Gorbachev <a...@iss-integration.com>
To: Mark Johnson <ma...@iovox.com>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 14:16:28 -0400

Hi Mark,

I wonder if the following will help you: 
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/

There are instructions there on how to mark unfound PGs lost and delete them.  
You will regain a healthy cluster that way, and then you can adjust replica 
counts etc to best practice, and restore your objects.

Best regards,
--
Alex Gorbachev
ISS/Storcium



On Fri, Apr 16, 2021 at 10:51 AM Mark Johnson <ma...@iovox.com> wrote:
I ran an fsck on the problem OSD and found and repaired a couple of errors.  
Remounted and started the OSD but it crashed again shortly after as before.  So 
(and possibly from bad advice) I figured I'd mark the OSD lost and let it write
out the pgs to other OSDs which it's in the process of backfilling.  However, 
I'm seeing 1 down+incomplete and 3 incomplete and I'm expecting that these 
won't recover.

So, would love to know what my options are here when all the backfilling has 
finished (or stalled).  Losing data or even entire PGs isn't a big problem as 
this cluster is really just a replica of our main cluster so we can restore 
lost objects manually from there.  Is there a way I can clear 
out/repair/whatever these pgs so I can get a healthy cluster again?

Yes, I know this would have probably been easier with an additional storage 
server and a pool size of 3.  But that's not going to help me right now.



-Original Message-
From: Mark Johnson <ma...@iovox.com>
To: ceph-users@ceph.io
Subject: [ceph-users] Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 12:43:33 +


Really not sure where to go with this one.  Firstly, a description of my 
cluster.  Yes, I know there are a lot of "not ideals" here but this is what I 
inherited.


The cluster is running Jewel and has two storage/mon nodes and an additional 
mon only node, with a pool size of 2.  Today, we had some power issues in the
data centre and we very ungracefully lost both storage servers at the same 
time.  Node 1 came back online before node 2 but I could see there were a few 
OSDs that were down.  When node 2 came back, I started trying to get OSDs up.  
Each node has 14 OSDs and I managed to get all OSDs up and in on node 2, but 
one of the OSDs on node 1 keeps starting and crashing and just won't stay up.  
I'm not finding the OSD log output to be much use.  Current health status looks 
like this:


# ceph health

HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs down; 26 
pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5 requests are 
blocked > 32 sec

# ceph status

cluster e2391bbf-15e0-405f-af12-943610cb4909

 health HEALTH_ERR

26 pgs are stuck inactive for more than 300 seconds

26 pgs down

26 pgs peering

26 pgs stuck inactive

26 pgs stuck unclean

5 requests are blocked > 32 sec


Any clues as to what I should be looking for or what sort of action I should be 
taking to troubleshoot this?  Unfortunately, I'm a complete novice with Ceph.


Here's a snippet from the OSD log that means little to me...


--- begin dump of recent events ---

 0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) 
**

 in thread 7f2e23921ac0 thread_name:ceph-osd


 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)

 1: (()+0x9f1c2a) [0x7f2e24330c2a]

 2: (()+0xf5d0) [0x7f2e21ee95d0]

 3: (gsignal()+0x37) [0x7f2e2049f207]

 4: (abort()+0x148) [0x7f2e204a08f8]

 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x267) [0x7f2e2442fd47]

 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) 
[0x7f2e2417bc7c]

 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) 
[0x7f2e240c8dce]

 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]

 9: (OSD::init()+0x27d) [0x7f2e23d5828d]

 10: (main()+0x2c18) [0x7f2e23c71088]

 11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]

 12: (()+0x3c8847) [0x7f2e23d07847]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this

[ceph-users] Re: Can't get one OSD (out of 14) to start

2021-04-16 Thread Mark Johnson
All the backfill operations are complete and I'm now just left with the 3 
incomplete and 1 down+incomplete:

# ceph health detail
HEALTH_ERR 4 pgs are stuck inactive for more than 300 seconds; 1 pgs down; 4 
pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean; 266 requests are 
blocked > 32 sec; 3 osds have slow requests
pg 1.38 is stuck inactive for 80654.111975, current state incomplete, last 
acting [17,4]
pg 30.7a is stuck inactive for 76259.649932, current state incomplete, last 
acting [12,9]
pg 30.8d is stuck inactive for 76201.794001, current state incomplete, last 
acting [0,5]
pg 30.c1 is stuck inactive for 76305.051390, current state down+incomplete, 
last acting [14,25]
pg 1.38 is stuck unclean for 80654.112037, current state incomplete, last 
acting [17,4]
pg 30.7a is stuck unclean for 76259.649989, current state incomplete, last 
acting [12,9]
pg 30.8d is stuck unclean for 76201.794058, current state incomplete, last 
acting [0,5]
pg 30.c1 is stuck unclean for 76305.051447, current state down+incomplete, last 
acting [14,25]
pg 30.c1 is down+incomplete, acting [14,25]
pg 30.8d is incomplete, acting [0,5]
pg 30.7a is incomplete, acting [12,9]
pg 1.38 is incomplete, acting [17,4]
50 ops are blocked > 33554.4 sec on osd.14
16 ops are blocked > 16777.2 sec on osd.14
2 ops are blocked > 67108.9 sec on osd.12
98 ops are blocked > 33554.4 sec on osd.12
100 ops are blocked > 33554.4 sec on osd.0
3 osds have slow requests


I tried issuing a 'ceph pg repair' to one of those PGs and got the following:

# ceph pg repair 1.38
instructing pg 1.38 on osd.17 to repair

But it doesn't appear to be doing anything.  Health status still says the exact 
same thing.  No idea where to go from here.


-Original Message-
From: Mark Johnson <ma...@iovox.com>
To: a...@iss-integration.com
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 22:00:20 +


That's the exact same page I used to mark the osd as lost.  Nothing in there
seems to reference the incomplete and down+incomplete pgs that I have, however,
so I really don't know if it helps me.  I don't really understand what my
problem is here.




-Original Message-
From: Alex Gorbachev <a...@iss-integration.com>
To: Mark Johnson <ma...@iovox.com>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 14:16:28 -0400


Hi Mark,


I wonder if the following will help you:



https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/



There are instructions there on how to mark unfound PGs lost and delete them.  
You will regain a healthy cluster that way, and then you can adjust replica 
counts etc to best practice, and restore your objects.


Best regards,

--

Alex Gorbachev

ISS/Storcium




On Fri, Apr 16, 2021 at 10:51 AM Mark Johnson <ma...@iovox.com> wrote:

I ran an fsck on the problem OSD and found and repaired a couple of errors.  
Remounted and started the OSD but it crashed again shortly after as before.  So 
(and possibly from bad advice) I figured I'd mark the OSD lost and let it write
out the pgs to other OSDs which it's in the process of backfilling.  However, 
I'm seeing 1 down+incomplete and 3 incomplete and I'm expecting that these 
won't recover.


So, would love to know what my options are here when all the backfilling has 
finished (or stalled).  Losing data or even entire PGs isn't a big problem as 
this cluster is really just a replica of our main cluster so we can restore 
lost objects manually from there.  Is there a way I can clear 
out/repair/whatever these pgs so I can get a healthy cluster again?


Yes, I know this would have probably been easier with an additional storage 
server and a pool size of 3.  But that's not going to help me right now.




-Original Message-
From: Mark Johnson <ma...@iovox.com>
To: ceph-users@

[ceph-users] Re: Can't get one OSD (out of 14) to start

2021-04-16 Thread Mark Johnson
Querying the problem pgs gives me the following:

1.38:
{
"state": "incomplete",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 2247,
"up": [
17,
4
],
"acting": [
17,
4
],

.

"up": [
14,
6
],
"acting": [
14,
6
],
"primary": 14,
"up_primary": 14
},

 .

"probing_osds": [
"4",
"17",
"22"
],
"down_osds_we_would_probe": [
6
],
"peering_blocked_by": [],
"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}
]

30.c1:
{
"state": "down+incomplete",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 2247,
"up": [
14,
25
],
"acting": [
14,
25
],

..

"up": [
14,
25
],
"acting": [
14,
25
],
"blocked_by": [
6
],

.

"probing_osds": [
"14",
"25"
],
"down_osds_we_would_probe": [
6
],
"peering_blocked_by": [],
"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}
]

Both of these mention the "lost" osd 6 in down_osds_we_would_probe, and being
"blocked by" 6.

I've seen this thread -
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
- which mentions using ceph-objectstore-tool on the primary OSD to mark the pg
as complete, while another response said to pick a "winner" copy of the pg out
of those available and use ceph-objectstore-tool to remove the other one, and
"hopefully the winner you left alone will allow the pg to recover and go
active".  But I'm kinda lost as to whether these are correct, as there's no
follow-up response to say what (if anything) worked.
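
From reading around, the mark-complete route would look roughly like this (a
sketch only, not something I have run yet -- it assumes osd.17 is the primary
for pg 1.38, a filestore layout, and a ceph-objectstore-tool build that has the
mark-complete op; I'd export the pg first as a safety copy):

systemctl stop ceph-osd@17

# safety copy of the pg before touching anything
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
    --journal-path /var/lib/ceph/osd/ceph-17/journal \
    --pgid 1.38 --op export --file /root/pg1.38.export

# mark the pg complete on the primary, then bring the OSD back
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
    --journal-path /var/lib/ceph/osd/ceph-17/journal \
    --pgid 1.38 --op mark-complete

systemctl start ceph-osd@17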



If I try to query the other two, they just hang there, which is even more 
concerning, and I have to break out, at which point I see this:

RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
"pgid": "30.7a"}']": exception 'int' object is not iterable

and

RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
"pgid": "30.8d"}']": exception 'int' object is not iterable

I'm currently able to write some data into Ceph via rados but it appears some 
is failing, presumably due to it wanting to write to the problem pgs.



-Original Message-
From: Mark Johnson <ma...@iovox.com>
To: a...@iss-integration.com
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Can't get one OSD (out of 14) to start
Date: Sat, 17 Apr 2021 02:20:37 +


All the backfill operations are complete and I'm now just left with the 3 
incomplete and 1 down+incomplete:


# ceph health detail

HEALTH_ERR 4 pgs are stuck inactive for more than 300 seconds; 1 pgs down; 4 
pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean; 266 requests are 
blocked > 32 sec; 3 osds have slow requests

pg 1.38 is stuck inactive for 80654.111975, current state incomplete, last 
acting [17,4]

pg 30.7a is stuck inactive for 76259.649932, current state incomplete, last 
acting [12,9]

pg 30.8d is stuck inactive for 76201.794001, current state incomplete, last 
acting [0,5]

pg 30.c1 is stuck inactive for 76305.051390, current state down+incomplete, 
last acting [14,25]

pg 1.38 is stuck unclean for 80654.112037, current state incomplete, last 
acting [17,4]

pg 30.7a is stuck unclean for 76259.649989, current state incomplete, last 
acting [12,9]

pg 30.8d is stuck unclean for 76201.794058, current state incomplete, last 
acting [0,5]

pg 30.c1 is stuck unclean for 76305.051447, current state down+incomplete, last 
acting [14,25]

pg 30.c1 is down+incomplete, acting [14,25]

pg 30.8d is incomplete, acting [0,5]

pg 30.7a is incomplete, acting [12,9]

pg 1.38 is incomplete, acting [17,4]

50 ops are blocked > 33554.4 sec on osd.14

16 ops are blocked > 16777.2 sec on osd.14

2 ops are blocked > 67108.9 sec on osd.12

98 ops are blocked > 33554.4 sec on osd.12

100 ops are blocked > 33554.4 sec on osd.0

3 osds have slow requests



I tried issuing a 'ceph pg repair' to one of those PGs and got the following:


# ceph pg repair 1.38

instructing pg 1.38 on osd.17 to repair


But it doesn't appea