Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Brett Chancellor
For me, it was the .rgw.meta pool that had very dense placement groups. The OSDs would fail to start and would then commit suicide while trying to scan the PGs. We had to remove all references to those placement groups just to get the OSDs to start. It wasn't pretty. On Mon, Aug 19, 2019, 2:09 AM
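A minimal sketch of the kind of PG removal described above, done with ceph-objectstore-tool against a stopped OSD. The OSD id 45 comes from later in this thread; the pgid 5.1f and file paths are placeholders, not values from the original messages.

# Stop the OSD before touching its object store
systemctl stop ceph-osd@45

# Keep a copy of the problem PG before removing anything
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-45 \
    --pgid 5.1f --op export --file /root/pg5.1f.export

# Remove the PG from this OSD so it is no longer scanned at startup
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-45 \
    --pgid 5.1f --op remove --force

systemctl start ceph-osd@45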

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Troy Ablan
Yes, it's possible that they do, but since all of the affected OSDs are still down and the monitors have been restarted since, all of those pools have PGs that are in an unknown state and return nothing from ceph pg ls. There weren't that many placement groups for the SSDs, but also I don't
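A quick sketch, not from the thread itself, of how the "unknown" PGs mentioned above can be listed from the cluster side (column positions can differ between Ceph releases):

# Show PGs the mgr currently reports as unknown
ceph pg dump pgs_brief 2>/dev/null | awk '$2 == "unknown"'

# Health detail also summarizes pools with unknown PGs
ceph health detail | grep -i unknown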

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Brett Chancellor
This sounds familiar. Do any of these pools on the SSDs have fairly dense object-to-placement-group ratios? Like more than 500k objects per PG? (ceph pg ls) On Sun, Aug 18, 2019, 10:12 PM Brad Hubbard wrote: > On Thu, Aug 15, 2019 at 2:09 AM Troy Ablan wrote: > > > > Paul, > > > > Thanks for the
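A hedged sketch of the check being suggested above: flag PGs carrying more than ~500k objects. In Mimic the OBJECTS count is the second column of ceph pg ls; adjust the awk field if your release formats the output differently. The pool name .rgw.meta is the one mentioned earlier in this thread.

# Any PG with more than 500k objects, cluster-wide
ceph pg ls | awk 'NR > 1 && $2 > 500000 {print $1, $2}'

# The same check restricted to one pool
ceph pg ls-by-pool .rgw.meta | awk 'NR > 1 && $2 > 500000 {print $1, $2}'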

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Brad Hubbard
On Thu, Aug 15, 2019 at 2:09 AM Troy Ablan wrote: > > Paul, > > Thanks for the reply. All of these seemed to fail except for pulling > the osdmap from the live cluster. > > -Troy > > -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path > /var/lib/ceph/osd/ceph-45/ --file osdmap45 > terminate
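Since pulling the osdmap from the live cluster was the one step that worked, a rough sketch of the usual follow-up: fetch a known-good epoch from the monitors and write it into the offline OSD. The epoch 12345 and file path are placeholders; omit the epoch to fetch the current map.

# Fetch an osdmap epoch from the (healthy) monitors
ceph osd getmap 12345 -o /root/osdmap.12345

# Write it into the stopped OSD's store; --force may be required
# if the tool refuses to overwrite an existing epoch
ceph-objectstore-tool --op set-osdmap \
    --data-path /var/lib/ceph/osd/ceph-45/ --file /root/osdmap.12345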

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Troy Ablan
On 8/18/19 6:43 PM, Brad Hubbard wrote: That's this code. 3114 switch (alg) { 3115 case CRUSH_BUCKET_UNIFORM: 3116 size = sizeof(crush_bucket_uniform); 3117 break; 3118 case CRUSH_BUCKET_LIST: 3119 size = sizeof(crush_bucket_list); 3120 break; 3121 case CRUSH_BUCKET_TRE
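The code quoted above is the CRUSH bucket-size switch that the abort points at, which suggests a bucket with an unrecognized alg value in the map being decoded. One sanity check, sketched here and not taken from the thread, is whether the monitors' copy of the crush map decompiles cleanly:

# Pull and decompile the crush map held by the monitors
ceph osd getcrushmap -o /root/crushmap.bin
crushtool -d /root/crushmap.bin -o /root/crushmap.txt

# Each bucket's alg should be one of uniform, list, tree, straw, or straw2
grep -w alg /root/crushmap.txt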

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Brad Hubbard
On Thu, Aug 15, 2019 at 2:09 AM Troy Ablan wrote: > > Paul, > > Thanks for the reply. All of these seemed to fail except for pulling > the osdmap from the live cluster. > > -Troy > > -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path > /var/lib/ceph/osd/ceph-45/ --file osdmap45 > terminate
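For completeness, a sketch of how a backtrace could be captured from the aborting command shown above; with the matching ceph debuginfo packages installed, the trace shows which decode step trips the terminate handler.

# Run the failing command under gdb and grab a backtrace at the abort
gdb --args ceph-objectstore-tool --op get-osdmap \
    --data-path /var/lib/ceph/osd/ceph-45/ --file osdmap45
(gdb) run
(gdb) bt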