Hi Christian,
Thanks for taking the time. I haven't been contacted by anyone yet, but I managed
to clear the down placement groups by exporting 7.4s0 and 7.fs0 and then
marking them as complete on the surviving OSDs:
kvm5c:
ceph-objectstore-tool --op export --pgid 7.4s0 \
  --data-path /var/lib/ceph/osd/ceph-8 \
  --journal-path /var/lib/ceph/osd/ceph-8/journal \
  --file /var/lib/vz/template/ssd_recovery/osd8_7.4s0.export
ceph-objectstore-tool --op mark-complete \
  --data-path /var/lib/ceph/osd/ceph-8 \
  --journal-path /var/lib/ceph/osd/ceph-8/journal \
  --pgid 7.4s0
kvm5f:
ceph-objectstore-tool --op export --pgid 7.fs0 \
  --data-path /var/lib/ceph/osd/ceph-23 \
  --journal-path /var/lib/ceph/osd/ceph-23/journal \
  --file /var/lib/vz/template/ssd_recovery/osd23_7.fs0.export
ceph-objectstore-tool --op mark-complete \
  --data-path /var/lib/ceph/osd/ceph-23 \
  --journal-path /var/lib/ceph/osd/ceph-23/journal \
  --pgid 7.fs0
This presumably just punches holes in the affected RBD images, but at least we can
copy them out of that pool and hope that Intel can somehow unlock the drives
so that we can then export/import the missing objects.
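Should the drives come back, the import leg would presumably just be the mirror
image of the export above, roughly as follows (the OSD id is a placeholder for
whichever OSD ends up hosting the shard, and that OSD's daemon has to be stopped
while the tool runs):
ceph-objectstore-tool --op import \
  --data-path /var/lib/ceph/osd/ceph-<id> \
  --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
  --file /var/lib/vz/template/ssd_recovery/osd8_7.4s0.export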
To answer your questions though: we have 6 near-identical Intel Wildcat Pass 1U
servers with Proxmox loaded on them. Proxmox uses a Debian 9 base with the
Ubuntu kernel, to which they apply cherry-picked kernel patches (e.g. Intel NIC
driver updates, vhost performance regression and memory-leak fixes, etc):
kvm5a:
Intel R1208WTTGSR System (serial: BQWS55091014)
Intel S2600WTTR Motherboard (serial: BQWL54950385, BIOS ID:
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT disabled)
24 x Micron 8GB DDR4 2133MHz (24 x 18ASF1G72PZ-2G1B1)
Intel AXX10GBNIA I/O Module
kvm5b:
Intel R1208WTTGS System (serial: BQWS53890178)
Intel S2600WTT Motherboard (serial: BQWL52550359, BIOS ID:
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
kvm5c:
Intel R1208WT2GS System (serial: BQWS50490279)
Intel S2600WT2 Motherboard (serial: BQWL44650203, BIOS ID:
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v3 2.6GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
kvm5d:
Intel R1208WTTGSR System (serial: BQWS62291318)
Intel S2600WTTR Motherboard (serial: BQWL61855187, BIOS ID:
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
kvm5e:
Intel R1208WTTGSR System (serial: BQWS64290162)
Intel S2600WTTR Motherboard (serial: BQWL63953066, BIOS ID:
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
kvm5f:
Intel R1208WTTGSR System (serial: BQWS71790632)
Intel S2600WTTR Motherboard (serial: BQWL71050622, BIOS ID:
SE5C610.86B.01.01.0021.032120170601)
2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled)
4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
Intel AXX10GBNIA I/O Module
Summary:
* 5b has an Intel S2600WTT, 5c has an Intel S2600WT2, all others have
S2600WTTR Motherboards
* 5a has ECC Registered Dual Rank DDR DIMMs, all others have ECC
LoadReduced-DIMMs
* 5c has an Intel X540-AT2 10 GbE adapter as the on-board NICs are only 1 GbE
Each system has identical discs:
* 2 x 480 GB Intel SSD DC S3610 (SSDSC2BX480G4) - partitioned as a software
RAID1 OS volume plus Ceph FileStore journal partitions for the spinners
* 4 x 2 TB Seagate discs (ST2000NX0243) - Ceph FileStore OSDs (journals on
the S3610 partitions)
* 2 x 1.9 TB Intel SSD DC S4600 (SSDSC2KG019T7) - Ceph BlueStore OSDs
(problematic)
Additional information:
* All drives are directly attached to the on-board AHCI SATA controllers,
via the standard 2.5 inch drive chassis hot-swap bays.
* We added 12 x 1.9 TB SSD DC S4600 drives last week Thursday, 2 in each
system's slots 7 & 8
* Systems have been operating with existing Intel SSD DC S3610 and 2 TB
Seagate discs for over a year; we added the most recent node (kvm5f) on the
23rd of November.
* 6 of the 12 Intel SSD DC S4600 drives failed in less than 100 hours.
* They work perfectly until they suddenly stop responding; thereafter they are
completely inaccessible, even after physically shutting the server down and
powering it back up again. Intel's diagnostic tool reports them as
'logically locked'.
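As a quick per-node sanity check on whether a bay still answers at all, we use
something along these lines (the device paths are only illustrative for our layout):
for dev in /dev/sdg /dev/sdh; do
    echo "== ${dev} =="
    hdparm -i "${dev}" | grep -i 'Model='    # identity string (Model/FwRev/SerialNo), if the drive responds
    smartctl -H "${dev}"                     # overall SMART health assessment
done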
Drive failures appear random to me. All 12 drives report Model=INTEL
SSDSC2KG019T7, FwRev=SCV10100:
kvm5a - bay 7 - offline - SerialNo=BTYM739208851P9DGN
kvm5a - bay 8 - online  - SerialNo=PHYM727602TM1P9DGN
kvm5b - bay 7 - offline - SerialNo=PHYM7276031E1P9DGN
kvm5b - bay 8 - online  - SerialNo=BTYM7392087W1P9DGN
kvm5c - bay 7 - offline - SerialNo=BTYM739200ZJ1P9DGN
kvm5c - bay 8 - offline - SerialNo=BTYM7392088B1P9DGN
kvm5d - bay 7 - offline - SerialNo=BTYM738604Y11P9DGN
kvm5d - bay 8 - online  - SerialNo=PHYM727603181P9DGN
kvm5e - bay 7 - online  - SerialNo=BTYM7392013B1P9DGN
kvm5e - bay 8 - offline - SerialNo=BTYM7392087E1P9DGN
kvm5f - bay 7 - online  - SerialNo=BTYM739208721P9DGN
kvm5f - bay 8 - online  - SerialNo=BTYM739208C41P9DGN
Intel SSD Data Center Tool reports:
C:\isdct>isdct.exe show -intelssd
- Intel SSD DC S4600 Series PHYM7276031E1P9DGN -
Bootloader : Property not found
DevicePath : \\.\PHYSICALDRIVE1
DeviceStatus : Selected drive is in a disable logical state.
Firmware : SCV10100
FirmwareUpdateAvailable : Please contact Intel Customer Support for further
assistance at the following website: http://www.intel.com/go/ssdsupport.
Index : 0
ModelNumber : INTEL SSDSC2KG019T7
ProductFamily : Intel SSD DC S4600 Series
SerialNumber : PHYM7276031E1P9DGN
C:\isdct>isdct show -a -intelssd 0
- Intel SSD DC S4600 Series PHYM7276031E1P9DGN -
AccessibleMaxAddressSupported : True
AggregationThreshold : Selected drive is in a disable logical state.
AggregationTime : Selected drive is in a disable logical state.
ArbitrationBurst : Selected drive is in a disable logical state.
BusType : 11
CoalescingDisable : Selected drive is in a disable logical state.
ControllerCompatibleIDs : PCI\VEN_8086&DEV_8C02&REV_05PCI\VEN_8086&DEV_8C02PCI\VEN_8086&CC_010601PCI\VEN_8086&CC_0106PCI\VEN_8086PCI\CC_010601PCI\CC_0106
ControllerDescription : @mshdc.inf,%pci\cc_010601.devicedesc%;Standard SATA AHCI Controller
ControllerID : PCI\VEN_8086&DEV_8C02&SUBSYS_78461462&REV_05\3&11583659&0&FA
ControllerIDEMode : False
ControllerManufacturer : @mshdc.inf,%ms-ahci%;Standard SATA AHCI Controller
ControllerService : storahci
DIPMEnabled : False
DIPMSupported : False
DevicePath : \\.\PHYSICALDRIVE1
DeviceStatus : Selected drive is in a disable logical state.
DigitalFenceSupported : False
DownloadMicrocodePossible : True
DriverDescription : Standard SATA AHCI Controller
DriverMajorVersion : 10
DriverManufacturer : Standard SATA AHCI Controller
DriverMinorVersion : 0
DriverVersion : 10.0.16299.98
DynamicMMIOEnabled : The selected drive does not support this feature.
EnduranceAnalyzer : Selected drive is in a disable logical state.
ErrorString : *BAD_CONTEXT_2020 F4
Firmware : SCV10100
FirmwareUpdateAvailable : Please contact Intel Customer Support for further
assistance at the following website: http://www.intel.com/go/ssdsupport.
HDD : False
HighPriorityWeightArbitration : Selected drive is in a disable logical state.
IEEE1667Supported : False
IOCompletionQueuesRequested : Selected drive is in a disable logical state.
IOSubmissionQueuesRequested : Selected drive is in a disable logical state.
Index : 0
Intel : True
IntelGen3SATA : True
IntelNVMe : False
InterruptVector : Selected drive is in a disable logical state.
IsDualPort : False
LatencyTrackingEnabled : Selected drive is in a disable logical state.
LowPriorityWeightArbitration : Selected drive is in a disable logical state.
Lun : 0
MaximumLBA : 3750748847
MediumPriorityWeightArbitration : Selected drive is in a disable logical state.
ModelNumber : INTEL SSDSC2KG019T7
NVMePowerState : Selected drive is in a disable logical state.
NativeMaxLBA : Selected drive is in a disable logical state.
OEM : Generic
OpalState : Selected drive is in a disable logical state.
PLITestTimeInterval : Selected drive is in a disable logical state.
PNPString : SCSI\DISK&VEN_INTEL&PROD_SSDSC2KG019T7\4&2BE6C224&0&010000
PathID : 1
PhySpeed : Selected drive is in a disable logical state.
PhysicalSectorSize : Selected drive is in a disable logical state.
PhysicalSize : 1920383410176
PowerGovernorAveragePower : Selected drive is in a disable logical state.
PowerGovernorBurstPower : Selected drive is in a disable logical state.
PowerGovernorMode : Selected drive is in a disable logical state.
Product : Youngsville
ProductFamily : Intel SSD DC S4600 Series
ProductProtocol : ATA
ReadErrorRecoveryTimer : Selected drive is in a disable logical state.
RemoteSecureEraseSupported : False
SCSIPortNumber : 0
SMARTEnabled : True
SMARTHealthCriticalWarningsConfiguration : Selected drive is in a disable
logical state.
SMARTSelfTestSupported : True
SMBusAddress : Selected drive is in a disable logical state.
SSCEnabled : False
SanitizeBlockEraseSupported : False
SanitizeCryptoScrambleSupported : True
SanitizeSupported : True
SataGen1 : True
SataGen2 : True
SataGen3 : True
SataNegotiatedSpeed : Unknown
SectorSize : 512
SecurityEnabled : False
SecurityFrozen : False
SecurityLocked : False
SecuritySupported : False
SerialNumber : PHYM7276031E1P9DGN
TCGSupported : False
TargetID : 0
TempThreshold : Selected drive is in a disable logical state.
TemperatureLoggingInterval : Selected drive is in a disable logical state.
TimeLimitedErrorRecovery : Selected drive is in a disable logical state.
TrimSize : 4
TrimSupported : True
VolatileWriteCacheEnabled : Selected drive is in a disable logical state.
WWID : 3959312879584368077
WriteAtomicityDisableNormal : Selected drive is in a disable logical state.
WriteCacheEnabled : True
WriteCacheReorderingStateEnabled : Selected drive is in a disable logical state.
WriteCacheState : Selected drive is in a disable logical state.
WriteCacheSupported : True
WriteErrorRecoveryTimer : Selected drive is in a disable logical state.
SMART information on the failed drives is inaccessible; their overall status
simply reports as failed. Herewith the stats from a partner disc which was
still working when the others failed:
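The output below is plain smartctl; the device path here is only illustrative
for our layout:
smartctl -a /dev/sdh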
Device Model: INTEL SSDSC2KG019T7
Serial Number: PHYM727602TM1P9DGN
LU WWN Device Id: 5 5cd2e4 14e1636bb
Firmware Version: SCV10100
User Capacity: 1,920,383,410,176 bytes [1.92 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Dec 18 19:33:51 2017 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           98
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           3
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always   -           0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           1
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
175 Program_Fail_Count_Chip 0x0033   100   100   010    Pre-fail  Always   -           17567121432
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always   -           0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   077   076   000    Old_age   Always   -           23 (Min/Max 17/29)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always   -           23
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always   -           0
225 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always   -           14195
226 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always   -           0
227 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always   -           42
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           5905
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always   -           0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always   -           0
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always   -           14195
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always   -           10422
243 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           41906
Media wear out: 0% used
LBAs written: 14195
Power on hours: <100
Power cycle count: 3 - once at the factory, once at our offices to check
whether newer firmware was available (there wasn't), and once when we restarted
the node to see if it could then access a failed drive.
Regards
David Herselman
-----Original Message-----
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, 21 December 2017 3:24 AM
To: ceph-users@lists.ceph.com
Cc: David Herselman <d...@syrex.co>
Subject: Re: [ceph-users] Many concurrent drive failures - How do I activate
pgs?
Hello,
first off, I don't have anything to add to your conclusions about the current
status. Alas, there are at least 2 folks here on the ML making a living from
Ceph disaster recovery, so I hope you have been contacted already.
Now once your data is safe or you have a moment, I and others here would
probably be quite interested in some more details, see inline below.
On Wed, 20 Dec 2017 22:25:23 +0000 David Herselman wrote:
[snip]
We've happily been running a 6 node cluster with 4 x FileStore HDDs per node
(journals on SSD partitions) for over a year and recently upgraded all nodes to
Debian 9, Ceph Luminous 12.2.2 and kernel 4.13.8. We ordered 12 x Intel DC
S4600 SSDs which arrived last week so we added two per node on Thursday evening
and brought them up as BlueStore OSDs. We had proactively updated our existing
pools to reference only devices classed as 'hdd', so that we could move select
images over to ssd replicated and erasure coded pools.
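[For context, the device-class based setup referred to here boils down to
something like the following on Luminous; the rule, profile, k/m values and
pool names are illustrative:
ceph osd crush rule create-replicated replicated_hdd default host hdd    # hdd-only rule
ceph osd pool set rbd crush_rule replicated_hdd                          # pin an existing pool to it
ceph osd crush rule create-replicated replicated_ssd default host ssd    # ssd-only rule for the new pools
ceph osd erasure-code-profile set ec_ssd_profile k=4 m=2 crush-device-class=ssd   # EC pools carry the class in their profile
The evacuation mentioned further down was then essentially the reverse: editing
those pools' rules to target the 'hdd' class instead of 'ssd'.]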
Could you tell us more about that cluster: the HW, how the SSDs are connected,
and the FW version of the controller, if applicable?
Kernel 4.13.8 suggests that this is a hand-rolled, upstream kernel.
While not necessarily related, I'll note that as far as Debian kernels (which
are very lightly patched, if at all) are concerned, nothing beyond
4.9 has been working to my satisfaction.
4.11 still worked, but 4.12 crash-reboot-looped on all my Supermicro X10
machines (quite a varied selection).
The current 4.13.13 backport boots on some of those machines, but still throws
errors with the EDAC devices, which work fine with 4.9.
4.14 is known to happily destroy data if used with bcache, and even if one
doesn't use that, it should give you pause.
We were pretty diligent and downloaded Intel's Firmware Update Tool and
validated that each new drive had the latest available firmware before
installing them in the nodes. We did numerous benchmarks on Friday and
eventually moved some images over to the new storage pools. Everything was
working perfectly and extensive tests on Sunday showed excellent performance.
Sunday night one of the new SSDs died and Ceph replicated and redistributed
data accordingly, then another failed in the early hours of Monday morning and
Ceph did what it needed to.
We had the two failed drives replaced by 11am and Ceph was up to 2/4918587
objects degraded (0.000%) when a third drive failed. At this point we updated
the crush maps for the rbd_ssd and ec_ssd pools and set the device class to
'hdd', to essentially evacuate everything off the SSDs. Other SSDs then failed
at 3:22pm, 4:19pm, 5:49pm and 5:50pm. We've ultimately lost half the Intel
S4600 drives, which are all completely inaccessible. Our status at 11:42pm
Monday night was: 1/1398478 objects unfound (0.000%) and 339/4633062 objects
degraded (0.007%).
The relevant logs of when and how those SSDs failed would be interesting.
Was the distribution of the failed SSDs random among the cluster?
Are you running smartd and did it have something to say?
Completely inaccessible sounds a lot like the infamous "self-bricking" of Intel
SSDs when they discover something isn't right, or they don't like the color scheme of the
server inside (^.^).
I'm using quite a lot of Intel SSDs and had only one "fatal" incident.
A DC S3700 detected that its powercap had failed, but of course kept working
fine. Until a reboot was needed, when it promptly bricked itself: data
inaccessible, SMART barely reporting that something was there.
So one wonders what caused your SSDs to get their knickers in such a twist.
Are the survivors showing any unusual signs in their SMART output?
Of course what your vendor/Intel will have to say will also be of interest. ^o^
Regards,
Christian