We ran another simulation yesterday, this time with the benchmark running. When we detached the drive while the benchmark was generating IO, Ceph noticed it straight away and marked osd.6 as down.
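For what it's worth, this is roughly how we drive the test now; a minimal sketch assuming a default install (the pool name "rbd" and the admin socket path are assumptions, adjust to your setup):

# rados -p rbd bench 120 write    # keep client IO hitting the OSDs while the drive is pulled
# ceph -w                         # in a second terminal: the "osd.6 ... failed (N reports from N peers ...)" line shows up here
# ceph osd tree | grep osd.6      # flips from "up" to "down" once the grace period is exceeded
# ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok config show | grep -E 'osd_heartbeat_(interval|grace)'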
So in the first test, when there was no IO, it took about an hour before something hit the OSD and noticed that osd.6, or rather the hard drive behind it, no longer exists (a rough sketch of the log checks is after the quoted thread below).

Regards.

On 19 February 2014 15:36, Wido den Hollander <w...@42on.com> wrote:
> On 02/19/2014 02:22 PM, Thorvald Hallvardsson wrote:
>
>> Eventually after 1 hour it spotted that. I took the disk out at 11:06:02,
>> so literally 1 hour later:
>>
>> 6    0.9    osd.6    down    0
>> 7    0.9    osd.7    up      1
>> 8    0.9    osd.8    up      1
>>
>> 2014-02-19 12:06:02.802388 mon.0 [INF] osd.6 172.17.12.15:6800/1569
>> failed (3 reports from 3 peers after 22.338687 >= grace 20.000000)
>>
>> but 1 hour is a bit ... too long, isn't it?
>>
> The OSD will commit suicide if it encounters too many I/O errors, but it's
> not clear what exactly happened in this case.
>
> I suggest you take a look at the logs of osd.6 to see why it stopped
> working.
>
> Wido
>
>> On 19 February 2014 11:31, Thorvald Hallvardsson
>> <thorvald.hallvards...@gmail.com> wrote:
>>
>> Hi guys,
>>
>> Quick question. I have a VM with some SCSI drives which act as the
>> OSDs in my test lab. I have removed one of the SCSI drives so it's
>> totally gone from the system and syslog is reporting I/O errors, but
>> the cluster still looks healthy.
>>
>> Can you tell me why? I'm trying to reproduce what would happen if a
>> real drive failed.
>>
>> # ll /dev/sd*
>> brw-rw---- 1 root disk 8,  0 Feb 19 11:13 /dev/sda
>> brw-rw---- 1 root disk 8,  1 Feb 17 16:45 /dev/sda1
>> brw-rw---- 1 root disk 8,  2 Feb 17 16:45 /dev/sda2
>> brw-rw---- 1 root disk 8,  5 Feb 17 16:45 /dev/sda5
>> brw-rw---- 1 root disk 8, 32 Feb 19 11:13 /dev/sdc
>> brw-rw---- 1 root disk 8, 33 Feb 17 16:45 /dev/sdc1
>> brw-rw---- 1 root disk 8, 34 Feb 19 11:11 /dev/sdc2
>> brw-rw---- 1 root disk 8, 48 Feb 19 11:13 /dev/sdd
>> brw-rw---- 1 root disk 8, 49 Feb 17 16:45 /dev/sdd1
>> brw-rw---- 1 root disk 8, 50 Feb 19 11:05 /dev/sdd2
>>
>> Feb 19 11:06:02 ceph-test-vosd-03 kernel: [586497.813485] sd 2:0:1:0: [sdb] Synchronizing SCSI cache
>> Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197668] XFS (sdb1): metadata I/O error: block 0x39e116d3 ("xlog_iodone") error 19 numblks 64
>> Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197815] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1115 of file /build/buildd/linux-lts-saucy-3.11.0/fs/xfs/xfs_log.c. Return address = 0xffffffffa01e1fe1
>> Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197823] XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
>> Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197880] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
>> Feb 19 11:06:43 ceph-test-vosd-03 kernel: [586538.306817] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:07:13 ceph-test-vosd-03 kernel: [586568.415986] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:07:43 ceph-test-vosd-03 kernel: [586598.525178] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:08:13 ceph-test-vosd-03 kernel: [586628.634356] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:08:43 ceph-test-vosd-03 kernel: [586658.743533] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:09:13 ceph-test-vosd-03 kernel: [586688.852714] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:09:43 ceph-test-vosd-03 kernel: [586718.961903] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:10:13 ceph-test-vosd-03 kernel: [586749.071076] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:10:43 ceph-test-vosd-03 kernel: [586779.180263] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:11:13 ceph-test-vosd-03 kernel: [586809.289440] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:11:44 ceph-test-vosd-03 kernel: [586839.398626] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:12:14 ceph-test-vosd-03 kernel: [586869.507804] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:12:44 ceph-test-vosd-03 kernel: [586899.616988] XFS (sdb1): xfs_log_force: error 5 returned.
>> Feb 19 11:12:52 ceph-test-vosd-03 kernel: [586907.848993] end_request: I/O error, dev fd0, sector 0
>>
>> mount:
>> /dev/sdb1 on /var/lib/ceph/osd/ceph-6 type xfs (rw,noatime)
>> /dev/sdc1 on /var/lib/ceph/osd/ceph-7 type xfs (rw,noatime)
>> /dev/sdd1 on /var/lib/ceph/osd/ceph-8 type xfs (rw,noatime)
>>
>> ll /var/lib/ceph/osd/ceph-6
>> ls: cannot access /var/lib/ceph/osd/ceph-6: Input/output error
>>
>> -4    2.7    host ceph-test-vosd-03
>> 6     0.9    osd.6    up    1
>> 7     0.9    osd.7    up    1
>> 8     0.9    osd.8    up    1
>>
>> # ceph-disk list
>> /dev/fd0 other, unknown
>> /dev/sda :
>>  /dev/sda1 other, ext2
>>  /dev/sda2 other
>>  /dev/sda5 other, LVM2_member
>> /dev/sdc :
>>  /dev/sdc1 ceph data, active, cluster ceph, osd.7, journal /dev/sdc2
>>  /dev/sdc2 ceph journal, for /dev/sdc1
>> /dev/sdd :
>>  /dev/sdd1 ceph data, active, cluster ceph, osd.8, journal /dev/sdd2
>>  /dev/sdd2 ceph journal, for /dev/sdd1
>>
>>   cluster 1a588c94-6f5e-4b04-bc07-f5ce99b91a35
>>    health HEALTH_OK
>>    monmap e7: 3 mons at {ceph-test-mon-01=172.17.12.11:6789/0,ceph-test-mon-02=172.17.12.12:6789/0,ceph-test-mon-03=172.17.12.13:6789/0},
>>           election epoch 50, quorum 0,1,2 ceph-test-mon-01,ceph-test-mon-02,ceph-test-mon-03
>>    mdsmap e4: 1/1/1 up {0=ceph-test-admin=up:active}
>>    osdmap e124: 9 osds: 9 up, 9 in
>>    pgmap v1812: 256 pgs, 13 pools, 1522 MB data, 469 objects
>>          3379 MB used, 8326 GB / 8329 GB avail
>>               256 active+clean
>>
>> So as you can see osd.6 is missing but the cluster is happy.
>>
>> Thank you.
>>
>> Regards.
>>
> --
> Wido den Hollander
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
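In case it helps anyone else reproducing this, here is a rough sketch of the checks I'd run to see what finally killed osd.6, assuming default log and admin socket paths (option names can differ between releases, so treat the grep patterns as a starting point rather than gospel):

# grep -iE 'error|abort|suicide|FAILED assert' /var/log/ceph/ceph-osd.6.log | tail -n 20
# ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok config show | grep -E 'filestore_fail_eio|osd_heartbeat_grace'
(osd.6 is down so its own admin socket is gone; a surviving OSD such as osd.7 shows the same defaults)

My working guess, as above, is that with no IO the OSD process simply didn't touch the dead filesystem for a long time and was only reported down by its peers once it finally hit it and died, but as Wido says, the osd.6 log is the place to confirm that.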
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com