Thank you, Jason. Not sure how I missed that step.

On 2018-07-06 08:34 AM, Jason Dillaman wrote:
There have been several similar reports about this on the mailing list [1][2][3][4], and they are always the result of skipping step 6 of the Luminous upgrade guide [5]. The new (as of Luminous) 'profile rbd'-style caps are designed to simplify caps going forward [6].

TL;DR: your OpenStack CephX users need permission to blacklist dead clients that failed to properly release the exclusive lock.

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022278.html
[2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022694.html
[3] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026496.html
[4] https://www.spinics.net/lists/ceph-users/msg45665.html
[5] http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
[6] http://docs.ceph.com/docs/luminous/rbd/rbd-openstack/#setup-ceph-client-authentication
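
For anyone landing on this thread later, a rough sketch of the cap change, assuming a cinder user named client.cinder and the stock volumes/vms/images pools (substitute your own entity and pool names, and keep whatever OSD caps the user already has):

    # Step 6 of [5]: clients with only 'allow r' mon caps need permission to
    # blacklist dead clients; the existing OSD caps stay as they are.
    ceph auth caps client.cinder \
        mon 'allow r, allow command "osd blacklist"' \
        osd '<existing OSD caps for user>'

    # Or switch to the Luminous 'profile rbd' caps from [6], which include it:
    ceph auth caps client.cinder \
        mon 'profile rbd' \
        osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'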


On Fri, Jul 6, 2018 at 7:55 AM Gary Molenkamp <molen...@uwo.ca> wrote:

    Good morning all,

    After losing all power to our DC last night due to a storm, nearly all
    of the volumes in our Pike cluster are unmountable.  Of the 30 VMs in
    use at the time, only one has been able to successfully mount and boot
    from its rootfs.  We are using Ceph as the backend storage to cinder
    and glance.  Any help or pointers to bring this back online would be
    appreciated.

    What most of the volumes are seeing is:

    [    2.622252] SGI XFS with ACLs, security attributes, no debug enabled
    [    2.629285] XFS (sda1): Mounting V5 Filesystem
    [    2.832223] sd 2:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [    2.838412] sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
    [    2.842383] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
    [    2.846152] sd 2:0:0:0: [sda] CDB: Write(10) 2a 00 00 80 2c 19 00 04 00 00
    [    2.850146] blk_update_request: I/O error, dev sda, sector 8399897

    or

    [    2.590178] EXT4-fs (vda1): INFO: recovery required on readonly filesystem
    [    2.594319] EXT4-fs (vda1): write access will be enabled during recovery
    [    2.957742] print_req_error: I/O error, dev vda, sector 227328
    [    2.962468] Buffer I/O error on dev vda1, logical block 0, lost async page write
    [    2.967933] Buffer I/O error on dev vda1, logical block 1, lost async page write
    [    2.973076] print_req_error: I/O error, dev vda, sector 229384

    As a test with one of the less critical VMs, I deleted the VM and
    mounted its volume on the one VM I managed to start.  The results
    were not promising:


    # dmesg | tail
    [    5.136862] type=1305 audit(1530847244.811:4): audit_pid=496 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
    [    7.726331] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
    [29374.967315] scsi 2:0:0:1: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
    [29374.988104] sd 2:0:0:1: [sdb] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
    [29374.991126] sd 2:0:0:1: Attached scsi generic sg1 type 0
    [29374.995302] sd 2:0:0:1: [sdb] Write Protect is off
    [29374.997109] sd 2:0:0:1: [sdb] Mode Sense: 63 00 00 08
    [29374.997186] sd 2:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [29375.005968]  sdb: sdb1
    [29375.007746] sd 2:0:0:1: [sdb] Attached SCSI disk

    # parted /dev/sdb
    GNU Parted 3.1
    Using /dev/sdb
    Welcome to GNU Parted! Type 'help' to view a list of commands.
    (parted) p
    Model: QEMU QEMU HARDDISK (scsi)
    Disk /dev/sdb: 42.9GB
    Sector size (logical/physical): 512B/512B
    Partition Table: msdos
    Disk Flags:

    Number  Start   End     Size    Type     File system  Flags
      1      1049kB  42.9GB  42.9GB  primary  xfs          boot

    # mount -t xfs /dev/sdb temp
    mount: wrong fs type, bad option, bad superblock on /dev/sdb,
            missing codepage or helper program, or other error

            In some cases useful info is found in syslog - try
            dmesg | tail or so.

    # xfs_repair /dev/sdb
    Phase 1 - find and verify superblock...
    bad primary superblock - bad magic number !!!

    attempting to find secondary superblock...



    The secondary superblock search eventually fails as well.  The Ceph
    cluster itself looks healthy, and I can export the volumes from rbd.
    I can find no other errors in Ceph or OpenStack indicating a fault in
    either system.
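
    For reference, the health and export checks were roughly along these
    lines (a sketch only; the "volumes" pool and volume-<uuid> image names
    are placeholders from a standard cinder/rbd setup):

        ceph -s                                  # overall cluster health
        rbd -p volumes ls                        # list the cinder volume images
        rbd export volumes/volume-<uuid> /tmp/volume-<uuid>.raw   # raw export of one volume
        rbd status volumes/volume-<uuid>         # any watchers still registered on the image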

         - Is this recoverable?

         - What happened to all of these volumes, and can this be
    prevented from occurring again?  Note that any VM that was shut down
    at the time of the outage appears to be fine.


    Relevant versions:

         Base OS:  all CentOS 7.5

         Ceph:  Luminous 12.2.5-0

         OpenStack:  latest Pike releases from centos-release-openstack-pike-1-1

             nova 16.1.4-1

             cinder 11.1.1-1






--
Jason

--
Gary Molenkamp                  Computer Science/Science Technology Services
Systems Administrator           University of Western Ontario
molen...@uwo.ca                 http://www.csd.uwo.ca
(519) 661-2111 x86882           (519) 661-3566

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
