[ceph-users] Simultaneous CEPH OSD crashes
Hi,

we just had two quasi-simultaneous crashes on two different OSDs, which blocked our VMs (min_size = 2, size = 3) on Firefly 0.80.9.

The first OSD to go down had this error:

2015-09-27 06:30:33.257133 7f7ac7fef700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7f7ac7fef700 time 2015-09-27 06:30:33.145251
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

The second OSD crash was similar:

2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27 06:30:57.260978
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

I'm familiar with this error: it already happened after a BTRFS read error (invalid csum), and I could correct it by flushing the journal, deleting the corrupted file, restarting the OSD and issuing a pg repair. This time, though, there is no kernel log indicating an invalid csum. The kernel is different, however: we use 3.18.9 on these two servers while the others run 4.0.5, so maybe BTRFS doesn't log invalid checksum errors with this version. I've launched btrfs scrub on the two filesystems just in case (still waiting for completion).

The first attempt to restart these OSDs failed: one OSD died 19 seconds after start, the other 21 seconds. Seeing that, I temporarily lowered min_size to 1, which allowed the 9 incomplete PGs to recover. I verified this by bringing min_size back up to 2, and then restarted the 2 OSDs. They haven't crashed yet.

For reference, the assert failures were still the same when the OSDs died shortly after start:

2015-09-27 08:20:19.332835 7f4467bd0700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7f4467bd0700 time 2015-09-27 08:20:19.325126
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

2015-09-27 08:20:50.626344 7f97f2d95700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7f97f2d95700 time 2015-09-27 08:20:50.605234
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

Note that at 2015-09-27 06:30:11 a deep-scrub started on a PG involving one (and only one) of these 2 OSDs. As we evenly space deep-scrubs (currently with a 10 minute interval), this might be relevant (or just a coincidence).

I made copies of the Ceph OSD logs (including the stack trace and the recent events) if needed.

Can anyone shed some light on why these OSDs died?

Best regards,

Lionel Bouton
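P.S. For reference, the earlier invalid-csum recovery and today's min_size workaround amount to roughly the following sketch (the OSD id, PG id, pool name and object path are placeholders, and the start/stop commands depend on the init system in use):

  # Earlier invalid-csum case: stop the OSD, flush its journal, remove the
  # file BTRFS reported as corrupted, restart the OSD and repair the PG.
  # (osd 12, pg 3.1a7 and the object path are made-up examples)
  /etc/init.d/ceph stop osd.12            # or: service ceph stop osd.12
  ceph-osd -i 12 --flush-journal
  rm /var/lib/ceph/osd/ceph-12/current/3.1a7_head/<corrupted object file>
  /etc/init.d/ceph start osd.12
  ceph pg repair 3.1a7

  # Today's workaround: temporarily accept I/O with a single replica so the
  # incomplete PGs can recover, then restore the original setting.
  ceph osd pool set rbd min_size 1          # 'rbd' stands in for the affected pool
  ceph health detail | grep -i incomplete   # wait until nothing is incomplete
  ceph osd pool set rbd min_size 2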
Re: [ceph-users] Simultaneous CEPH OSD crashes
On 27/09/2015 09:15, Lionel Bouton wrote:
> [full report on the two FileStore::read EIO assert failures, quoted from the message above]
> Can anyone shed some light on why these OSDs died?

I just had a thought. Could launching a defragmentation on a file in a BTRFS OSD filestore trigger this problem?

We have a process doing just that. It waits until there has been no recent access before queuing files for defragmentation, but there's no guarantee that it won't defragment a file an OSD is about to use. This might explain the nearly simultaneous crashes, as the defragmentation is triggered by write access patterns which should be roughly the same on all 3 OSDs hosting a copy of the file. The defragmentation doesn't run at exactly the same time because it is queued, which could explain why we got 2 crashes instead of 3.
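Stripped to its essence, the defragmentation pass is something like the sketch below (the mount point, the 10-minute "no recent access" window and the extent target are illustrative values, not our exact ones). The point is that nothing here coordinates with ceph-osd, so a file can be defragmented while the OSD is reading it:

  # Walk a FileStore and defragment files that have not been accessed recently.
  # No locking against the OSD process: defragmentation can race with reads.
  OSD_DATA=/var/lib/ceph/osd/ceph-12/current
  find "$OSD_DATA" -type f -amin +10 -print0 |
    while IFS= read -r -d '' f; do
        btrfs filesystem defragment -t 32M "$f"
    done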
I'll probably ask on linux-btrfs too, but knowing the possible conditions leading to this assert failure would help pinpoint the problem, so if someone knows this code well enough without knowing how BTRFS behaves while defragmenting, I'll bridge the gap.

I just activated autodefrag for all the BTRFS filesystems on one of the two affected servers and disabled our own defragmentation process there. With recent tunings we might not need our own defragmentation scheduler anymore, and we can afford to lose some performance while investigating this.

Best regards,

Lionel
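P.S. On that server this amounted to something like the following (the mount point is a placeholder; the option should also go into /etc/fstab so it survives a reboot):

  # Let the kernel's autodefrag heuristic handle fragmentation instead of
  # our own scheduler, and check on the scrub launched after the crashes.
  mount -o remount,autodefrag /var/lib/ceph/osd/ceph-12
  grep autodefrag /proc/mounts              # confirm the option is active
  btrfs scrub status /var/lib/ceph/osd/ceph-12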
[ceph-users] Teuthology Integration to native openstack
Hi,

We have an OpenStack deployment in place with Ceph as the Cinder backend. We would like to perform functional testing of Ceph and found teuthology to be the recommended option. We have successfully installed teuthology.

Now, to integrate it with OpenStack, I can see that the possible providers are OVH, REDHAT or ENTERCLOUDSITE. Is there any option where we can use an OpenStack deployment of our own and test Ceph with teuthology? If not, please suggest how to test Ceph in such a scenario.

Please help. Thank you.

Bharath Krishna
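P.S. For context, our deployment is reachable with the standard OpenStack client credentials sketched below (all values are placeholders); the open question is whether teuthology can be pointed at such an endpoint instead of one of the listed providers:

  # Standard OpenStack credentials for a private deployment (placeholder values).
  export OS_AUTH_URL=http://keystone.example.com:5000/v2.0
  export OS_TENANT_NAME=ceph-qa
  export OS_USERNAME=teuthology
  export OS_PASSWORD=secret
  # Sanity check that the cloud answers with these credentials.
  openstack server list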
Re: [ceph-users] CephFS "corruption" -- Nulled bytes
I've done some digging into cp and mv's semantics (from coreutils). If the inode already exists, the file gets truncated, then the data gets copied in. This is definitely within the scope of the bug above.
--
Adam

On Fri, Sep 25, 2015 at 8:08 PM, Adam Tygart wrote:
> It may have been, although the timestamp on the file was almost a
> month ago. The typical workflow for this particular file is to copy an
> updated version over top of it, i.e. 'cp qss kstat'.
>
> I'm not sure if cp semantics would keep the same inode and simply
> truncate/overwrite the contents, or if it would do an unlink and then
> create a new file.
> --
> Adam
>
> On Fri, Sep 25, 2015 at 8:00 PM, Ivo Jimenez wrote:
>> Looks like you might be experiencing this bug:
>>
>> http://tracker.ceph.com/issues/12551
>>
>> The fix has been merged to master and I believe it'll be part of infernalis.
>> The original reproducer involved truncating/overwriting files. In your
>> example, do you know if 'kstat' has been truncated/overwritten prior to
>> generating the md5sums?
>>
>> On Fri, Sep 25, 2015 at 2:11 PM Adam Tygart wrote:
>>>
>>> Hello all,
>>>
>>> I've run into some sort of bug with CephFS. Client reads of a
>>> particular file return nothing but 40 KB of null bytes. Doing a
>>> rados-level get of the inode returns the whole file, correctly.
>>>
>>> Tested via Linux 4.1 and 4.2 kernel clients, and the 0.94.3 fuse client.
>>>
>>> Attached is a dynamic printk debug of the ceph module from the Linux
>>> 4.2 client while cat'ing the file.
>>>
>>> My current thought is that there has to be a cache of the object
>>> *somewhere* that a 'rados get' bypasses.
>>>
>>> Even on hosts that have *never* read the file before, it returns
>>> null bytes from the kernel and fuse mounts.
>>>
>>> Background:
>>>
>>> 24x CentOS 7.1 hosts serving up RBD and CephFS with Ceph 0.94.3.
>>> CephFS is an EC k=8, m=4 pool with a size 3 writeback cache in front of it.
>>>
>>> # rados -p cachepool get 10004096b95. /tmp/kstat-cache
>>> # rados -p ec84pool get 10004096b95. /tmp/kstat-ec
>>> # md5sum /tmp/kstat*
>>> ddfbe886420a2cb860b46dc70f4f9a0d  /tmp/kstat-cache
>>> ddfbe886420a2cb860b46dc70f4f9a0d  /tmp/kstat-ec
>>> # file /tmp/kstat*
>>> /tmp/kstat-cache: Perl script, ASCII text executable
>>> /tmp/kstat-ec:    Perl script, ASCII text executable
>>>
>>> # md5sum ~daveturner/bin/kstat
>>> 1914e941c2ad5245a23e3e1d27cf8fde  /homes/daveturner/bin/kstat
>>> # file ~daveturner/bin/kstat
>>> /homes/daveturner/bin/kstat: data
>>>
>>> Thoughts?
>>>
>>> Any more information you need?
>>>
>>> --
>>> Adam
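For completeness, a quick way to check cp's behaviour directly on a given system (file names as in the example above; an unchanged inode number after the copy means cp truncated and rewrote the existing destination rather than unlinking and recreating it, i.e. the truncate/overwrite pattern from the tracker issue):

  # Compare the destination's inode before and after copying over it.
  stat -c 'before: inode %i, size %s bytes' kstat
  cp qss kstat
  stat -c 'after:  inode %i, size %s bytes' kstat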