On Thu, Jun 5, 2014 at 4:38 AM, Dennis Kramer <den...@holmes.nl> wrote:
> Hi all,
>
> A couple of weeks ago I upgraded from Emperor to Firefly.
> I'm using CloudStack with Ceph as the storage backend for VMs and
> templates.
Which versions exactly were you running before, and which are you
running now?

> Since the upgrade, ceph has been in HEALTH_ERR with 500+ pgs
> inconsistent and 2000+ scrub errors. I'm not sure whether it has to do
> with Firefly, but the upgrade was the only major change I made.
>
> After the upgrade I noticed that some of my OSDs were near-full. My
> current ceph setup has two racks defined, each with a couple of hosts.
> One rack was purely for archiving/backup purposes and wasn't very
> active, so I changed the crushmap and moved some hosts from one rack
> to the other. I noticed no problems during this move, and the cluster
> rebalanced itself after the change. The current problems began after
> the upgrade and the host move.
>
> The logs show messages like:
>
> 2014-06-05 12:09:54.233404 osd.0 [ERR] 9.ac shard 0: soid
> 1e3d14ac/rbd_data.867c0514e5cb0.00000000000000e3/head//9 digest
> 693024524 != known digest 2075700712
>
> A manual repair with, for example, "ceph osd repair"

How did you invoke this command?

> doesn't fix the inconsistency. I've investigated the rbd image(s) and
> can pinpoint it to a specific VM. When I delete this VM (with the
> inconsistent pgs in it) from ceph and run a deep-scrub again, the
> inconsistency is gone (which makes sense, because the rbd image is
> removed). But when I re-create the VM, I get the same inconsistency
> errors again. The errors show up in the same ceph pool, but in a
> different pg. At first I thought the base template was the faulty
> image, but even after removing the base VM template and creating a
> new one, the inconsistencies still occur.
>
> In total I have 8 pools, and the problem exists in at least half of
> them.
>
> It doesn't look like the OSDs themselves have any problems or bad HDD
> sectors. The inconsistencies are spread over a bunch of different
> (almost all, actually) OSDs.
>
> The VMs seem to be running fine even with all these inconsistency
> errors, but I'm still worried, because I doubt this is a false
> positive.
>
> I'm at a loss at the moment and not sure what my next step should be.
> Is there anyone who can shed some light on this issue?

If you're still seeing this, you probably want to compare the objects
directly. When the system reports a bad object, go to each OSD that
stores it, grab the file involved from each replica, and do a manual
diff to see how they compare (see the sketches below).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
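
A minimal sketch of the scrub/repair cycle discussed above, using the
PG id 9.ac from the log line in Dennis's report. Note that
"ceph osd repair <id>" asks an OSD to repair every PG it is primary
for, while the per-PG form below targets a single PG; these are
standard Ceph CLI commands of the era, though their effect on digest
mismatches can vary by release:

    # show which PGs are inconsistent
    ceph health detail | grep inconsistent

    # re-verify a single PG, then ask its primary to repair it
    # (PG id taken from the log message above)
    ceph pg deep-scrub 9.ac
    ceph pg repair 9.ac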
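
And a sketch of the manual comparison Greg suggests, assuming a
default Firefly-era FileStore deployment; the pool name, OSD id, and
file paths below are placeholders to adjust for your own layout:

    # map the object from the log line to the OSDs that hold it
    # (pool 9's name can be looked up with "ceph osd lspools")
    ceph osd map <pool-name> rbd_data.867c0514e5cb0.00000000000000e3

    # on each OSD host listed in the acting set, locate the on-disk
    # file for that object inside the PG's directory
    find /var/lib/ceph/osd/ceph-<id>/current/9.ac_head/ \
        -name '*867c0514e5cb0.00000000000000e3*'

    # checksum each replica; differing sums confirm the mismatch,
    # and a byte-level comparison shows where the copies diverge
    md5sum <path-to-object-file>       # run on each replica
    cmp -l replica.osd0 replica.osd1   # after copying to one host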