On Thu, Jun 5, 2014 at 4:38 AM, Dennis Kramer <den...@holmes.nl> wrote:
> Hi all,
>
> A couple of weeks ago I upgraded from Emperor to Firefly.
> I'm using CloudStack with Ceph as the storage backend for VMs and templates.

Which versions exactly were you running before, and which are you running now?

>
> Since the upgrade, Ceph is in HEALTH_ERR with 500+ PGs inconsistent and
> 2000+ scrub errors. I'm not sure if it has to do with Firefly, but the
> upgrade was the only major change I made.
>
> After the upgrade I noticed that some of my OSDs were near full. My
> current Ceph setup has two racks defined, each with a couple of hosts. One
> rack was purely for archiving/backup purposes and wasn't very active, so I
> changed the CRUSH map and moved some hosts from one rack to the other.
> I noticed no problems during this move at all, and the cluster rebalanced
> itself after the change. The current problems began after the upgrade and
> the host move.
>
> The logs show messages like:
>
> 2014-06-05 12:09:54.233404 osd.0 [ERR] 9.ac shard 0: soid
> 1e3d14ac/rbd_data.867c0514e5cb0.00000000000000e3/head//9 digest 693024524 !=
> known digest 2075700712
>
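
For a quick picture of how widespread this is, "ceph health detail" will
list every inconsistent PG, and "ceph pg map" shows which OSDs currently
hold a given one; something like this, using the PG id from the log line
above:

    ceph health detail | grep inconsistent
    ceph pg map 9.ac
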
> Manual repair with, for example, "ceph osd repair"

How did you invoke this command?
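For reference (and this is an assumption about what you ran):
"ceph osd repair <osd-id>" scrubs and repairs every PG on one OSD, while
"ceph pg repair <pgid>" targets a single placement group, e.g.:

    ceph pg repair 9.ac

Also note that, as far as I know, repair in Firefly generally rewrites the
other replicas from the primary's copy, so it won't help if the primary is
the one holding the bad data.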

> doesn't fix the
> inconsistency. I've investigated the RBD image(s) and can pinpoint it to a
> specific VM. When I delete this VM (with the inconsistent PGs in it) from
> Ceph and run a deep-scrub again, the inconsistency is gone (which makes
> sense, because the RBD image is removed). But when I re-create the VM, I
> get the same inconsistency errors again. The errors show up in the same
> Ceph pool, but in a different PG. First I thought the base template was the
> faulty image, but even after removing the base VM template and creating a
> new template, the inconsistencies still occur.
>
> In total I have 8 pools, and the problem exists in at least half of them.
>
> It doesn't look like the OSDs themselves have any problems or bad sectors
> on the HDDs. The inconsistencies are spread over a bunch of different
> (almost all, actually) OSDs.
>
> The VMs seem to be running fine though, even with all these inconsistency
> errors, but I'm still worried because I doubt this is a false positive.
>
> I'm at a loss at the moment and not sure what my next step should be.
> Is there anyone who can shed some light on this issue?

If you're still seeing this, you probably want to compare the objects
directly. When the system reports a bad object, go to each OSD which
stores it, grab the file involved from each replica, and do a manual
diff to see how they compare.
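
Roughly something like the following sketch (assuming the default filestore
layout; the pool name, OSD ids and paths below are placeholders to fill in):

    # map the object from the log line to its PG and the acting OSDs
    ceph osd map <pool-name> rbd_data.867c0514e5cb0.00000000000000e3

    # on each of those OSD hosts, find the file under the PG's directory
    find /var/lib/ceph/osd/ceph-<osd-id>/current/9.ac_head/ \
        -name '*867c0514e5cb0.00000000000000e3*'

    # then compare the copies you collected
    md5sum copy_from_osd_0 copy_from_osd_N
    cmp copy_from_osd_0 copy_from_osd_N

If the checksums differ, a hexdump diff of the two files will show where
and how the replicas diverge.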
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com