Oh, if you were running dev releases, it's not super surprising that the stat 
tracking was at some point buggy.

----- Original Message -----
From: "Dan van der Ster" <d...@vanderster.com>
To: "Samuel Just" <sj...@redhat.com>
Cc: ceph-users@lists.ceph.com
Sent: Thursday, July 23, 2015 8:21:07 AM
Subject: Re: [ceph-users] PGs going inconsistent after stopping the primary

Those pools were a few things: rgw.buckets plus a couple pools we use
for developing new librados clients. But the source of this issue is
likely related to the few pre-hammer development releases (and
crashes) we upgraded through whilst running a large scale test.
Anyway, now I'll know how to better debug this in future so we'll let
you know if it reoccurs.
Cheers, Dan

On Wed, Jul 22, 2015 at 9:42 PM, Samuel Just <sj...@redhat.com> wrote:
> Annoying that we don't know what caused the replica's stat structure to get 
> out of sync.  Let us know if you see it recur.  What were those pools used 
> for?
> -Sam
> ----- Original Message -----
> From: "Dan van der Ster" <d...@vanderster.com>
> To: "Samuel Just" <sj...@redhat.com>
> Cc: ceph-users@lists.ceph.com
> Sent: Wednesday, July 22, 2015 12:36:53 PM
> Subject: Re: [ceph-users] PGs going inconsistent after stopping the primary
> Cool, writing some objects to the affected PGs has stopped the
> consistent/inconsistent cycle. I'll keep an eye on them but this seems
> to have fixed the problem.
> Thanks!!
> Dan
> On Wed, Jul 22, 2015 at 6:07 PM, Samuel Just <sj...@redhat.com> wrote:
>> Looks like it's just a stat error.  The primary appears to have the correct 
>> stats, but the replica for some reason doesn't (thinks there's an object for 
>> some reason).  I bet it clears itself if you perform a write on the pg since 
>> the primary will send over its stats.  We'd need information from when the 
>> stat error originally occurred to debug further.
>> -Sam
>> ----- Original Message -----
>> From: "Dan van der Ster" <d...@vanderster.com>
>> To: ceph-users@lists.ceph.com
>> Sent: Wednesday, July 22, 2015 7:49:00 AM
>> Subject: [ceph-users] PGs going inconsistent after stopping the primary
>> Hi Ceph community,
>> Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64
>> We wanted to post here before the tracker to see if someone else has
>> had this problem.
>> We have a few PGs (different pools) which get marked inconsistent when
>> we stop the primary OSD. The problem is strange because once we
>> restart the primary, then scrub the PG, the PG is marked active+clean.
>> But inevitably next time we stop the primary OSD, the same PG is
>> marked inconsistent again.
>> There is no user activity on this PG, and nothing interesting is
>> logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line
>> mentioning the PG already says inactive+inconsistent).
>> We suspect this is related to garbage files left in the PG folder. One
>> of our PGs is acting basically like above, except it goes through this
>> cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent ->
>> (repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) ->
>> active+clean+inconsistent. This one at least logs:
>> 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
>> 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat
>> mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0
>> hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
>> 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors
>> and this should be debuggable because there is only one object in the pool:
>>     tapetest               55           0         0        73575G           1
>> even though rados ls returns no objects:
>> # rados ls -p tapetest
>> #
>> Any ideas?
>> Cheers, Dan
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
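
For anyone hitting the same symptom, the "stat mismatch" line quoted in the original report can be picked apart mechanically to see which counters disagree. The following is only a minimal sketch: the function names parse_stat_mismatch and mismatched are hypothetical, and the reading that the first number in each pair is what the scrub counted and the second is what the PG's stored stats claim is an assumption (it is consistent with Sam's diagnosis above, but is not confirmed in this thread). The log line is the one from the report, joined back onto a single line.

```python
import re

# Matches "<a>/<b> <label>" pairs such as "0/1 objects" or
# "0/0 hit_set_archive bytes". Labels may contain several words.
PAIR_RE = re.compile(r"(\d+)/(\d+) ([a-z_]+(?: [a-z_]+)*)")


def parse_stat_mismatch(line):
    """Return {label: (first, second)} for a 'stat mismatch' log line.

    Assumption: 'first' is the count found by the scrub and 'second' is
    the count recorded in the PG stats. Returns {} for unrelated lines.
    """
    _, marker, tail = line.partition("stat mismatch, got ")
    if not marker:
        return {}
    return {label: (int(a), int(b)) for a, b, label in PAIR_RE.findall(tail)}


def mismatched(counters):
    """Keep only the counters whose two values differ."""
    return {label: pair for label, pair in counters.items() if pair[0] != pair[1]}


# The log line from the report, unwrapped.
LINE = (
    "2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat "
    "mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 "
    "hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes."
)

if __name__ == "__main__":
    print(mismatched(parse_stat_mismatch(LINE)))
    # {'objects': (0, 1), 'dirty': (0, 1)}
```

Here only the objects and dirty counters differ, which matches the description of a stale stat structure claiming one object in a PG where none exists.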