
Thank you for providing me this level of detail.

I ended up just failing the drive since it is still under support and we had in 
fact gotten emails about the health of this drive in the past.

I will, however, use this in the future if we have an issue with a pg and it is
the first time we have had an issue with the drive and/or it's no longer under
support.

Thanks again.


> Hi Shain,
> What I would do:
> Take osd.32 out:
> # systemctl stop ceph-osd@32
> # ceph osd out osd.32
> This will cause rebalancing.
> To repair/reuse the drive you can do:
> # smartctl -t long /dev/sdX
> This will start a long self-test on the drive which, I bet, will abort after
> a while with something like
> # smartctl -a /dev/sdX
> [...]
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>     Description                              number   (hours)
> # 1  Background long   Failed in segment -->       -    4378          35494670 [0x3 0x11 0x0]
> [...]
> Now mark the segment as "malfunction" (my system was Ubuntu):
> # apt install sg3-utils/xenial
> # sg_verify --lba=35494670 /dev/sdX1
> # sg_reassign --address=35494670 /dev/sdX
> # sg_reassign --grown /dev/sdX
> The next long test should hopefully work fine:
> # smartctl -t long /dev/sdX
> If not, repeat the above with the newly found defective LBA.
> I've done this three times successfully, but not with an error on a primary pg.
> After that you can start the osd with
> # systemctl start ceph-osd@32
> # ceph osd in osd.32
> - Mehmet
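
A rough sketch of how the failing LBA could be pulled out of the self-test
log, so it can be passed to sg_verify and sg_reassign as described above.
The function name failed_lbas is made up for illustration, and the sketch
assumes the SCSI-style log layout from the sample, with each test entry on a
single line (not wrapped by the mail client).

```shell
# Hypothetical helper: read `smartctl -a` output on stdin and print the
# LBA_first_err value of every self-test entry whose status contains "Failed".
# Assumes each entry is one line that ends in "<LBA> [SK ASC ASQ]".
failed_lbas() {
    awk '/^# *[0-9]+/ && /Failed/ {
        for (i = 2; i <= NF; i++)
            # The LBA is the numeric field right before the "[0x.." sense data.
            if ($i ~ /^\[0x/ && $(i-1) ~ /^[0-9]+$/) print $(i-1)
    }'
}
```

Piping the drive's log through it, as in "smartctl -a /dev/sdX | failed_lbas",
would print 35494670 for the sample entry above. Entries that completed
without an error carry no numeric LBA and are skipped.
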
>> Brian,
>> Thank you for the detailed information.  I was able to compare the 3
>> hexdump files and it looks like the primary pg is the odd man out.
>> I stopped the OSD and then I attempted to move the object:
>> root@hqosd3:/var/lib/ceph/osd/ceph-32/current/3.2b8_head/DIR_8/DIR_B/DIR_2/DIR_A/DIR_0#
>> mv rb.0.fe307e.238e1f29.00000076024c__head_4650A2B8__3 /root
>> mv: error reading
>> ‘rb.0.fe307e.238e1f29.00000076024c__head_4650A2B8__3’:
>> Input/output error
>> mv: failed to extend
>> ‘/root/rb.0.fe307e.238e1f29.00000076024c__head_4650A2B8__3’:
>> Input/output error
>> However I got a nice Input/output error instead.
>> I assume that this is not the case normally.
>> Any ideas on how I should proceed at this point? Should I fail out
>> this OSD and replace the drive (I have had no indication other than
>> the IO error that there is an issue with this disk), or is there
>> something I can try first?
>> Thanks again,
>> Shain
>>> We went through a period of time where we were experiencing these
>>> daily...
>>> cd to the PG directory on each OSD and do a find for
>>> "238e1f29.00000076024c" (mentioned in your error message). This will
>>> likely return a file that has a backslash in the name, something like
>>> rbd\udata.238e1f29.00000076024c_head_blah_1f...
>>> hexdump -C the object (tab completing the name helps) and pipe the
>>> output to a different location. Once you obtain the hexdumps, do a
>>> diff or cmp against them and find which one is not like the others.
>>> If the primary is not the outlier, perform the PG repair without
>>> worry. If the primary is the outlier, you will need to stop the OSD,
>>> move the object out of place, start it back up and then it will be
>>> okay to issue a PG repair.
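
The comparison step could be scripted roughly as follows. This is only a
sketch: it swaps the hexdump/diff comparison for sha256sum checksums, the
function name find_outliers is hypothetical, and it assumes the copies of the
object from each OSD are reachable as local files. If there is no clear
majority among the copies, the result is not meaningful.

```shell
# Hypothetical helper: print every file whose checksum differs from the most
# common checksum among the arguments (the "one not like the others").
find_outliers() {
    # First pass: determine the most common checksum across all copies.
    majority=$(for f in "$@"; do sha256sum < "$f"; done |
        awk '{print $1}' | sort | uniq -c | sort -rn | awk 'NR==1 {print $2}')
    # Second pass: report each copy that does not match it.
    for f in "$@"; do
        sum=$(sha256sum < "$f" | awk '{print $1}')
        [ "$sum" = "$majority" ] || printf '%s\n' "$f"
    done
}
```

Called as "find_outliers copy-a copy-b copy-c", it prints the path of the
copy that disagrees with the other two. If that copy is the one on the
primary, the stop/move/start steps apply before issuing the repair.
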
>>> Other less common inconsistent PGs we see are differing object sizes
>>> (easy to detect with a simple listing of file sizes) and differing
>>> attributes ("attr -l", but the error logs are usually precise in
>>> identifying the problematic PG copy).
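
For the size check, a minimal sketch along these lines may be enough. The
function name list_sizes is made up for illustration, and it again assumes
the copies are available as local files.

```shell
# Hypothetical helper: print "<size in bytes><TAB><path>" for each copy so
# that a differing object size stands out at a glance.
list_sizes() {
    for f in "$@"; do
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}
```

Any copy whose first column differs from the others is the one to look at
more closely.
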
>>>> Hello,
>>>> Ceph status is showing:
>>>> 1 pgs inconsistent
>>>> 1 scrub errors
>>>> 1 active+clean+inconsistent
>>>> I located the error messages in the logfile after querying the pg
>>>> in question:
>>>> root@hqosd3:/var/log/ceph# zgrep -Hn 'ERR' ceph-osd.32.log.1.gz
>>>> ceph-osd.32.log.1.gz:846:2017-03-17 02:25:20.281608 7f7744d7f700
>>>> -1 log_channel(cluster) log [ERR] : 3.2b8 shard 32: soid
>>>> 3/4650a2b8/rb.0.fe307e.238e1f29.00000076024c/head candidate had a
>>>> read error, data_digest 0x84c33490 != known data_digest 0x974a24a7
>>>> from auth shard 62
>>>> ceph-osd.32.log.1.gz:847:2017-03-17 02:30:40.264219 7f7744d7f700
>>>> -1 log_channel(cluster) log [ERR] : 3.2b8 deep-scrub 0 missing, 1
>>>> inconsistent objects
>>>> ceph-osd.32.log.1.gz:848:2017-03-17 02:30:40.264307 7f7744d7f700
>>>> -1 log_channel(cluster) log [ERR] : 3.2b8 deep-scrub 1 errors
>>>> Is this a case where it would be safe to use 'ceph pg repair'? The
>>>> documentation indicates there are times where running this command
>>>> is less safe than others...and I would like to be sure before I do
>>>> so.
>>>> Thanks,
>>>> Shain
