Hi all,
Apologies for necromancing an old thread, but I was wondering if anyone had any
more thoughts on this. We're running v10.2.9 now and still have 3 PGs
exhibiting this behavior in our cache pool after scrubs, deep-scrubs, and
repair attempts. Some more information below.
Thanks much,
Lincoln
[1]
# rados list-inconsistent-obj 36.14f0 | jq
{
  "epoch": 820795,
  "inconsistents": [
    {
      "object": {
        "name": "1002378e2a6.00000001",
        "nspace": "",
        "locator": "",
        "snap": "head",
        "version": 2251698
      },
      "errors": [],
      "union_shard_errors": [
        "size_mismatch_oi"
      ],
      "selected_object_info": "36:0f29a1d4:::1002378e2a6.00000001:head(737930'2208087 client.36346283.1:5757188 dirty s 4136960 uv 2251698 alloc_hint [0 0])",
      "shards": [
        {
          "osd": 173,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 242,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 295,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        }
      ]
    }
  ]
}
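In case it's useful to anyone else triaging these, here's a rough Python sketch (my own hand-rolled helper, not part of any Ceph tooling) that pairs each shard's reported size with the object-info size embedded in the `selected_object_info` string of a `rados list-inconsistent-obj` report:

```python
import re

def summarize_inconsistencies(report):
    """Given the parsed JSON from `rados list-inconsistent-obj <pgid>`,
    return (object name, osd, shard size, object-info size) tuples.

    In Jewel, selected_object_info is a flat string; the logical size
    appears as 's <bytes>' just before 'uv <version>'.
    """
    rows = []
    for inc in report["inconsistents"]:
        m = re.search(r"\bs (\d+) uv\b", inc.get("selected_object_info", ""))
        expected = int(m.group(1)) if m else None
        for shard in inc["shards"]:
            rows.append((inc["object"]["name"], shard["osd"],
                         shard["size"], expected))
    return rows
```

For the PG above this prints all three shards at size 0 against an expected 4136960, which matches what the scrub ERR lines report.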
2017-08-08 13:26:23.243245 7fafac78a700 -1 log_channel(cluster) log [ERR] :
36.2c85 shard 212 missing 36:a13626c6:::1002378e9a9.00000001:head
2017-08-08 13:26:23.243250 7fafac78a700 -1 log_channel(cluster) log [ERR] :
36.2c85 shard 295: soid 36:a13626c6:::1002378e9a9.00000001:head size 0 != size
4173824 from auth oi 36:a13626c6:::1002378e9a9.00000001:head(737930'2123468
client.36346283.1:5782375 dirty s 4173824 uv 2164627 alloc_hint [0 0])
2017-08-08 13:26:23.243253 7fafac78a700 -1 log_channel(cluster) log [ERR] :
36.2c85 shard 353 missing 36:a13626c6:::1002378e9a9.00000001:head
2017-08-08 13:26:23.243255 7fafac78a700 -1 log_channel(cluster) log [ERR] :
36.2c85 soid 36:a13626c6:::1002378e9a9.00000001:head: failed to pick suitable
auth object
2017-08-08 13:26:23.243362 7fafac78a700 -1 log_channel(cluster) log [ERR] :
scrub 36.2c85 36:a13626c6:::1002378e9a9.00000001:head on disk size (0) does not
match object info size (4173824) adjusted for ondisk to (4173824)
2017-08-08 13:26:34.310237 7fafac78a700 -1 log_channel(cluster) log [ERR] :
36.2c85 scrub 4 errors
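Since I've been grepping these size-mismatch ERR lines out of the OSD logs a lot, here's the small (again, hand-rolled) parser I've been using to tabulate them; it only understands the `size X != size Y from auth oi` format shown above:

```python
import re

# Matches scrub ERR lines of the form:
#   36.2c85 shard 295: soid 36:...:head size 0 != size 4173824 from auth oi ...
ERR_RE = re.compile(
    r"(\d+\.[0-9a-f]+) shard (\d+): soid (\S+) "
    r"size (\d+) != size (\d+) from auth oi"
)

def parse_scrub_errors(lines):
    """Return (pg, shard osd, soid, on-disk size, object-info size)
    for each size-mismatch ERR line."""
    hits = []
    for line in lines:
        m = ERR_RE.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2)), m.group(3),
                         int(m.group(4)), int(m.group(5))))
    return hits
```

Feeding it the log excerpt above gives one row per mismatched shard (the "missing" and "failed to pick suitable auth object" lines are a different format and are skipped).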
> On May 15, 2017, at 5:28 PM, Gregory Farnum <[email protected]> wrote:
>
>
>
> On Mon, May 15, 2017 at 3:19 PM Lincoln Bryant <[email protected]> wrote:
> Hi Greg,
>
> Curiously, some of these scrub errors went away on their own. The example pg
> in the original post is now active+clean, and nothing interesting in the logs:
>
> # zgrep "36.277b" ceph-osd.244*gz
> ceph-osd.244.log-20170510.gz:2017-05-09 06:56:40.739855 7f0184623700 0
> log_channel(cluster) log [INF] : 36.277b scrub starts
> ceph-osd.244.log-20170510.gz:2017-05-09 06:58:01.872484 7f0186e28700 0
> log_channel(cluster) log [INF] : 36.277b scrub ok
> ceph-osd.244.log-20170511.gz:2017-05-10 20:40:47.536974 7f0186e28700 0
> log_channel(cluster) log [INF] : 36.277b scrub starts
> ceph-osd.244.log-20170511.gz:2017-05-10 20:41:38.399614 7f0184623700 0
> log_channel(cluster) log [INF] : 36.277b scrub ok
> ceph-osd.244.log-20170514.gz:2017-05-13 20:49:47.063789 7f0186e28700 0
> log_channel(cluster) log [INF] : 36.277b scrub starts
> ceph-osd.244.log-20170514.gz:2017-05-13 20:50:42.085718 7f0186e28700 0
> log_channel(cluster) log [INF] : 36.277b scrub ok
> ceph-osd.244.log-20170515.gz:2017-05-15 00:10:39.417578 7f0184623700 0
> log_channel(cluster) log [INF] : 36.277b scrub starts
> ceph-osd.244.log-20170515.gz:2017-05-15 00:11:26.189777 7f0186e28700 0
> log_channel(cluster) log [INF] : 36.277b scrub ok
>
> (No matches in the logs for osd 175 and osd 297 — perhaps already rotated
> away?)
>
> Other PGs still exhibit this behavior though:
>
> # rados list-inconsistent-obj 36.2953 | jq .
> {
>   "epoch": 737940,
>   "inconsistents": [
>     {
>       "object": {
>         "name": "1002378da6c.00000001",
>         "nspace": "",
>         "locator": "",
>         "snap": "head",
>         "version": 2213621
>       },
>       "errors": [],
>       "union_shard_errors": [
>         "size_mismatch_oi"
>       ],
>       "selected_object_info": "36:ca95a23b:::1002378da6c.00000001:head(737930'2177823 client.36346283.1:5635626 dirty s 4067328 uv 2213621)",
>       "shards": [
>         {
>           "osd": 113,
>           "errors": [
>             "size_mismatch_oi"
>           ],
>           "size": 0
>         },
>         {
>           "osd": 123,
>           "errors": [
>             "size_mismatch_oi"
>           ],
>           "size": 0
>         },
>         {
>           "osd": 173,
>           "errors": [
>             "size_mismatch_oi"
>           ],
>           "size": 0
>         }
>       ]
>     }
>   ]
> }
>
> Perhaps new data being written to this pg cleared things up?
>
> Hmm, somebody else did report the same thing (and the symptoms disappearing)
> recently as well. I wonder if we broke the synchronization around eviction
> and scrubbing within cache pools. Sage, you've done work on cache pools
> recently; any thoughts?
> -Greg
>
>
> The only other data point I can add: due to some tweaking of the cache tier
> size before this happened, the cache tier briefly (maybe under an hour?)
> reported near full / full in `ceph -s`.
>
> Thanks for looking into this.
>
> --Lincoln
>
> > On May 15, 2017, at 4:50 PM, Gregory Farnum <[email protected]> wrote:
> >
> > On Mon, May 1, 2017 at 9:28 AM, Lincoln Bryant <[email protected]>
> > wrote:
> >> Hi all,
> >>
> >> I’ve run across a peculiar issue on 10.2.7. On my 3x replicated cache
> >> tiering cache pool, routine scrubbing suddenly found a bunch of PGs with
> >> size_mismatch_oi errors. From the “rados list-inconsistent-obj” tool[1], I
> >> see that all OSDs are reporting size 0 for a particular object. I’ve checked
> >> this object on disk, and it is indeed 0 bytes:
> >> -rw-r--r-- 1 root root 0 Apr 29 06:12
> >> 100235614fe.00000005__head_6E9A677B__24
> >>
> >> I’ve tried re-issuing a scrub, which informs me that the object info size
> >> (2994176) doesn’t match the on-disk size (0) (see [2]). I’ve tried a
> >> repair operation as well to no avail.
> >>
> >> For what it’s worth, this particular cluster is currently migrating
> >> several disks from one CRUSH root to another, and there is a nightly cache
> >> flush/eviction script that is lowering the cache_target_*_ratios before
> >> raising them again in the morning.
> >>
> >> This issue is currently affecting ~10 PGs in my cache pool. Any ideas how
> >> to proceed here?
> >
> > Did anything come from this? It's tickling my brain (especially with
> > the cache pool) but I'm not seeing anything relevant when I search my
> > email.
> >
> >>
> >> Thanks,
> >> Lincoln
> >>
> >> [1]:
> >> {
> >>   "epoch": 721312,
> >>   "inconsistents": [
> >>     {
> >>       "object": {
> >>         "name": "100235614fe.00000005",
> >>         "nspace": "",
> >>         "locator": "",
> >>         "snap": "head",
> >>         "version": 2233551
> >>       },
> >>       "errors": [],
> >>       "union_shard_errors": [
> >>         "size_mismatch_oi"
> >>       ],
> >>       "selected_object_info": "36:dee65976:::100235614fe.00000005:head(737928'2182216 client.36346283.1:5754260 dirty s 2994176 uv 2233551)",
> >>       "shards": [
> >>         {
> >>           "osd": 175,
> >>           "errors": [
> >>             "size_mismatch_oi"
> >>           ],
> >>           "size": 0
> >>         },
> >>         {
> >>           "osd": 244,
> >>           "errors": [
> >>             "size_mismatch_oi"
> >>           ],
> >>           "size": 0
> >>         },
> >>         {
> >>           "osd": 297,
> >>           "errors": [
> >>             "size_mismatch_oi"
> >>           ],
> >>           "size": 0
> >>         }
> >>       ]
> >>     }
> >>   ]
> >> }
> >>
> >> [2]:
> >> 2017-05-01 10:50:13.812992 7f0184623700 0 log_channel(cluster) log [INF]
> >> : 36.277b scrub starts
> >> 2017-05-01 10:51:02.495229 7f0186e28700 -1 log_channel(cluster) log [ERR]
> >> : 36.277b shard 175: soid 36:dee65976:::100235614fe.00000005:head size 0
> >> != size 2994176 from auth oi
> >> 36:dee65976:::100235614fe.00000005:head(737928'2182216
> >> client.36346283.1:5754260 dirty s 2994176 uv 2233551)
> >> 2017-05-01 10:51:02.495234 7f0186e28700 -1 log_channel(cluster) log [ERR]
> >> : 36.277b shard 244: soid 36:dee65976:::100235614fe.00000005:head size 0
> >> != size 2994176 from auth oi
> >> 36:dee65976:::100235614fe.00000005:head(737928'2182216
> >> client.36346283.1:5754260 dirty s 2994176 uv 2233551)
> >> 2017-05-01 10:51:02.495326 7f0186e28700 -1 log_channel(cluster) log [ERR]
> >> : 36.277b shard 297: soid 36:dee65976:::100235614fe.00000005:head size 0
> >> != size 2994176 from auth oi
> >> 36:dee65976:::100235614fe.00000005:head(737928'2182216
> >> client.36346283.1:5754260 dirty s 2994176 uv 2233551)
> >> 2017-05-01 10:51:02.495328 7f0186e28700 -1 log_channel(cluster) log [ERR]
> >> : 36.277b soid 36:dee65976:::100235614fe.00000005:head: failed to pick
> >> suitable auth object
> >> 2017-05-01 10:51:02.495450 7f0186e28700 -1 log_channel(cluster) log [ERR]
> >> : scrub 36.277b 36:dee65976:::100235614fe.00000005:head on disk size (0)
> >> does not match object info size (2994176) adjusted for ondisk to (2994176)
> >> 2017-05-01 10:51:20.223733 7f0184623700 -1 log_channel(cluster) log [ERR]
> >> : 36.277b scrub 4 errors
> >>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com