Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo


> Il giorno 11 lug 2018, alle ore 23:25, Gregory Farnum  ha 
> scritto:
> 
>> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo 
>>  wrote:
>> OK, I found where the object is:
>> 
>> 
>> ceph osd map cephfs_metadata 200.
>> osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 
>> 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
>> 
>> 
>> So, looking at the osds 23, 35 and 18 logs in fact I see:
>> 
>> 
>> osd.23:
>> 
>> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log 
>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
>> 10:292cf221:::200.:head
>> 
>> 
>> osd.35:
>> 
>> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log 
>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
>> 10:292cf221:::200.:head
>> 
>> 
>> osd.18:
>> 
>> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log 
>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
>> 10:292cf221:::200.:head
>> 
>> 
>> So, basically the same error everywhere.
>> 
>> I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may 
>> help.
>> 
>> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and 
>> no disk problems anywhere. No relevant errors in syslogs, the hosts are 
>> just fine. I cannot exclude an error on the RAID controllers, but 2 of 
>> the OSDs with 10.14 are on a SAN system and one on a different one, so I 
>> would tend to exclude they both had (silent) errors at the same time.
> 
> That's fairly distressing. At this point I'd probably try extracting the 
> object using ceph-objectstore-tool and seeing if it decodes properly as an 
> mds journal. If it does, you might risk just putting it back in place to 
> overwrite the crc.
> 

Ok, I guess I know how to extract the object from a given OSD, but I'm not sure 
how to check whether it decodes as an MDS journal; is there a procedure for this? 
However, if exporting all the copies from all the OSDs yields the same object 
md5sum, I believe I can try to overwrite the object directly, as things cannot 
get worse than this, correct?
Also, I'd need confirmation of the procedure to follow in this case, where 
possibly all copies of the object are bad. I would try the following:

- set the noout flag
- bring down all the OSDs where the object is present
- replace the object in all the stores
- bring the OSDs up again
- unset the noout flag

Correct?
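
To be concrete, the extract-and-compare step I have in mind would be roughly the 
following; the OSD data path and output file names are only placeholders for 
illustration, the object name is the one shown by "ceph osd map" above, and 
filestore OSDs may also need --journal-path:

ceph osd set noout
systemctl stop ceph-osd@23
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 \
    --pgid 10.14 '<object name>' get-bytes /tmp/200.header.osd23
systemctl start ceph-osd@23
ceph osd unset noout
# repeat on the hosts of osd.35 and osd.18, then compare
md5sum /tmp/200.header.osd*

The corresponding write-back operation would be set-bytes, again with the OSD 
stopped, and I would of course keep a copy of the original object before 
overwriting anything.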


> However, I'm also quite curious how it ended up that way, with a checksum 
> mismatch but identical data (and identical checksums!) across the three 
> replicas. Have you previously done some kind of scrub repair on the metadata 
> pool?

No, at least not on this PG; I only remember a repair, but it was on a 
different pool.

> Did the PG perhaps get backfilled due to cluster changes?

That might be the case, as we sometimes have to reboot the OSDs when they 
crash. Also, yesterday we rebooted all of them, but this always happens in 
sequence, one by one, never all at the same time.
Thanks for the help,

   Alessandro

> -Greg
>  
>> 
>> Thanks,
>> 
>> 
>>  Alessandro
>> 
>> 
>> 
>> Il 11/07/18 18:56, John Spray ha scritto:
>> > On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
>> >  wrote:
>> >> Hi John,
>> >>
>> >> in fact I get an I/O error by hand too:
>> >>
>> >>
>> >> rados get -p cephfs_metadata 200. 200.
>> >> error getting cephfs_metadata/200.: (5) Input/output error
>> > Next step would be to go look for corresponding errors on your OSD
>> > logs, system logs, and possibly also check things like the SMART
>> > counters on your hard drives for possible root causes.
>> >
>> > John
>> >
>> >
>> >
>> >>
>> >> Can this be recovered someway?
>> >>
>> >> Thanks,
>> >>
>> >>
>> >>   Alessandro
>> >>
>> >>
>> >> Il 11/07/18 18:33, John Spray ha scritto:
>> >>> On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
>> >>>  wrote:
>>  Hi,
>> 
>>  after the upgrade to luminous 12.2.6 today, all our MDSes have been
>>  marked as damaged. Trying to restart the instances only result in
>>  standby MDSes. We currently have 2 filesystems active and 2 MDSes each.
>> 
>>  I found the following error messages in the mon:
>> 
>> 
>>  mds.0 :6800/2412911269 down:damaged
>>  mds.1 :6800/830539001 down:damaged
>>  mds.0 :6800/4080298733 down:damaged
>> 
>> 
>>  Whenever I try to force the repaired state with ceph mds repaired
>>  : I get something like this in the MDS logs:
>> 
>> 
>>  2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
>>  error getting journal off disk
>>  2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
>>  [ERR] : Error recovering journal 0x201: (5) Input/output error
>> >>> An EIO reading the journal header is pretty scary.  The MDS itself
>> >>> probably can't tell you much more about this:

Re: [ceph-users] SSDs for data drives

2018-07-12 Thread Adrian Saul

We started our cluster with consumer (Samsung EVO) disks and the write 
performance was pitiful: they had periodic spikes in latency (an average of 8 ms, 
but with much higher peaks) and just did not perform anywhere near what we were 
expecting.

When we replaced them with SM863-based devices the difference was night and day. 
The DC-grade disks held a nearly constant low latency (consistently sub-ms), no 
spiking, and performance was massively better. For a period I ran both types of 
disk in the cluster and was able to graph them side by side under the same 
workload. This was not even a moderately loaded cluster, so I am glad we 
discovered this before we went full scale.

So while you certainly can go cheap and cheerful and let data availability 
be handled by Ceph, don't expect the performance to keep up.
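
If you want to sanity-check a candidate disk yourself before buying in volume, 
the usual quick test is single-threaded synchronous 4k writes, which is roughly 
the pattern the OSD journal/WAL produces. A minimal sketch with fio follows; the 
device name is a placeholder, and note that this overwrites data on the target 
device:

fio --name=sync-write-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based

Consumer drives without power-loss protection typically collapse on this test, 
while DC-grade drives hold a steady, much higher IOPS figure.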



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Satish 
Patel
Sent: Wednesday, 11 July 2018 10:50 PM
To: Paul Emmerich 
Cc: ceph-users 
Subject: Re: [ceph-users] SSDs for data drives

Prices go way up if I am picking the Samsung SM863a for all data drives.

We have many servers running on consumer-grade SSD drives and we have never 
noticed any performance problem or any fault so far (but we have never used Ceph 
before).

I thought that was the whole point of Ceph: to provide high availability if a 
drive goes down, and also parallel reads from multiple OSD nodes.

Sent from my iPhone

On Jul 11, 2018, at 6:57 AM, Paul Emmerich 
mailto:paul.emmer...@croit.io>> wrote:
Hi,

we‘ve no long-term data for the SM variant.
Performance is fine as far as we can tell, but the main difference between 
these two models should be endurance.


Also, I forgot to mention that my experiences are only for the 1, 2, and 4 TB 
variants. Smaller SSDs are often proportionally slower (especially below 500GB).

Paul

Robert Stanford mailto:rstanford8...@gmail.com>>:
Paul -

 That's extremely helpful, thanks.  I do have another cluster that uses Samsung 
SM863a just for journal (spinning disks for data).  Do you happen to have an 
opinion on those as well?

On Wed, Jul 11, 2018 at 4:03 AM, Paul Emmerich 
mailto:paul.emmer...@croit.io>> wrote:
PM/SM863a are usually great disks and should be the default go-to option; they 
outperform even the more expensive PM1633 in our experience.
(But that really doesn't matter if it's for the full OSD and not for a dedicated 
WAL/journal.)

We got a cluster with a few hundred SanDisk Ultra II (discontinued, I believe) 
that was built on a budget.
Not the best disk, but great value. They have been running for ~3 years now 
with very few failures and okayish overall performance.

We also got a few clusters with a few hundred SanDisk Extreme Pro, but we are 
not yet sure about their long-term durability as they are only ~9 months old 
(an average of ~1000 write IOPS on each disk over that time).
Some of them report only 50-60% lifetime left.

For NVMe, the Intel NVMe 750 is still a great disk.

Be careful to get these exact models. Seemingly similar disks might be just 
completely bad; for example, the Samsung PM961 is just unusable for Ceph in our 
experience.

Paul

2018-07-11 10:14 GMT+02:00 Wido den Hollander 
mailto:w...@42on.com>>:


On 07/11/2018 10:10 AM, Robert Stanford wrote:
>
>  In a recent thread the Samsung SM863a was recommended as a journal
> SSD.  Are there any recommendations for data SSDs, for people who want
> to use just SSDs in a new Ceph cluster?
>

Depends on what you are looking for, SATA, SAS3 or NVMe?

I have very good experiences with these drives running with BlueStore in
them in SuperMicro machines:

- SATA: Samsung PM863a
- SATA: Intel S4500
- SAS: Samsung PM1633
- NVMe: Samsung PM963

Running WAL+DB+DATA with BlueStore on the same drives.

Wido

>  Thank you
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] unfound blocks IO or gives IO error?

2018-07-12 Thread Dan van der Ster
On Wed, Jul 11, 2018 at 11:40 PM Gregory Farnum  wrote:
>
> On Mon, Jun 25, 2018 at 12:34 AM Dan van der Ster  wrote:
>>
>> On Fri, Jun 22, 2018 at 10:44 PM Gregory Farnum  wrote:
>> >
>> > On Fri, Jun 22, 2018 at 6:22 AM Sergey Malinin  wrote:
>> >>
>> >> From 
>> >> http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/ 
>> >> :
>> >>
>> >> "Now 1 knows that these object exist, but there is no live ceph-osd who 
>> >> has a copy. In this case, IO to those objects will block, and the cluster 
>> >> will hope that the failed node comes back soon; this is assumed to be 
>> >> preferable to returning an IO error to the user."
>> >
>> >
>> > This is definitely the default and the way I recommend you run a cluster. 
>> > But do keep in mind sometimes other layers in your stack have their own 
>> > timeouts and will start throwing errors if the Ceph library doesn't return 
>> > an IO quickly enough. :)
>>
>> Right, that's understood. This is the nice behaviour of virtio-blk vs
>> virtio-scsi: the latter has a timeout but blk blocks forever.
>> On 5000 attached volumes we saw around 12 of these IO errors, and this
>> was the first time in 5 years of upgrades that an IO error happened...
>
>
> Did you ever get more info about this? An unexpected EIO return-to-clients 
> turned up on the mailing list today (http://tracker.ceph.com/issues/24875) 
> but in a brief poke around I didn't see anything about missing objects doing 
> so.

Not really. We understood *why* we had flapping OSDs following the
upgrade -- it was due to us having 'mon osd report timeout = 60'
(default 900), a setting we had in Jewel as a workaround for some
strange network issues in our data centre. It turns out that in
Luminous this setting is ultra dangerous -- the OSDs don't report
pgstats back to the mon anymore, so the mon starts marking OSDs down
every 60s. The resulting flapping led to some momentarily unfound
objects, and that is when we saw the EIO on the clients.
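
For anyone carrying the same override, checking and reverting it looks roughly
like this; the mon id is a placeholder, and removing the line from ceph.conf is
what makes the change stick across restarts:

ceph daemon mon.<id> config get mon_osd_report_timeout
ceph tell mon.* injectargs '--mon_osd_report_timeout 900'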

In the days following the upgrade, deep-scrub did find a handful of
inconsistent objects, e.g.

2018-06-25 20:41:18.070684 7f78580af700 -1 log_channel(cluster) log
[ERR] : 4.1e0 : soid
4:078dcd53:::rbd_data.4c50bf229fbf77.00011ec6:head data_digest
0xd3329392 != data_digest 0x8a882df4 from shard 143
2018-06-25 21:07:14.157514 7f78580af700 -1 log_channel(cluster) log
[ERR] : 4.1e0 repair 0 missing, 1 inconsistent objects
2018-06-25 21:07:14.157952 7f78580af700 -1 log_channel(cluster) log
[ERR] : 4.1e0 repair 1 errors, 1 fixed

But I didn't find any corresponding crc errors of reads from those
objects before they were found to be inconsistent.
And no IO errors since the upgrade...

Alessandro's issue sounds pretty scary.

-- Dan


> -Greg
>
>>
>>
>> -- dan
>>
>>
>> > -Greg
>> >
>> >>
>> >>
>> >> On 22.06.2018, at 16:16, Dan van der Ster  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> Quick question: does an IO with an unfound object result in an IO
>> >> error or should the IO block?
>> >>
>> >> During a jewel to luminous upgrade some PGs passed through a state
>> >> with unfound objects for a few seconds. And this seems to match the
>> >> times when we had a few IO errors on RBD attached volumes.
>> >>
>> >> Wondering what is the correct behaviour here...
>> >>
>> >> Cheers, Dan
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS damaged

2018-07-12 Thread Dan van der Ster
On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum  wrote:
>
> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo 
>  wrote:
>>
>> OK, I found where the object is:
>>
>>
>> ceph osd map cephfs_metadata 200.
>> osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg
>> 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
>>
>>
>> So, looking at the osds 23, 35 and 18 logs in fact I see:
>>
>>
>> osd.23:
>>
>> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
>> 10:292cf221:::200.:head
>>
>>
>> osd.35:
>>
>> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
>> 10:292cf221:::200.:head
>>
>>
>> osd.18:
>>
>> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
>> 10:292cf221:::200.:head
>>
>>
>> So, basically the same error everywhere.
>>
>> I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may
>> help.
>>
>> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and
>> no disk problems anywhere. No relevant errors in syslogs, the hosts are
>> just fine. I cannot exclude an error on the RAID controllers, but 2 of
>> the OSDs with 10.14 are on a SAN system and one on a different one, so I
>> would tend to exclude they both had (silent) errors at the same time.
>
>
> That's fairly distressing. At this point I'd probably try extracting the 
> object using ceph-objectstore-tool and seeing if it decodes properly as an 
> mds journal. If it does, you might risk just putting it back in place to 
> overwrite the crc.
>

Wouldn't it be easier to scrub repair the PG to fix the crc?

Alessandro, did you already try a deep-scrub on pg 10.14? I expect
it'll show an inconsistent object. Though, I'm unsure if repair will
correct the crc given that in this case *all* replicas have a bad crc.
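
For reference, the sequence would be something like the following (10.14 as
above); note that list-inconsistent-obj only returns data once a deep scrub has
actually completed on that PG:

ceph pg deep-scrub 10.14
rados list-inconsistent-obj 10.14 --format=json-pretty
ceph pg repair 10.14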

--Dan

> However, I'm also quite curious how it ended up that way, with a checksum 
> mismatch but identical data (and identical checksums!) across the three 
> replicas. Have you previously done some kind of scrub repair on the metadata 
> pool? Did the PG perhaps get backfilled due to cluster changes?
> -Greg
>
>>
>>
>> Thanks,
>>
>>
>>  Alessandro
>>
>>
>>
>> Il 11/07/18 18:56, John Spray ha scritto:
>> > On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
>> >  wrote:
>> >> Hi John,
>> >>
>> >> in fact I get an I/O error by hand too:
>> >>
>> >>
>> >> rados get -p cephfs_metadata 200. 200.
>> >> error getting cephfs_metadata/200.: (5) Input/output error
>> > Next step would be to go look for corresponding errors on your OSD
>> > logs, system logs, and possibly also check things like the SMART
>> > counters on your hard drives for possible root causes.
>> >
>> > John
>> >
>> >
>> >
>> >>
>> >> Can this be recovered someway?
>> >>
>> >> Thanks,
>> >>
>> >>
>> >>   Alessandro
>> >>
>> >>
>> >> Il 11/07/18 18:33, John Spray ha scritto:
>> >>> On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
>> >>>  wrote:
>>  Hi,
>> 
>>  after the upgrade to luminous 12.2.6 today, all our MDSes have been
>>  marked as damaged. Trying to restart the instances only result in
>>  standby MDSes. We currently have 2 filesystems active and 2 MDSes each.
>> 
>>  I found the following error messages in the mon:
>> 
>> 
>>  mds.0 :6800/2412911269 down:damaged
>>  mds.1 :6800/830539001 down:damaged
>>  mds.0 :6800/4080298733 down:damaged
>> 
>> 
>>  Whenever I try to force the repaired state with ceph mds repaired
>>  : I get something like this in the MDS logs:
>> 
>> 
>>  2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
>>  error getting journal off disk
>>  2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
>>  [ERR] : Error recovering journal 0x201: (5) Input/output error
>> >>> An EIO reading the journal header is pretty scary.  The MDS itself
>> >>> probably can't tell you much more about this: you need to dig down
>> >>> into the RADOS layer.  Try reading the 200. object (that
>> >>> happens to be the rank 0 journal header, every CephFS filesystem
>> >>> should have one) using the `rados` command line tool.
>> >>>
>> >>> John
>> >>>
>> >>>
>> >>>
>>  Any attempt of running the journal export results in errors, like this 
>>  one:
>> 
>> 
>>  cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
>>  Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1
>>  Header 200. is unreadable
>> 
>>  2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not
>>  readable, attempt object-by-object dump with `rados`
>> 
>> 
>>  Same happens for re

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo



On 12/07/18 10:58, Dan van der Ster wrote:

On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum  wrote:

On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo 
 wrote:

OK, I found where the object is:


ceph osd map cephfs_metadata 200.
osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg
10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)


So, looking at the osds 23, 35 and 18 logs in fact I see:


osd.23:

2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.35:

2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.18:

2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.:head


So, basically the same error everywhere.

I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may
help.

No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and
no disk problems anywhere. No relevant errors in syslogs, the hosts are
just fine. I cannot exclude an error on the RAID controllers, but 2 of
the OSDs with 10.14 are on a SAN system and one on a different one, so I
would tend to exclude they both had (silent) errors at the same time.


That's fairly distressing. At this point I'd probably try extracting the object 
using ceph-objectstore-tool and seeing if it decodes properly as an mds 
journal. If it does, you might risk just putting it back in place to overwrite 
the crc.


Wouldn't it be easier to scrub repair the PG to fix the crc?


this is what I already instructed the cluster to do (a deep scrub), but 
I'm not sure it can repair when all replicas are bad, as seems to be 
the case here.




Alessandro, did you already try a deep-scrub on pg 10.14?


I'm waiting for the cluster to do that; I sent the deep-scrub request earlier this morning.


  I expect
it'll show an inconsistent object. Though, I'm unsure if repair will
correct the crc given that in this case *all* replicas have a bad crc.


Exactly, this is what I wonder too.
Cheers,

    Alessandro



--Dan


However, I'm also quite curious how it ended up that way, with a checksum 
mismatch but identical data (and identical checksums!) across the three 
replicas. Have you previously done some kind of scrub repair on the metadata 
pool? Did the PG perhaps get backfilled due to cluster changes?
-Greg



Thanks,


  Alessandro



On 11/07/18 18:56, John Spray wrote:

On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
 wrote:

Hi John,

in fact I get an I/O error by hand too:


rados get -p cephfs_metadata 200. 200.
error getting cephfs_metadata/200.: (5) Input/output error

Next step would be to go look for corresponding errors on your OSD
logs, system logs, and possibly also check things like the SMART
counters on your hard drives for possible root causes.

John




Can this be recovered someway?

Thanks,


   Alessandro


On 11/07/18 18:33, John Spray wrote:

On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
 wrote:

Hi,

after the upgrade to luminous 12.2.6 today, all our MDSes have been
marked as damaged. Trying to restart the instances only result in
standby MDSes. We currently have 2 filesystems active and 2 MDSes each.

I found the following error messages in the mon:


mds.0 :6800/2412911269 down:damaged
mds.1 :6800/830539001 down:damaged
mds.0 :6800/4080298733 down:damaged


Whenever I try to force the repaired state with ceph mds repaired
: I get something like this in the MDS logs:


2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
error getting journal off disk
2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
[ERR] : Error recovering journal 0x201: (5) Input/output error

An EIO reading the journal header is pretty scary.  The MDS itself
probably can't tell you much more about this: you need to dig down
into the RADOS layer.  Try reading the 200. object (that
happens to be the rank 0 journal header, every CephFS filesystem
should have one) using the `rados` command line tool.

John




Any attempt of running the journal export results in errors, like this one:


cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1
Header 200. is unreadable

2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not
readable, attempt object-by-object dump with `rados`


Same happens for recover_dentries

cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header
200. is unreadable
Errors:
0

Is there something I could try to do to have the cluster back?

I was able to dump the con

Re: [ceph-users] RADOSGW err=Input/output error

2018-07-12 Thread Will Zhao
Hi :

I use libs3 to run the test. The network is IB (InfiniBand).
The error in libcurl is the following:

== Info: Operation too slow. Less than 1 bytes/sec transferred the last 15
seconds

== Info: Closing connection 766

and a full request error in rgw is the following:

2018-07-12 15:42:30.501074 7fe8bc83f700  1 civetweb: 0x7fe940ebf000:
10.5.131.193 - - [12/Jul/2018:15:42:15 +0800] "PUT /yesonggao.hdd77/2217856
HTTP/1.1" 1 0 - Mozilla/4.0 (Compatible; s3; libs3 2.0; Linux x86_64)
2018-07-12 15:43:05.332318 7fe8bc83f700 20 HTTP_HOST=localhost
2018-07-12 15:43:05.332326 7fe8bc83f700 20 HTTP_VERSION=1.1
2018-07-12 15:43:05.332327 7fe8bc83f700 20 REMOTE_ADDR=10.5.131.193
2018-07-12 15:43:05.332328 7fe8bc83f700 20 REQUEST_METHOD=HEAD
2018-07-12 15:43:05.332330 7fe8bc83f700 20 REQUEST_URI=/
2018-07-12 15:43:05.332331 7fe8bc83f700 20 SCRIPT_URI=/
2018-07-12 15:43:05.332331 7fe8bc83f700 20 SERVER_PORT=80
2018-07-12 15:43:05.332333 7fe8bc83f700  1 == starting new request
req=0x7fe8bc839110 =
2018-07-12 15:43:05.332344 7fe8bc83f700  2 req 18326229:0.10::HEAD
/::initializing for trans_id =
tx00117a2d5-005b470689-d41b-default
2018-07-12 15:43:05.332351 7fe8bc83f700 10 rgw api priority: s3=5
s3website=4
2018-07-12 15:43:05.332352 7fe8bc83f700 10 host=localhost
2018-07-12 15:43:05.332354 7fe8bc83f700 20 subdomain= domain=
in_hosted_domain=0 in_hosted_domain_s3website=0
2018-07-12 15:43:05.332355 7fe8bc83f700 20 final domain/bucket subdomain=
domain= in_hosted_domain=0 in_hosted_domain_s3website=0 s->info.domain=
s->info.request_uri=/
2018-07-12 15:43:05.332367 7fe8bc83f700 20 get_handler
handler=26RGWHandler_REST_Service_S3
2018-07-12 15:43:05.332370 7fe8bc83f700 10
handler=26RGWHandler_REST_Service_S3
2018-07-12 15:43:05.332371 7fe8bc83f700  2 req 18326229:0.38:s3:HEAD
/::getting op 3
2018-07-12 15:43:05.332373 7fe8bc83f700 10 op=26RGWListBuckets_ObjStore_S3
2018-07-12 15:43:05.332374 7fe8bc83f700  2 req 18326229:0.41:s3:HEAD
/:list_buckets:verifying requester
2018-07-12 15:43:05.332376 7fe8bc83f700 20
rgw::auth::StrategyRegistry::s3_main_strategy_t: trying
rgw::auth::s3::AWSAuthStrategy
2018-07-12 15:43:05.332378 7fe8bc83f700 20 rgw::auth::s3::AWSAuthStrategy:
trying rgw::auth::s3::S3AnonymousEngine
2018-07-12 15:43:05.332381 7fe8bc83f700 20 rgw::auth::s3::S3AnonymousEngine
granted access
2018-07-12 15:43:05.332383 7fe8bc83f700 20 rgw::auth::s3::AWSAuthStrategy
granted access
2018-07-12 15:43:05.332384 7fe8bc83f700  2 req 18326229:0.51:s3:HEAD
/:list_buckets:normalizing buckets and tenants
2018-07-12 15:43:05.332386 7fe8bc83f700 10 s->object= s->bucket=
2018-07-12 15:43:05.332387 7fe8bc83f700  2 req 18326229:0.54:s3:HEAD
/:list_buckets:init permissions
2018-07-12 15:43:05.332400 7fe8bc83f700  2 req 18326229:0.66:s3:HEAD
/:list_buckets:recalculating target
2018-07-12 15:43:05.332402 7fe8bc83f700  2 req 18326229:0.68:s3:HEAD
/:list_buckets:reading permissions
2018-07-12 15:43:05.332403 7fe8bc83f700  2 req 18326229:0.70:s3:HEAD
/:list_buckets:init op
[root@SH-IDC1-10-5-30-221 ceph]# grep 7fe8bc83f700
ceph-client.rgw.SH-IDC1-10-5-30-221.log | grep http_status=500 -B 50 -A 50
2018-07-12 15:42:15.499588 7fe8bc83f700 10 get_canon_resource():
dest=/yesonggao.hdd77/2217856
2018-07-12 15:42:15.499589 7fe8bc83f700 10 string_to_sign:
2018-07-12 15:42:15.499611 7fe8bc83f700 15 string_to_sign=PUT
2018-07-12 15:42:15.499620 7fe8bc83f700 15 server
signature=14+b6nTNy3cPjVxFlqhRY+hahsA=
2018-07-12 15:42:15.499620 7fe8bc83f700 15 client
signature=14+b6nTNy3cPjVxFlqhRY+hahsA=
2018-07-12 15:42:15.499621 7fe8bc83f700 15 compare=0
2018-07-12 15:42:15.499623 7fe8bc83f700 20 rgw::auth::s3::LocalEngine
granted access
2018-07-12 15:42:15.499624 7fe8bc83f700 20 rgw::auth::s3::AWSAuthStrategy
granted access
2018-07-12 15:42:15.499626 7fe8bc83f700  2 req 18325783:0.000112:s3:PUT
/yesonggao.hdd77/2217856:put_obj:normalizing buckets and tenants
2018-07-12 15:42:15.499628 7fe8bc83f700 10 s->object=2217856
s->bucket=yesonggao.hdd77
2018-07-12 15:42:15.499630 7fe8bc83f700  2 req 18325783:0.000116:s3:PUT
/yesonggao.hdd77/2217856:put_obj:init permissions
2018-07-12 15:42:15.499642 7fe8bc83f700 15 decode_policy Read
AccessControlPolicyhttp://s3.amazonaws.com/doc/2006-03-01/";>yesonggaoyesonggaohttp://www.w3.org/2001/XMLSchema-instance";
xsi:type="CanonicalUser">yesonggaoyesonggaoFULL_CONTROL
2018-07-12 15:42:15.499658 7fe8bc83f700  2 req 18325783:0.000144:s3:PUT
/yesonggao.hdd77/2217856:put_obj:recalculating target
2018-07-12 15:42:15.499662 7fe8bc83f700  2 req 18325783:0.000148:s3:PUT
/yesonggao.hdd77/2217856:put_obj:reading permissions
2018-07-12 15:42:15.499664 7fe8bc83f700  2 req 18325783:0.000150:s3:PUT
/yesonggao.hdd77/2217856:put_obj:init op
2018-07-12 15:42:15.499666 7fe8bc83f700  2 req 18325783:0.000152:s3:PUT
/yesonggao.hdd77/2217856:put_obj:verifying op mask
2018-07-12 15:42:15.499667 7fe8bc83f700 20 required_mask= 2 user.op_mask=7
2018-07-12 15:42:15.499668 7fe8bc83f700  2 req 18325783:0.000154:s3:P

Re: [ceph-users] PGs stuck peering (looping?) after upgrade to Luminous.

2018-07-12 Thread Magnus Grönlund
Hi list,

Things went from bad to worse. I tried to upgrade some OSDs to Luminous to
see if that could help, but that didn't appear to make any difference.
But for each restarted OSD there were a few PGs that the OSD seemed to
“forget”, and the number of undersized PGs grew until some PGs had been
“forgotten” by all 3 acting OSDs and became stale, even though all OSDs
(and their disks) were available.
Then the OSDs grew so big that the servers ran out of memory (48GB per
server with 10 2TB disks per server) and started killing the OSDs…
All OSDs were then shut down to try and preserve at least some data on the
disks, but maybe it is too late?

/Magnus

2018-07-11 21:10 GMT+02:00 Magnus Grönlund :

> Hi Paul,
>
> No all OSDs are still jewel , the issue started before I had even started
> to upgrade the first OSD and they don't appear to be flapping.
> ceph -w shows a lot of slow request etc, but nothing unexpected as far as
> I can tell considering the state the cluster is in.
>
> 2018-07-11 20:40:09.396642 osd.37 [WRN] 100 slow requests, 2 included
> below; oldest blocked for > 25402.278824 secs
> 2018-07-11 20:40:09.396652 osd.37 [WRN] slow request 1920.957326 seconds
> old, received at 2018-07-11 20:08:08.439214: osd_op(client.73540057.0:8289463
> 2.e57b3e32 (undecoded) ack+ondisk+retry+write+known_if_redirected
> e160294) currently waiting for peered
> 2018-07-11 20:40:09.396660 osd.37 [WRN] slow request 1920.048094 seconds
> old, received at 2018-07-11 20:08:09.348446: osd_op(client.671628641.0:998704
> 2.42f88232 (undecoded) ack+ondisk+retry+write+known_if_redirected
> e160475) currently waiting for peered
> 2018-07-11 20:40:10.397008 osd.37 [WRN] 100 slow requests, 2 included
> below; oldest blocked for > 25403.279204 secs
> 2018-07-11 20:40:10.397017 osd.37 [WRN] slow request 1920.043860 seconds
> old, received at 2018-07-11 20:08:10.353060: osd_op(client.231731103.0:1007729
> 3.e0ff5786 (undecoded) ondisk+write+known_if_redirected e137428)
> currently waiting for peered
> 2018-07-11 20:40:10.397023 osd.37 [WRN] slow request 1920.034101 seconds
> old, received at 2018-07-11 20:08:10.362819: osd_op(client.207458703.0:2000292
> 3.a8143b86 (undecoded) ondisk+write+known_if_redirected e137428)
> currently waiting for peered
> 2018-07-11 20:40:10.790573 mon.0 [INF] pgmap 4104 pgs: 5 down+peering,
> 1142 peering, 210 remapped+peering, 5 active+recovery_wait+degraded, 1551
> active+clean, 2 activating+undersized+degraded+remapped, 15
> active+remapped+backfilling, 178 unknown, 1 active+remapped, 3
> activating+remapped, 78 active+undersized+degraded+remapped+backfill_wait,
> 6 active+recovery_wait+degraded+remapped, 3 
> undersized+degraded+remapped+backfill_wait+peered,
> 5 active+undersized+degraded+remapped+backfilling, 295
> active+remapped+backfill_wait, 3 active+recovery_wait+undersized+degraded,
> 21 activating+undersized+degraded, 559 active+undersized+degraded, 4
> remapped, 17 undersized+degraded+peered, 1 
> active+recovery_wait+undersized+degraded+remapped;
> 13439 GB data, 42395 GB used, 160 TB / 201 TB avail; 4069 B/s rd, 746 kB/s
> wr, 5 op/s; 534753/10756032 objects degraded (4.972%); 779027/10756032
> objects misplaced (7.243%); 256 MB/s, 65 objects/s recovering
>
>
>
> There are a lot of things in the OSD-log files that I'm unfamiliar with
> but so far I haven't found anything that has given me a clue on how to fix
> the issue.
> BTW restarting a OSD doesn't seem to help, on the contrary, that sometimes
> results in PGs beeing stuck undersized!
> I have attaced a osd-log from when a OSD i restarted started up.
>
> Best regards
> /Magnus
>
>
> 2018-07-11 20:39 GMT+02:00 Paul Emmerich :
>
>> Did you finish the upgrade of the OSDs? Are OSDs flapping? (ceph -w) Is
>> there anything weird in the OSDs' log files?
>>
>>
>> Paul
>>
>> 2018-07-11 20:30 GMT+02:00 Magnus Grönlund :
>>
>>> Hi,
>>>
>>> Started to upgrade a ceph-cluster from Jewel (10.2.10) to Luminous
>>> (12.2.6)
>>>
>>> After upgrading and restarting the mons everything looked OK, the mons
>>> had quorum, all OSDs where up and in and all the PGs where active+clean.
>>> But before I had time to start upgrading the OSDs it became obvious that
>>> something had gone terribly wrong.
>>> All of a sudden 1600 out of 4100 PGs where inactive and 40% of the data
>>> was misplaced!
>>>
>>> The mons appears OK and all OSDs are still up and in, but a few hours
>>> later there was still 1483 pgs stuck inactive, essentially all of them in
>>> peering!
>>> Investigating one of the stuck PGs it appears to be looping between
>>> “inactive”, “remapped+peering” and “peering” and the epoch number is rising
>>> fast, see the attached pg query outputs.
>>>
>>> We really can’t afford to loose the cluster or the data so any help or
>>> suggestions on how to debug or fix this issue would be very, very
>>> appreciated!
>>>
>>>
>>> health: HEALTH_ERR
>>> 1483 pgs are stuck inactive for more than 60 seconds
>>> 542 pgs backfill_wait
>>>

Re: [ceph-users] PGs stuck peering (looping?) after upgrade to Luminous.

2018-07-12 Thread David Majchrzak
Hi/Hej Magnus,

We had a similar issue going from latest Hammer to Jewel (so it might not be 
applicable for you), with PGs stuck peering / data misplaced, right after 
updating all mons to the latest Jewel at that time, 10.2.10.

Finally, setting the require_jewel_osds flag put everything back in place (we 
were going to do this after restarting all OSDs, following the docs/changelogs).

What does your ceph health detail look like?

Did you perform any other commands after starting your mon upgrade? Any 
commands that might change the crush map might cause issues AFAIK (correct me 
if I'm wrong, but I think we ran into this once) if your mons and OSDs are 
running different versions.
// david
On jul 12 2018, at 11:45 am, Magnus Grönlund  wrote:
>
> Hi list,
>
> Things went from bad to worse, tried to upgrade some OSDs to Luminous to see 
> if that could help but that didn’t appear to make any difference.
> But for each restarted OSD there was a few PGs that the OSD seemed to 
> “forget” and the number of undersized PGs grew until some PGs had been 
> “forgotten” by all 3 acting OSDs and became stale, even though all OSDs (and 
> their disks) where available.
> Then the OSDs grew so big that the servers ran out of memory (48GB per server 
> with 10 2TB-disks per server) and started killing the OSDs…
> All OSDs where then shutdown to try and preserve some data on the disks at 
> least, but maybe it is too late?
>
> /Magnus
>
> 2018-07-11 21:10 GMT+02:00 Magnus Grönlund  (mailto:mag...@gronlund.se)>:
> > Hi Paul,
> >
> > No all OSDs are still jewel , the issue started before I had even started 
> > to upgrade the first OSD and they don't appear to be flapping.
> > ceph -w shows a lot of slow request etc, but nothing unexpected as far as I 
> > can tell considering the state the cluster is in.
> >
> > 2018-07-11 20:40:09.396642 osd.37 [WRN] 100 slow requests, 2 included 
> > below; oldest blocked for > 25402.278824 secs
> > 2018-07-11 20:40:09.396652 osd.37 [WRN] slow request 1920.957326 seconds 
> > old, received at 2018-07-11 20:08:08.439214: 
> > osd_op(client.73540057.0:8289463 2.e57b3e32 (undecoded) 
> > ack+ondisk+retry+write+known_if_redirected e160294) currently waiting for 
> > peered
> > 2018-07-11 20:40:09.396660 osd.37 [WRN] slow request 1920.048094 seconds 
> > old, received at 2018-07-11 20:08:09.348446: 
> > osd_op(client.671628641.0:998704 2.42f88232 (undecoded) 
> > ack+ondisk+retry+write+known_if_redirected e160475) currently waiting for 
> > peered
> > 2018-07-11 20:40:10.397008 osd.37 [WRN] 100 slow requests, 2 included 
> > below; oldest blocked for > 25403.279204 secs
> > 2018-07-11 20:40:10.397017 osd.37 [WRN] slow request 1920.043860 seconds 
> > old, received at 2018-07-11 20:08:10.353060: 
> > osd_op(client.231731103.0:1007729 3.e0ff5786 (undecoded) 
> > ondisk+write+known_if_redirected e137428) currently waiting for peered
> > 2018-07-11 20:40:10.397023 osd.37 [WRN] slow request 1920.034101 seconds 
> > old, received at 2018-07-11 20:08:10.362819: 
> > osd_op(client.207458703.0:2000292 3.a8143b86 (undecoded) 
> > ondisk+write+known_if_redirected e137428) currently waiting for peered
> > 2018-07-11 20:40:10.790573 mon.0 [INF] pgmap 4104 pgs: 5 down+peering, 1142 
> > peering, 210 remapped+peering, 5 active+recovery_wait+degraded, 1551 
> > active+clean, 2 activating+undersized+degraded+remapped, 15 
> > active+remapped+backfilling, 178 unknown, 1 active+remapped, 3 
> > activating+remapped, 78 active+undersized+degraded+remapped+backfill_wait, 
> > 6 active+recovery_wait+degraded+remapped, 3 
> > undersized+degraded+remapped+backfill_wait+peered, 5 
> > active+undersized+degraded+remapped+backfilling, 295 
> > active+remapped+backfill_wait, 3 active+recovery_wait+undersized+degraded, 
> > 21 activating+undersized+degraded, 559 active+undersized+degraded, 4 
> > remapped, 17 undersized+degraded+peered, 1 
> > active+recovery_wait+undersized+degraded+remapped; 13439 GB data, 42395 GB 
> > used, 160 TB / 201 TB avail; 4069 B/s rd, 746 kB/s wr, 5 op/s; 
> > 534753/10756032 objects degraded (4.972%); 779027/10756032 objects 
> > misplaced (7.243%); 256 MB/s, 65 objects/s recovering
> >
> >
> >
> >
> > There are a lot of things in the OSD-log files that I'm unfamiliar with but 
> > so far I haven't found anything that has given me a clue on how to fix the 
> > issue.
> > BTW restarting a OSD doesn't seem to help, on the contrary, that sometimes 
> > results in PGs beeing stuck undersized!
> > I have attaced a osd-log from when a OSD i restarted started up.
> >
> > Best regards
> > /Magnus
> >
> >
> > 2018-07-11 20:39 GMT+02:00 Paul Emmerich  > (mailto:paul.emmer...@croit.io)>:
> > > Did you finish the upgrade of the OSDs? Are OSDs flapping? (ceph -w) Is 
> > > there anything weird in the OSDs' log files?
> > >
> > >
> > >
> > > Paul
> > >
> > > 2018-07-11 20:30 GMT+02:00 Magnus Grönlund  > > (mailto:mag...@gronlund.se)>:
> > > > Hi,
> > > >
> > > > Started to upgrade a ceph-cluster from Jewel (10.2.10) to Luminou

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo


On 12/07/18 11:20, Alessandro De Salvo wrote:



On 12/07/18 10:58, Dan van der Ster wrote:
On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum  
wrote:
On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo 
 wrote:

OK, I found where the object is:


ceph osd map cephfs_metadata 200.
osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg
10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)


So, looking at the osds 23, 35 and 18 logs in fact I see:


osd.23:

2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 
0x9ef2b41b on

10:292cf221:::200.:head


osd.35:

2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 
0x9ef2b41b on

10:292cf221:::200.:head


osd.18:

2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 
0x9ef2b41b on

10:292cf221:::200.:head


So, basically the same error everywhere.

I'm trying to issue a repair of the pg 10.14, but I'm not sure if 
it may

help.

No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), 
and
no disk problems anywhere. No relevant errors in syslogs, the hosts 
are

just fine. I cannot exclude an error on the RAID controllers, but 2 of
the OSDs with 10.14 are on a SAN system and one on a different one, 
so I

would tend to exclude they both had (silent) errors at the same time.


That's fairly distressing. At this point I'd probably try extracting 
the object using ceph-objectstore-tool and seeing if it decodes 
properly as an mds journal. If it does, you might risk just putting 
it back in place to overwrite the crc.



Wouldn't it be easier to scrub repair the PG to fix the crc?


this is what I already instructed the cluster to do, a deep scrub, but 
I'm not sure it could repair in case all replicas are bad, as it seems 
to be the case.


I finally managed (with Dan's help) to perform the deep scrub on pg 
10.14, but the deep scrub did not detect anything wrong. Trying to 
repair 10.14 also has no effect.

Still, trying to access the object I get in the OSDs:

2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log 
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
10:292cf221:::200.:head


Was the deep scrub supposed to detect the wrong crc? If yes, then it sounds 
like a bug.

Can I force the repair somehow?
Thanks,

   Alessandro




Alessandro, did you already try a deep-scrub on pg 10.14?


I'm waiting for the cluster to do that, I've sent it earlier this 
morning.



  I expect
it'll show an inconsistent object. Though, I'm unsure if repair will
correct the crc given that in this case *all* replicas have a bad crc.


Exactly, this is what I wonder too.
Cheers,

    Alessandro



--Dan

However, I'm also quite curious how it ended up that way, with a 
checksum mismatch but identical data (and identical checksums!) 
across the three replicas. Have you previously done some kind of 
scrub repair on the metadata pool? Did the PG perhaps get backfilled 
due to cluster changes?

-Greg



Thanks,


  Alessandro



On 11/07/18 18:56, John Spray wrote:

On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
 wrote:

Hi John,

in fact I get an I/O error by hand too:


rados get -p cephfs_metadata 200. 200.
error getting cephfs_metadata/200.: (5) Input/output error

Next step would be to go look for corresponding errors on your OSD
logs, system logs, and possibly also check things like the SMART
counters on your hard drives for possible root causes.

John




Can this be recovered someway?

Thanks,


   Alessandro


On 11/07/18 18:33, John Spray wrote:

On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
 wrote:

Hi,

after the upgrade to luminous 12.2.6 today, all our MDSes have 
been

marked as damaged. Trying to restart the instances only result in
standby MDSes. We currently have 2 filesystems active and 2 
MDSes each.


I found the following error messages in the mon:


mds.0 :6800/2412911269 down:damaged
mds.1 :6800/830539001 down:damaged
mds.0 :6800/4080298733 down:damaged


Whenever I try to force the repaired state with ceph mds repaired
: I get something like this in the MDS logs:


2018-07-11 13:20:41.597970 7ff7e010e700  0 
mds.1.journaler.mdlog(ro)

error getting journal off disk
2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) 
log

[ERR] : Error recovering journal 0x201: (5) Input/output error

An EIO reading the journal header is pretty scary. The MDS itself
probably can't tell you much more about this: you need to dig down
into the RADOS layer.  Try reading the 200. object (that
happens to be the rank 0 journal header, every CephFS filesystem
should have one) using the `rados` command line tool.

John



Any attempt of running the journal export r

Re: [ceph-users] PGs stuck peering (looping?) after upgrade to Luminous.

2018-07-12 Thread Magnus Grönlund
Hej David and thanks!

That was indeed the magic trick: no more peering, stale or down PGs.

I upgraded the Ceph packages on the hosts, restarted the OSDs and then ran "ceph
osd require-osd-release luminous".
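
For the archives, the rough order that worked here, sketched out; the package
upgrade step is a placeholder for whatever your distribution uses:

ceph osd set noout
# upgrade the ceph packages on each OSD host, then restart its OSDs
systemctl restart ceph-osd.target
# only once *all* OSDs are running Luminous:
ceph osd require-osd-release luminous
ceph osd unset noout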

/Magnus

2018-07-12 12:05 GMT+02:00 David Majchrzak :

> Hi/Hej Magnus,
>
> We had a similar issue going from latest hammer to jewel (so might not be
> applicable for you), with PGs stuck peering / data misplaced, right after
> updating all mons to latest jewel at that time 10.2.10.
>
> Finally setting the require_jewel_osds put everything back in place ( we
> were going to do this after restarting all OSDs, following the
> docs/changelogs ).
>
> What does your ceph health detail look like?
>
> Did you perform any other commands after starting your mon upgrade? Any
> commands that might change the crush-map might cause issues AFAIK (correct
> me if im wrong, but i think we ran into this once) if your mons and osds
> are different versions.
>
> // david
>
> On jul 12 2018, at 11:45 am, Magnus Grönlund  wrote:
>
>
> Hi list,
>
> Things went from bad to worse, tried to upgrade some OSDs to Luminous to
> see if that could help but that didn’t appear to make any difference.
> But for each restarted OSD there was a few PGs that the OSD seemed to
> “forget” and the number of undersized PGs grew until some PGs had been
> “forgotten” by all 3 acting OSDs and became stale, even though all OSDs
> (and their disks) where available.
> Then the OSDs grew so big that the servers ran out of memory (48GB per
> server with 10 2TB-disks per server) and started killing the OSDs…
> All OSDs where then shutdown to try and preserve some data on the disks at
> least, but maybe it is too late?
>
> /Magnus
>
> 2018-07-11 21:10 GMT+02:00 Magnus Grönlund :
>
> Hi Paul,
>
> No all OSDs are still jewel , the issue started before I had even started
> to upgrade the first OSD and they don't appear to be flapping.
> ceph -w shows a lot of slow request etc, but nothing unexpected as far as
> I can tell considering the state the cluster is in.
>
> 2018-07-11 20:40:09.396642 osd.37 [WRN] 100 slow requests, 2 included
> below; oldest blocked for > 25402.278824 secs
> 2018-07-11 20:40:09.396652 osd.37 [WRN] slow request 1920.957326 seconds
> old, received at 2018-07-11 20:08:08.439214: osd_op(client.73540057.0:8289463
> 2.e57b3e32 (undecoded) ack+ondisk+retry+write+known_if_redirected
> e160294) currently waiting for peered
> 2018-07-11 20:40:09.396660 osd.37 [WRN] slow request 1920.048094 seconds
> old, received at 2018-07-11 20:08:09.348446: osd_op(client.671628641.0:998704
> 2.42f88232 (undecoded) ack+ondisk+retry+write+known_if_redirected
> e160475) currently waiting for peered
> 2018-07-11 20:40:10.397008 osd.37 [WRN] 100 slow requests, 2 included
> below; oldest blocked for > 25403.279204 secs
> 2018-07-11 20:40:10.397017 osd.37 [WRN] slow request 1920.043860 seconds
> old, received at 2018-07-11 20:08:10.353060: osd_op(client.231731103.0:1007729
> 3.e0ff5786 (undecoded) ondisk+write+known_if_redirected e137428)
> currently waiting for peered
> 2018-07-11 20:40:10.397023 osd.37 [WRN] slow request 1920.034101 seconds
> old, received at 2018-07-11 20:08:10.362819: osd_op(client.207458703.0:2000292
> 3.a8143b86 (undecoded) ondisk+write+known_if_redirected e137428)
> currently waiting for peered
> 2018-07-11 20:40:10.790573 mon.0 [INF] pgmap 4104 pgs: 5 down+peering,
> 1142 peering, 210 remapped+peering, 5 active+recovery_wait+degraded, 1551
> active+clean, 2 activating+undersized+degraded+remapped, 15
> active+remapped+backfilling, 178 unknown, 1 active+remapped, 3
> activating+remapped, 78 active+undersized+degraded+remapped+backfill_wait,
> 6 active+recovery_wait+degraded+remapped, 3 
> undersized+degraded+remapped+backfill_wait+peered,
> 5 active+undersized+degraded+remapped+backfilling, 295
> active+remapped+backfill_wait, 3 active+recovery_wait+undersized+degraded,
> 21 activating+undersized+degraded, 559 active+undersized+degraded, 4
> remapped, 17 undersized+degraded+peered, 1 
> active+recovery_wait+undersized+degraded+remapped;
> 13439 GB data, 42395 GB used, 160 TB / 201 TB avail; 4069 B/s rd, 746 kB/s
> wr, 5 op/s; 534753/10756032 objects degraded (4.972%); 779027/10756032
> objects misplaced (7.243%); 256 MB/s, 65 objects/s recovering
>
>
>
> There are a lot of things in the OSD-log files that I'm unfamiliar with
> but so far I haven't found anything that has given me a clue on how to fix
> the issue.
> BTW restarting a OSD doesn't seem to help, on the contrary, that sometimes
> results in PGs beeing stuck undersized!
> I have attaced a osd-log from when a OSD i restarted started up.
>
> Best regards
> /Magnus
>
>
> 2018-07-11 20:39 GMT+02:00 Paul Emmerich :
>
> Did you finish the upgrade of the OSDs? Are OSDs flapping? (ceph -w) Is
> there anything weird in the OSDs' log files?
>
>
> Paul
>
> 2018-07-11 20:30 GMT+02:00 Magnus Grönlund :
>
> Hi,
>
> Started to upgrade a ceph-cluster from Jewel (10.2.10) to Luminous (12.2.6)
>
> A

Re: [ceph-users] KPIs for Ceph/OSD client latency / deepscrub latency overhead

2018-07-12 Thread Paul Emmerich
2018-07-12 8:37 GMT+02:00 Marc Schöchlin :

>
> In a first step i just would like to have  two simple KPIs which describe
> a average/aggregated write/read latency of these statistics.
>
> Are there tools/other functionalities which provide this in a simple way?
>
It's one of the main KPIs our management software collects and visualizes:
https://croit.io

IIRC some of the other stats collectors also already collect these metrics;
at least I recall using them with Telegraf/InfluxDB.
It's also really easy to collect yourself (I once wrote it in bash for a
custom collector for a client); the only hurdle is that you need to calculate
the derivative, because the counters report a running average.
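
As a minimal sketch (osd.0 and the jq invocation are just for illustration):
each latency counter exposes a cumulative sum and avgcount, so the lifetime
average is sum/avgcount and a windowed value comes from the deltas between two
samples.

ceph daemon osd.0 perf dump | jq '.osd.op_r_latency, .osd.op_w_latency'
# windowed average latency = (sum_t2 - sum_t1) / (avgcount_t2 - avgcount_t1)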
I've some slides from our training about these metrics:
https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf
(Not much in there, it's more of a hands-on lab)



Paul



> Regards
> Marc
>
> Am 11.07.2018 um 18:42 schrieb Paul Emmerich:
>
> Hi,
>
> from experience: commit/apply_latency are not good metrics, the only good
> thing about them is that they are really easy to track.
> But we have found them to be almost completely useless in the real world.
>
> We track the op_*_latency metrics from perf dump and found them to be very
> helpful, they are more annoying to track due to their
> format. The median OSD is a good indicator and so is the slowest OSD.
>
> Paul
>
> 2018-07-11 17:50 GMT+02:00 Marc Schöchlin :
>
>> Hello ceph-users and ceph-devel list,
>>
>> we got in production with our new shiny luminous (12.2.5) cluster.
>> This cluster runs SSD and HDD based OSD pools.
>>
>> To ensure the service quality of the cluster and to have a baseline for
>> client latency optimization (i.e. in the area of deepscrub optimization)
>> we would like to have statistics about the client interaction latency of
>> our cluster.
>>
>> Which measures can be suitable to get such a "aggregated by
>> device_class" average latency KPI?
>> Also a percentile rank would be great (% amount of requests serviced by
>> < 5ms,  % amount of requests serviced by  < 20ms, % amount of requests
>> serviced by  < 50ms, ...)
>>
>> The following command provides a overview over the commit latency of the
>> osds but no average latency and no information about the device_class.
>>
>> ceph osd perf -f json-pretty
>>
>> {
>> "osd_perf_infos": [
>> {
>> "id": 71,
>> "perf_stats": {
>> "commit_latency_ms": 2,
>> "apply_latency_ms": 0
>> }
>> },
>> {
>> "id": 70,
>> "perf_stats": {
>> "commit_latency_ms": 3,
>> "apply_latency_ms": 0
>> }
>>
>> Device class information can be extracted of "ceph df -f json-pretty".
>>
>> But building averages of averages not seems to be a good thing  :-)
>>
>> It seems that i can get more detailed information using the "ceph daemon
>> osd. perf histogram dump" command.
>> This seems to deliver the percentile rank information in a good detail
>> level.
>> (http://docs.ceph.com/docs/luminous/dev/perf_histograms/)
>>
>> My questions:
>>
>> Are there tools to analyze and aggregate these measures for a group of
>> OSDs?
>>
>> Which measures should i use as a baseline for client latency optimization?
>>
>> What is the time horizon of these measures?
>>
>> I sometimes see messages like this in my log.
>> This seems to be sourced in deep scrubbing. How can find the
>> source/solution of this problem?
>>
>> 2018-07-11 16:58:55.064497 mon.ceph-mon-s43 [INF] Cluster is now healthy
>> 2018-07-11 16:59:15.141214 mon.ceph-mon-s43 [WRN] Health check failed: 4
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:25.037707 mon.ceph-mon-s43 [WRN] Health check update: 9
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:30.038001 mon.ceph-mon-s43 [WRN] Health check update:
>> 23 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:35.210900 mon.ceph-mon-s43 [WRN] Health check update:
>> 27 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:45.038718 mon.ceph-mon-s43 [WRN] Health check update:
>> 29 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:50.038955 mon.ceph-mon-s43 [WRN] Health check update:
>> 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:55.281279 mon.ceph-mon-s43 [WRN] Health check update:
>> 44 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 17:00:00.000121 mon.ceph-mon-s43 [WRN] overall HEALTH_WARN 12
>> slow requests are blocked > 32 sec
>> 2018-07-11 17:00:05.039677 mon.ceph-mon-s43 [WRN] Health check update:
>> 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 17:00:09.329897 mon.ceph-mon-s43 [INF] Health check cleared:
>> REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
>> 2018-07-11 17:00:09.329919 mon.ceph-mon-s43 [INF] Cluster is now healthy
>>
>> Regards
>> Marc
>>
>>

Re: [ceph-users] MDS damaged

2018-07-12 Thread Paul Emmerich
This might seem like a stupid suggestion, but: have you tried to restart
the OSDs?

I've also encountered some random CRC errors that only showed up when
trying to read an object (not on scrubbing), and that magically disappeared
after restarting the OSD.

However, in my case it was clearly related to
https://tracker.ceph.com/issues/22464 which doesn't
seem to be the issue here.

Paul

2018-07-12 13:53 GMT+02:00 Alessandro De Salvo <
alessandro.desa...@roma1.infn.it>:

>
> Il 12/07/18 11:20, Alessandro De Salvo ha scritto:
>
>
>>
>> Il 12/07/18 10:58, Dan van der Ster ha scritto:
>>
>>> On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum 
>>> wrote:
>>>
 On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo <
 alessandro.desa...@roma1.infn.it> wrote:

> OK, I found where the object is:
>
>
> ceph osd map cephfs_metadata 200.
> osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg
> 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
>
>
> So, looking at the osds 23, 35 and 18 logs in fact I see:
>
>
> osd.23:
>
> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
> 10:292cf221:::200.:head
>
>
> osd.35:
>
> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
> 10:292cf221:::200.:head
>
>
> osd.18:
>
> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
> 10:292cf221:::200.:head
>
>
> So, basically the same error everywhere.
>
> I'm trying to issue a repair of the pg 10.14, but I'm not sure if it
> may
> help.
>
> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and
> no disk problems anywhere. No relevant errors in syslogs, the hosts are
> just fine. I cannot exclude an error on the RAID controllers, but 2 of
> the OSDs with 10.14 are on a SAN system and one on a different one, so
> I
> would tend to exclude they both had (silent) errors at the same time.
>

 That's fairly distressing. At this point I'd probably try extracting
 the object using ceph-objectstore-tool and seeing if it decodes properly as
 an mds journal. If it does, you might risk just putting it back in place to
 overwrite the crc.

 Wouldn't it be easier to scrub repair the PG to fix the crc?
>>>
>>
>> this is what I already instructed the cluster to do, a deep scrub, but
>> I'm not sure it could repair in case all replicas are bad, as it seems to
>> be the case.
>>
>
> I finally managed (with the help of Dan), to perform the deep-scrub on pg
> 10.14, but the deep scrub did not detect anything wrong. Also trying to
> repair 10.14 has no effect.
> Still, trying to access the object I get in the OSDs:
>
> 2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR]
> : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
> 10:292cf221:::200.:head
>
> Was deep-scrub supposed to detect the wrong crc? If yes, them it sounds
> like a bug.
> Can I force the repair someway?
> Thanks,
>
>Alessandro
>
>
>>
>>> Alessandro, did you already try a deep-scrub on pg 10.14?
>>>
>>
>> I'm waiting for the cluster to do that, I've sent it earlier this morning.
>>
>>   I expect
>>> it'll show an inconsistent object. Though, I'm unsure if repair will
>>> correct the crc given that in this case *all* replicas have a bad crc.
>>>
>>
>> Exactly, this is what I wonder too.
>> Cheers,
>>
>> Alessandro
>>
>>
>>> --Dan
>>>

[ceph-users] OSD tuning no longer required?

2018-07-12 Thread Robert Stanford
 I saw this in the Luminous release notes:

 "Each OSD now adjusts its default configuration based on whether the
backing device is an HDD or SSD. Manual tuning generally not required"

 Which tuning in particular?  The ones in my configuration are
osd_op_threads, osd_disk_threads, osd_recovery_max_active,
osd_op_thread_suicide_timeout, and osd_crush_chooseleaf_type, among
others.  Can I rip these out when I upgrade to
Luminous?
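
For context, a rough sketch of how I'd check what is actually in effect before
removing anything (assuming access to the admin socket on an OSD host, with
osd.0 only as an example):

ceph daemon osd.0 config diff
ceph daemon osd.0 config get osd_op_threads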


Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo

Unfortunately yes, all the OSDs were restarted a few times, but no change.

Thanks,


    Alessandro


On 12/07/18 15:55, Paul Emmerich wrote:
This might seem like a stupid suggestion, but: have you tried to 
restart the OSDs?


I've also encountered some random CRC errors that only showed up when 
trying to read an object,
but not on scrubbing, that magically disappeared after restarting the 
OSD.


However, in my case it was clearly related to 
https://tracker.ceph.com/issues/22464 which doesn't

seem to be the issue here.

Paul


Re: [ceph-users] RADOSGW err=Input/output error

2018-07-12 Thread Drew Weaver
I was never actually able to fix this; I just moved on to something else.

I guess I will try 13 when it's released and see if maybe the bug has been
fixed.

-Drew

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Will 
Zhao
Sent: Thursday, July 12, 2018 4:32 AM
To: respo...@ifastnet.com
Cc: Ceph Users 
Subject: Re: [ceph-users] RADOSGW err=Input/output error

Hi :

I use libs3 to run the test. The network is IB (InfiniBand).
The error from libcurl is the following:

== Info: Operation too slow. Less than 1 bytes/sec transferred the last 15 
seconds

== Info: Closing connection 766

and a full request error in rgw is as follows:

2018-07-12 15:42:30.501074 7fe8bc83f700  1 civetweb: 0x7fe940ebf000: 
10.5.131.193 - - [12/Jul/2018:15:42:15 +0800] "PUT /yesonggao.hdd77/2217856 
HTTP/1.1" 1 0 - Mozilla/4.0 (Compatible; s3; libs3 2.0; Linux x86_64)
2018-07-12 15:43:05.332318 7fe8bc83f700 20 HTTP_HOST=localhost
2018-07-12 15:43:05.332326 7fe8bc83f700 20 HTTP_VERSION=1.1
2018-07-12 15:43:05.332327 7fe8bc83f700 20 REMOTE_ADDR=10.5.131.193
2018-07-12 15:43:05.332328 7fe8bc83f700 20 REQUEST_METHOD=HEAD
2018-07-12 15:43:05.332330 7fe8bc83f700 20 REQUEST_URI=/
2018-07-12 15:43:05.332331 7fe8bc83f700 20 SCRIPT_URI=/
2018-07-12 15:43:05.332331 7fe8bc83f700 20 SERVER_PORT=80
2018-07-12 15:43:05.332333 7fe8bc83f700  1 == starting new request 
req=0x7fe8bc839110 =
2018-07-12 15:43:05.332344 7fe8bc83f700  2 req 18326229:0.10::HEAD 
/::initializing for trans_id = tx00117a2d5-005b470689-d41b-default
2018-07-12 15:43:05.332351 7fe8bc83f700 10 rgw api priority: s3=5 s3website=4
2018-07-12 15:43:05.332352 7fe8bc83f700 10 host=localhost
2018-07-12 15:43:05.332354 7fe8bc83f700 20 subdomain= domain= 
in_hosted_domain=0 in_hosted_domain_s3website=0
2018-07-12 15:43:05.332355 7fe8bc83f700 20 final domain/bucket subdomain= 
domain= in_hosted_domain=0 in_hosted_domain_s3website=0 s->info.domain= 
s->info.request_uri=/
2018-07-12 15:43:05.332367 7fe8bc83f700 20 get_handler 
handler=26RGWHandler_REST_Service_S3
2018-07-12 15:43:05.332370 7fe8bc83f700 10 handler=26RGWHandler_REST_Service_S3
2018-07-12 15:43:05.332371 7fe8bc83f700  2 req 18326229:0.38:s3:HEAD 
/::getting op 3
2018-07-12 15:43:05.332373 7fe8bc83f700 10 op=26RGWListBuckets_ObjStore_S3
2018-07-12 15:43:05.332374 7fe8bc83f700  2 req 18326229:0.41:s3:HEAD 
/:list_buckets:verifying requester
2018-07-12 15:43:05.332376 7fe8bc83f700 20 
rgw::auth::StrategyRegistry::s3_main_strategy_t: trying 
rgw::auth::s3::AWSAuthStrategy
2018-07-12 15:43:05.332378 7fe8bc83f700 20 rgw::auth::s3::AWSAuthStrategy: 
trying rgw::auth::s3::S3AnonymousEngine
2018-07-12 15:43:05.332381 7fe8bc83f700 20 rgw::auth::s3::S3AnonymousEngine 
granted access
2018-07-12 15:43:05.332383 7fe8bc83f700 20 rgw::auth::s3::AWSAuthStrategy 
granted access
2018-07-12 15:43:05.332384 7fe8bc83f700  2 req 18326229:0.51:s3:HEAD 
/:list_buckets:normalizing buckets and tenants
2018-07-12 15:43:05.332386 7fe8bc83f700 10 s->object= s->bucket=
2018-07-12 15:43:05.332387 7fe8bc83f700  2 req 18326229:0.54:s3:HEAD 
/:list_buckets:init permissions
2018-07-12 15:43:05.332400 7fe8bc83f700  2 req 18326229:0.66:s3:HEAD 
/:list_buckets:recalculating target
2018-07-12 15:43:05.332402 7fe8bc83f700  2 req 18326229:0.68:s3:HEAD 
/:list_buckets:reading permissions
2018-07-12 15:43:05.332403 7fe8bc83f700  2 req 18326229:0.70:s3:HEAD 
/:list_buckets:init op
[root@SH-IDC1-10-5-30-221 ceph]# grep 7fe8bc83f700 
ceph-client.rgw.SH-IDC1-10-5-30-221.log | grep http_status=500 -B 50 -A 50
2018-07-12 15:42:15.499588 7fe8bc83f700 10 get_canon_resource(): 
dest=/yesonggao.hdd77/2217856
2018-07-12 15:42:15.499589 7fe8bc83f700 10 string_to_sign:
2018-07-12 15:42:15.499611 7fe8bc83f700 15 string_to_sign=PUT
2018-07-12 15:42:15.499620 7fe8bc83f700 15 server 
signature=14+b6nTNy3cPjVxFlqhRY+hahsA=
2018-07-12 15:42:15.499620 7fe8bc83f700 15 client 
signature=14+b6nTNy3cPjVxFlqhRY+hahsA=
2018-07-12 15:42:15.499621 7fe8bc83f700 15 compare=0
2018-07-12 15:42:15.499623 7fe8bc83f700 20 rgw::auth::s3::LocalEngine granted 
access
2018-07-12 15:42:15.499624 7fe8bc83f700 20 rgw::auth::s3::AWSAuthStrategy 
granted access
2018-07-12 15:42:15.499626 7fe8bc83f700  2 req 18325783:0.000112:s3:PUT 
/yesonggao.hdd77/2217856:put_obj:normalizing buckets and tenants
2018-07-12 15:42:15.499628 7fe8bc83f700 10 s->object=2217856 
s->bucket=yesonggao.hdd77
2018-07-12 15:42:15.499630 7fe8bc83f700  2 req 18325783:0.000116:s3:PUT 
/yesonggao.hdd77/2217856:put_obj:init permissions
2018-07-12 15:42:15.499642 7fe8bc83f700 15 decode_policy Read 
AccessControlPolicyhttp://s3.amazonaws.com/doc/2006-03-01/";>yesonggaoyesonggaohttp://www.w3.org/2001/XMLSchema-instance"; 
xsi:type="CanonicalUser">yesonggaoyesonggaoFULL_CONTROL
2018-07-12 15:42:15.499658 7fe8bc83f700  2 req 18325783:0.000144:s3:PUT 
/yesonggao.hdd77/2217856:put_obj:recalculating target
2018-07-12 15:42:15.499662 7fe8bc83f700  2 req 18325783:0.000148:s3:PUT 
/yesonggao.

Re: [ceph-users] Increase queue_depth in KVM

2018-07-12 Thread Damian Dabrowski
Hello,

Steffen, thanks for your reply. Sorry, I was on holiday; now I'm back
and still digging into my problem.. :(


I've read thousands of google links but can't find anything which could
help me.

- tried all qemu drive IO (io=) and cache (cache=) modes; nothing comes even
close to the results I'm getting outside KVM.
- enabling blk-mq inside the KVM guest didn't help
- enabling iothreads didn't help
- the 'queues' parameter in my libvirt schemas can only be applied to
'virtio-serial'; I can't use it with virtio-scsi or virtio-blk
- I've seen some people using 'num_queues' but I don't have this parameter
in my schemas (libvirt version = 1.3.1, qemu version = 2.5.0)



So, is there really no way to increase the queue depth of an rbd device in a
KVM domain, or any other way to achieve results similar to those obtained
outside KVM? :/

wt., 26 cze 2018 o 15:19 Steffen Winther Sørensen 
napisał(a):

>
>
> > On 26 Jun 2018, at 14.04, Damian Dabrowski  wrote:
> >
> > Hi Stefan, thanks for reply.
> >
> > Unfortunately it didn't work.
> >
> > disk config:
> > 
> >discard='unmap'/>
> >   
> > 
> >   
> >name='volumes-nvme/volume-ce247187-a625-49f1-bacd-fc03df215395'>
> > 
> > 
> > 
> >   
> >   
> >   ce247187-a625-49f1-bacd-fc03df215395
> >   
> > 
> >
> >
> > Controller config:
> > 
> >   
> >function='0x0'/>
> > 
> >
> >
> > benchmark command: fio --randrepeat=1 --ioengine=libaio --direct=1
> --name=test --filename=test --bs=4k --iodepth=64 --size=1G
> --readwrite=randwrite --time_based --runtime=60
> --write_iops_log=write_results --numjobs=8
> >
> > And I'm still getting very low random write IOPS inside the KVM instance
> > with 8 vcores (3-5k, compared to 20k+ outside KVM)
> >
> > Maybe do You have any idea how to deal with it?
> What about trying with io=‘threads’ and/or maybe cache=’none’ or swap from
> virtio-scsi to blk-mq?
>
> Other people had similar issues, try asking ‘G’
>
> https://serverfault.com/questions/425607/kvm-guest-io-is-much-slower-than-host-io-is-that-normal
> https://wiki.mikejung.biz/KVM_/_Xen
>
> /Steffen
>
>


Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo

Some progress, and more pain...

I was able to recover the 200. object using the ceph-objectstore-tool 
on one of the OSDs (all copies are identical), but trying to re-inject it 
just with rados put gave no error while the get was still giving 
the same I/O error. So the solution was to rm the object and then put it 
again; that worked.
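
For the record, a rough sketch of the kind of sequence I mean (the OSD id,
data path and file name are only examples, the object name is the journal
header object from the earlier ceph osd map output, and the OSD has to be
stopped while ceph-objectstore-tool reads its store):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --pgid 10.14 '<object>' get-bytes /tmp/journal-header.bin
rados -p cephfs_metadata rm <object>
rados -p cephfs_metadata put <object> /tmp/journal-header.bin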


However, after restarting one of the MDSes and setting it to repaired, 
I've hit another, similar problem:



2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log 
[ERR] : error reading table object 'mds0_inotable' -5 ((5) Input/output 
error)



Can I safely try to do the same as for object 200.? Should I 
check something before trying it? Again, checking the copies of the 
object, I see identical md5sums on all the replicas.
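
For reference, locating the copies should just be a matter of:

ceph osd map cephfs_metadata mds0_inotable

and then extracting and comparing the object from each of the listed OSDs.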


Thanks,


    Alessandro


On 12/07/18 16:46, Alessandro De Salvo wrote:


Unfortunately yes, all the OSDs were restarted a few times, but no change.

Thanks,


    Alessandro




[ceph-users] Rook Deployments

2018-07-12 Thread Travis Nielsen
Any Rook users out there running Ceph in Kubernetes? We would love to hear 
about your experiences. Rook is currently hosted by the CNCF in the sandbox 
stage and we are proposing that Rook graduate to the incubating stage. Part 
of graduating is growing the user base and showing a number of users running 
Rook in production scenarios. Your support would be appreciated.

If you haven’t heard of Rook before, it’s a great time to get started. In the 
next few days we will be releasing v0.8 and are going through a final test 
pass. This would be a great time to help us find any remaining issues. The 
time after the 0.8 release will also be a great time to start your Rook 
clusters and see what it will take to run your production workloads. If you 
have any questions, the Rook Slack is a great place to get them answered.

Thanks!
Travis
https://rook.io 
https://rook.io/docs/rook/master/ 
https://github.com/rook/rook 





[ceph-users] How are you using tuned

2018-07-12 Thread Mohamad Gebai
Hi all,

I was wondering how people were using tuned with Ceph, if at all. I
think it makes sense to enable the throughput-performance profile on OSD
nodes, and maybe the network-latency profile on mon and mgr nodes. Is
anyone using a similar configuration, and do you have any thoughts on
this approach?
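
For concreteness, what I have in mind is nothing more than the following
(assuming tuned-adm and the stock profiles are installed):

# on OSD nodes
tuned-adm profile throughput-performance
# on mon/mgr nodes
tuned-adm profile network-latency
# check what is active
tuned-adm active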

Thanks,
Mohamad



Re: [ceph-users] mimic (13.2.0) and "Failed to send data to Zabbix"

2018-07-12 Thread ceph . novice
There was no change in the ZABBIX environment... I got this warning some 
minutes after the Linux and Luminous->Mimic update via YUM and a reboot of all 
the Ceph servers...

Is there anyone who also had the ZABBIX module enabled under Luminous AND then 
migrated to Mimic? If yes, does it work "ok" for you? If yes, which Linux 
OS/version are you running?
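
What I plan to check next, as a rough sketch (the server, host and key below
are just placeholders for whatever the module is configured with):

ceph zabbix config-show
ceph zabbix send ; echo $?
zabbix_sender -z <zabbix-server> -s <identifier> -k <key> -o <value> -vv ; echo $?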

-

Ok, but the reason the Module is issuing the warning is that
zabbix_sender does not exit with status 0.

You might want to check why this is. Was there a version change of
Zabbix? If so, try to trace what might have changed that causes
zabbix_sender to exit non-zero.

Wido



[ceph-users] mds daemon damaged

2018-07-12 Thread Kevin

Sorry for the long posting but trying to cover everything

I woke up to find my cephfs filesystem down. This was in the logs

2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 
0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head


I had one standby MDS, but as far as I can tell it did not fail over. 
This was in the logs


(insufficient standby MDS daemons available)

Currently my ceph looks like this
  cluster:
id: ..
health: HEALTH_ERR
1 filesystem is degraded
1 mds daemon damaged

  services:
mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
mgr: ids27(active)
mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
osd: 5 osds: 5 up, 5 in

  data:
pools:   3 pools, 202 pgs
objects: 1013k objects, 4018 GB
usage:   12085 GB used, 6544 GB / 18630 GB avail
pgs: 201 active+clean
 1   active+clean+scrubbing+deep

  io:
client:   0 B/s rd, 0 op/s rd, 0 op/s wr

I started trying to get the damaged MDS back online

Based on this page 
http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts


# cephfs-journal-tool journal export backup.bin
2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is 
unreadable
2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not 
readable, attempt object-by-object dump with `rados`

Error ((5) Input/output error)

# cephfs-journal-tool event recover_dentries summary
Events by type:
2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is 
unreadableErrors: 0


cephfs-journal-tool journal reset - (I think this command might have 
worked)


Next up, tried to reset the filesystem

ceph fs reset test-cephfs-1 --yes-i-really-mean-it

Each time same errors

2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: 
MDS_DAMAGE (was: 1 mds daemon damaged)
2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 
assigned to filesystem test-cephfs-1 as rank 0
2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 
0x200: (5) Input/output error
2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds 
daemon damaged (MDS_DAMAGE)
2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 
0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head
2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 
filesystem is degraded; 1 mds daemon damaged


Tried to 'fail' mds.ds27
# ceph mds fail ds27
# failed mds gid 1929168

The command worked, but each time I run the reset command the same errors 
as above appear


Online searches say the object read error has to be removed, but there's 
no object listed. This web page is the closest to the issue:

http://tracker.ceph.com/issues/20863

It recommends fixing the error by hand. I tried running a deep scrub on pg 
2.4; it completes, but I still have the same issue as above


The final option is to attempt removing mds.ds27. If mds.ds29 was a standby 
and has the data, it should become live. If it was not,
I assume we will lose the filesystem at this point

Why didn't the standby MDS failover?

Just looking for any way to recover the cephfs, thanks!



Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Patrick Donnelly
On Thu, Jul 12, 2018 at 2:30 PM, Kevin  wrote:
> Sorry for the long posting but trying to cover everything
>
> I woke up to find my cephfs filesystem down. This was in the logs
>
> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a
> != expected 0x1c08241c on 2:292cf221:::200.:head

Since this came from the OSD, you should look to resolve that
problem first. What you've done below is blow the journal away, which hasn't
helped you any, because (a) your journal is now probably lost without a
lot of manual intervention and (b) the "new" journal is still written
to the same bad backing device/file, so it's probably still unusable, as
you found out.

> I had one standby MDS, but as far as I can tell it did not fail over. This
> was in the logs

If a rank becomes damaged, standbys will not take over. You must mark
it repaired first.
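
A sketch of the command I mean, assuming rank 0 of your test-cephfs-1
filesystem:

ceph mds repaired test-cephfs-1:0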

> (insufficient standby MDS daemons available)
>
> Currently my ceph looks like this
>   cluster:
> id: ..
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 mds daemon damaged
>
>   services:
> mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
> mgr: ids27(active)
> mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
> osd: 5 osds: 5 up, 5 in
>
>   data:
> pools:   3 pools, 202 pgs
> objects: 1013k objects, 4018 GB
> usage:   12085 GB used, 6544 GB / 18630 GB avail
> pgs: 201 active+clean
>  1   active+clean+scrubbing+deep
>
>   io:
> client:   0 B/s rd, 0 op/s rd, 0 op/s wr
>
> I started trying to get the damaged MDS back online
>
> Based on this page
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
> # cephfs-journal-tool journal export backup.bin
> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable
> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not
> readable, attempt object-by-object dump with `rados`
> Error ((5) Input/output error)
>
> # cephfs-journal-tool event recover_dentries summary
> Events by type:
> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is
> unreadableErrors: 0
>
> cephfs-journal-tool journal reset - (I think this command might have worked)
>
> Next up, tried to reset the filesystem
>
> ceph fs reset test-cephfs-1 --yes-i-really-mean-it
>
> Each time same errors
>
> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE
> (was: 1 mds daemon damaged)
> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned
> to filesystem test-cephfs-1 as rank 0
> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200:
> (5) Input/output error
> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon
> damaged (MDS_DAMAGE)
> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a
> != expected 0x1c08241c on 2:292cf221:::200.:head
> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is
> degraded; 1 mds daemon damaged
>
> Tried to 'fail' mds.ds27
> # ceph mds fail ds27
> # failed mds gid 1929168
>
> Command worked, but each time I run the reset command the same errors above
> appear
>
> Online searches say the object read error has to be removed. But there's no
> object listed. This web page is the closest to the issue
> http://tracker.ceph.com/issues/20863
>
> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it
> completes but still have the same issue above
>
> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and
> has data it should become live. If it was not
> I assume we will lose the filesystem at this point
>
> Why didn't the standby MDS failover?
>
> Just looking for any way to recover the cephfs, thanks!

I think it's time to do a scrub on the PG containing that object.

-- 
Patrick Donnelly


Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Patrick Donnelly
On Thu, Jul 12, 2018 at 3:55 PM, Patrick Donnelly  wrote:
>> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it
>> completes but still have the same issue above
>>
>> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and
>> has data it should become live. If it was not
>> I assume we will lose the filesystem at this point
>>
>> Why didn't the standby MDS failover?
>>
>> Just looking for any way to recover the cephfs, thanks!
>
> I think it's time to do a scrub on the PG containing that object.

Sorry, I didn't read the part of the email that said you did that :) Did
you confirm that, after the deep scrub finished, the pg is
active+clean? It looks like you're still scrubbing that PG.
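
Something along these lines (a sketch) should show the current state of that PG:

ceph pg 2.4 query | grep '"state"'
ceph pg dump pgs_brief | grep '^2\.4'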

-- 
Patrick Donnelly


Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Oliver Freyermuth
Hi,

all of this sounds an awful lot like:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/027992.html
In that case, things started with an update to 12.2.6. Which version are you 
running? 

Cheers,
Oliver







Re: [ceph-users] OSD tuning no longer required?

2018-07-12 Thread Konstantin Shalygin

  I saw this in the Luminous release notes:

  "Each OSD now adjusts its default configuration based on whether the
backing device is an HDD or SSD. Manual tuning generally not required"

  Which tuning in particular?  The ones in my configuration are
osd_op_threads, osd_disk_threads, osd_recovery_max_active,
osd_op_thread_suicide_timeout, and osd_crush_chooseleaf_type, among
others.  Can I rip these out when I upgrade to
Luminous?


This means that some "bluestore_*" settings are tuned for nvme/hdd separately.

Also with Luminous we have:

osd_op_num_shards_(ssd|hdd)

osd_op_num_threads_per_shard_(ssd|hdd)

osd_recovery_sleep_(ssd|hdd)
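
If you want to see what an OSD actually ended up with after the upgrade, a
quick check is something like this (assuming an admin socket on the OSD host,
with osd.0 only as an example):

ceph daemon osd.0 config show | egrep 'osd_op_num_shards|osd_op_num_threads_per_shard|osd_recovery_sleep'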




k



Re: [ceph-users] Increase queue_depth in KVM

2018-07-12 Thread Konstantin Shalygin

I've seen some people using 'num_queues' but I don't have this parameter
in my schemas(libvirt version = 1.3.1, qemu version = 2.5.0



num-queues is available from qemu 2.7 [1]


[1] https://wiki.qemu.org/ChangeLog/2.7
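
As a rough sketch, with virtio-scsi the multi-queue knob sits on the
controller in the libvirt XML (queues='4' is only an example value, and this
assumes a libvirt/qemu combination new enough to honour it):

<controller type='scsi' index='0' model='virtio-scsi'>
  <driver queues='4'/>
</controller>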




k



Re: [ceph-users] MDS damaged

2018-07-12 Thread Adam Tygart
I've hit this today with an upgrade to 12.2.6 on my backup cluster.
Unfortunately there were issues with the logs (in that the files
weren't writable) until after the issue struck.

2018-07-13 00:16:54.437051 7f5a0a672700 -1 log_channel(cluster) log
[ERR] : 5.255 full-object read crc 0x4e97b4e != expected 0x6cfe829d on
5:aa448500:::500.:head

It is a backup cluster and I can keep it around or blow away the data
(in this instance) as needed for testing purposes.

--
Adam

On Thu, Jul 12, 2018 at 10:39 AM, Alessandro De Salvo
 wrote:
> Some progress, and more pain...
>
> I was able to recover the 200. using the ceph-objectstore-tool for
> one of the OSDs (all identical copies) but trying to re-inject it just with
> rados put was giving no error while the get was still giving the same I/O
> error. So the solution was to rm the object and the put it again, that
> worked.
>
> However, after restarting one of the MDSes and seeting it to repaired, I've
> hit another, similar problem:
>
>
> 2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log [ERR] :
> error reading table object 'mds0_inotable' -5 ((5) Input/output error)
>
>
> Can I safely try to do the same as for object 200.? Should I check
> something before trying it? Again, checking the copies of the object, they
> have identical md5sums on all the replicas.
>
> Thanks,
>
>
> Alessandro
>
>