Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Yan, Zheng
On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett wrote: > > Dear All, > > Unfortunately the MDS has crashed on our Mimic cluster... > > First symptoms were rsync giving: > "No space left on device (28)" > when trying to rename or delete > > This prompted me to try restarting the MDS, as it reported l

Re: [ceph-users] Controlling CephFS hard link "primary name" for recursive stat

2019-02-11 Thread Yan, Zheng
On Sat, Feb 9, 2019 at 8:10 AM Hector Martin wrote: > > Hi list, > > As I understand it, CephFS implements hard links as effectively "smart > soft links", where one link is the primary for the inode and the others > effectively reference it. When it comes to directories, the size for a > hardlinke
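[For readers unfamiliar with the recursive stats being discussed: CephFS exposes them as virtual extended attributes on directories, readable from any kernel or FUSE client mount. A minimal way to inspect them; the mount path below is hypothetical:

    # recursive byte count and file count for a directory subtree
    getfattr -n ceph.dir.rbytes /mnt/cephfs/mydir
    getfattr -n ceph.dir.rfiles /mnt/cephfs/mydir

These rstats are what make hard links awkward: the subtree containing the "primary" link is the one whose rstats account for the inode.]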

Re: [ceph-users] pool/volume live migration

2019-02-11 Thread Luis Periquito
Hi Jason, that's been very helpful, but it got me thinking and looking. The pool name is both inside the libvirt.xml (and running KVM config) and it's cached in the Nova database. For it to change would require a detach/attach, which may not be viable or easy, especially for boot volumes. What abo
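[For reference, the RBD image live-migration workflow this thread anticipates shipped in Nautilus. A rough sketch, with pool and image names hypothetical; the source image must be closed when prepare runs:

    # prepare creates the target image and links it to the source
    rbd migration prepare oldpool/myimage newpool/myimage
    # copy the block data in the background
    rbd migration execute newpool/myimage
    # sever the link to the source once the copy completes
    rbd migration commit newpool/myimage

As Luis notes, this doesn't help if the consumer (libvirt/Nova) has the old pool name baked into its config.]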

[ceph-users] NAS solution for CephFS

2019-02-11 Thread Marvin Zhang
Hi, As http://docs.ceph.com/docs/master/cephfs/nfs/ says, it's OK to configure active/passive NFS-Ganesha to use CephFS. My question is whether we can use active/active NFS-Ganesha for CephFS. As I see it, state consistency is the only thing we need to think about. 1. Lock support for Active/Active. Even each nfs-gane
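[For concreteness, a minimal NFS-Ganesha export over the CephFS FSAL looks roughly like the sketch below. This is an illustration only; the Export_ID, paths, and the extra recovery/grace-period configuration a real active/active deployment needs are omitted or hypothetical:

    EXPORT {
        Export_ID = 1;
        Path = "/";            # path within CephFS
        Pseudo = "/cephfs";    # NFSv4 pseudo-root path
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL {
            Name = CEPH;
        }
    }

The active/active question is then whether multiple Ganesha heads exporting the same CephFS path can keep NFS lock and open state consistent between them.]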

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-11 Thread Igor Fedotov
On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: another mempool dump after 1h run. (latency ok) Biggest difference: before restart - "bluestore_cache_other": { "items": 48661920, "bytes": 1539544228 }, "bluestore_cache_data": { "items": 54, "bytes": 643072 }, (other caches seem to b
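[For anyone wanting to reproduce the comparison, the dump quoted above comes from the OSD admin socket; osd.0 below is a placeholder for whichever OSD you are inspecting:

    # dump current mempool usage (items and bytes per pool) for one OSD
    ceph daemon osd.0 dump_mempools

Taking one dump before and one after the latency builds up makes the bluestore_cache_* growth visible.]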

Re: [ceph-users] Upgrade Luminous to mimic on Ubuntu 18.04

2019-02-11 Thread ceph
Hello Ashley, On 9 February 2019 17:30:31 CET, Ashley Merrick wrote: >What does the output of apt-get update look like on one of the nodes? > >You can just list the lines that mention CEPH > ... .. . Get:6 https://download.ceph.com/debian-luminous bionic InRelease [8393 B] ... .. . The last a
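[The apt output above shows the node still tracking the debian-luminous suite, so an upgrade to mimic can never be offered. A minimal sketch of switching the repo, assuming it lives in /etc/apt/sources.list.d/ceph.list (back the file up first):

    sed -i 's/debian-luminous/debian-mimic/' /etc/apt/sources.list.d/ceph.list
    apt-get update
    apt-get dist-upgrade
]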

Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Jake Grimmett
Hi Zheng, Many, many thanks for your help... Your suggestion of setting large values for mds_cache_size and mds_cache_memory_limit stopped our MDS crashing :) The values in ceph.conf are now: mds_cache_size = 8589934592 mds_cache_memory_limit = 17179869184 Should these values be left in our co
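[For context: mds_cache_size is an inode-count limit and has been deprecated since Luminous in favour of mds_cache_memory_limit, so in practice only the latter matters here. The quoted values correspond to a ceph.conf stanza like the sketch below, and can also be applied at runtime without a restart:

    [mds]
    mds_cache_memory_limit = 17179869184   # 16 GiB

    # runtime equivalent, no MDS restart needed:
    ceph tell mds.* injectargs '--mds_cache_memory_limit=17179869184'
]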

Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Jake Grimmett
Hi Zheng, Sorry - I've just re-read your email and saw your instruction to restore the mds_cache_size and mds_cache_memory_limit to original values if the MDS does not crash - I have now done this... thanks again for your help, best regards, Jake On 2/11/19 12:01 PM, Jake Grimmett wrote: > Hi

Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4) elist.h: 39: FAILED assert(!is_on_list())

2019-02-11 Thread Yan, Zheng
On Mon, Feb 11, 2019 at 8:01 PM Jake Grimmett wrote: > > Hi Zheng, > > Many, many thanks for your help... > > Your suggestion of setting large values for mds_cache_size and > mds_cache_memory_limit stopped our MDS crashing :) > > The values in ceph.conf are now: > > mds_cache_size = 8589934592 > m

Re: [ceph-users] pool/volume live migration

2019-02-11 Thread Jason Dillaman
On Mon, Feb 11, 2019 at 4:53 AM Luis Periquito wrote: > > Hi Jason, > > that's been very helpful, but it got me thinking and looking. > > The pool name is both inside the libvirt.xml (and running KVM config) > and it's cached in the Nova database. For it to change would require a > detach/attach w

[ceph-users] Update / upgrade cluster with MDS from 12.2.7 to 12.2.11

2019-02-11 Thread Götz Reinicke
Hi, as 12.2.11 has been out for some days and no panic mails showed up on the list, I was planning to update too. I know there are recommended orders in which to update/upgrade the cluster, but I don't know how the rpm packages handle restarting services after a yum update. E.g. when MDS and MONs are
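[On the ordering question: the usual sequence for a Luminous point release is monitors first, then OSDs, then MDS/RGW, restarting each daemon explicitly after the packages land rather than relying on the rpm scriptlets. A rough sketch:

    ceph osd set noout                  # avoid rebalancing during the restarts
    # on each monitor host, after yum update:
    systemctl restart ceph-mon.target
    # then on each OSD host:
    systemctl restart ceph-osd.target
    # finally the MDS hosts:
    systemctl restart ceph-mds.target
    ceph osd unset noout
]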

Re: [ceph-users] faster switch to another mds

2019-02-11 Thread Gregory Farnum
You can't tell from the client log here, but probably the MDS itself was failing over to a new instance during that interval. There's not much experience with it, but you could experiment with faster failover by reducing the mds beacon and grace times. This may or may not work reliably... On Sat,
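[A minimal sketch of the knobs Greg mentions. The defaults are a 4-second beacon interval and a 15-second grace; the lowered values below are untested guesses for experimentation, not recommendations:

    [global]
    mds_beacon_interval = 1   # default 4: seconds between MDS beacons to the mons
    mds_beacon_grace = 5      # default 15: seconds of silence before an MDS is declared laggy

Shorter grace means faster failover to a standby, at the cost of spurious failovers if the active MDS is merely busy.]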

Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-11 Thread Gregory Farnum
On Thu, Feb 7, 2019 at 3:31 AM Hector Martin wrote: > On 07/02/2019 19:47, Marc Roos wrote: > > > > Is this difference not related to caching? And are you filling up some > > cache/queue at some point? If you do a sync after each write, do you > > still have the same results? > > No, the slow operat
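[Marc's suggestion amounts to forcing each write out of the page cache so caching can be ruled out. A quick way to compare the two behaviours; the file path is hypothetical:

    # buffered writes (may be absorbed by the client page cache):
    dd if=/dev/zero of=/mnt/cephfs/testfile bs=4k count=1000
    # same workload, but synced on every write:
    dd if=/dev/zero of=/mnt/cephfs/testfile bs=4k count=1000 oflag=sync
]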

Re: [ceph-users] Update / upgrade cluster with MDS from 12.2.7 to 12.2.11

2019-02-11 Thread Patrick Donnelly
On Mon, Feb 11, 2019 at 12:10 PM Götz Reinicke wrote: > as 12.2.11 has been out for some days and no panic mails showed up on the list, I > was planning to update too. > > I know there are recommended orders in which to update/upgrade the cluster > but I don't know how rpm packages are handling restarti

Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2019-02-11 Thread ST Wong (ITSC)
Hi all, Tested 4 cases. Cases 1-3 are as expected, while for case 4 the rebuild didn't take place on the surviving room, as Gregory mentioned. Repeating case 4 several times on both rooms gave the same result. We're running mimic 13.2.2. E.g. Room1: Host 1 (osd 2,5), Host 2 (osd 1,3); Room 2 <-- failed roo
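[The case-4 behaviour is what one would expect from a room-pinned rule: if the rule places exactly two replicas in each of two rooms, CRUSH has nowhere to remap the lost copies when a whole room fails, so no rebuild happens in the surviving room. A hypothetical reconstruction of such a rule:

    rule replicated_two_rooms {
        id 1
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type room        # exactly two rooms
        step chooseleaf firstn 2 type host    # two hosts within each room
        step emit
    }
]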

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-11 Thread Massimo Sgaravatto
Thanks a lot Brad! The problem is indeed in the network: we moved the OSD nodes back to the "old" switches and the problem disappeared. Now we have to figure out what is wrong/misconfigured with the new switch: we will try to replicate the problem, possibly without a ceph deployment ... Thanks
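[One way to exercise the new switch without Ceph in the picture is a plain iperf3 run between two of the OSD hosts; the host name below is hypothetical. Parallel streams tend to surface flow-control and buffer problems that a single stream hides:

    # on one node:
    iperf3 -s
    # on another:
    iperf3 -c osd-node1 -P 8 -t 60
]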

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-11 Thread Brad Hubbard
Glad to help! On Tue, Feb 12, 2019 at 4:55 PM Massimo Sgaravatto wrote: > > Thanks a lot Brad ! > > The problem is indeed in the network: we moved the OSD nodes back to the > "old" switches and the problem disappeared. > > Now we have to figure out what is wrong/misconfigured with the new switch