[ceph-users] Interesting re-shuffling of pg's after adding new osd
Hi,

Two days ago I added a new osd to one of my ceph machines, because one of the existing osd's got rather full. There was quite a difference in disk space usage between osd's, but I understand this is just how ceph works: it spreads data over osd's, but not perfectly evenly.

Now check out the graph of free disk space. You can clearly see the new 4TB osd being added and starting to fill up. It's also quite visible that some existing osd's profit more than others. And not only is data put onto the new osd, data is also exchanged between existing osd's. This is also why it takes so incredibly long to fill the new osd up: ceph is spending most of its time shuffling data around instead of moving it to the new osd.

Anyway, what is especially troubling is that the osd that was already lowest on disk space is actually filling up even more during this process (!) What's causing that, and how can I get ceph to do the reasonable thing? All crush weights are identical.

Thanks, Erik.
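(Not from the original message: a minimal sketch of how one might inspect per-osd utilization and relieve an over-full osd while backfill runs, assuming a Hammer-era cluster; osd.7, 0.9 and 110 are only illustrative values.)

~# ceph osd df                            # per-osd utilization next to crush weight and reweight
~# ceph osd reweight osd.7 0.9            # temporarily push some data off an over-full osd
~# ceph osd reweight-by-utilization 110   # or let ceph reweight osds more than 10% over the average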
[ceph-users] new relic ceph plugin
Hi all,

I want to know if someone has deployed a New Relic (python) plugin for Ceph.

Thanks a lot,
Best regards,
*Ger*
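(Not an answer from the thread: as a sketch, the cluster statistics such a plugin would typically poll are already available as JSON from the ceph CLI, which a Python agent could parse and forward; assumes the CLI and an admin keyring are available on the host running the agent.)

~# ceph status --format json          # overall health, pg states, monitor quorum
~# ceph df --format json              # raw and per-pool capacity/usage
~# ceph osd pool stats --format json  # per-pool client and recovery I/O rates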
Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
Hi,

Sorry for my late answer.

Gregory Farnum wrote:

>> 1. Is this kind of freeze normal? Can I avoid these freezes with a
>> more recent version of the kernel in the client?
>
> Yes, it's normal. Although you should have been able to do a lazy
> and/or force umount. :)

Ah, I haven't tried it. Maybe I'm wrong, but I think a "lazy" or a "force" umount wouldn't succeed. I'll try to test whether I can reproduce the freeze.

> You can't avoid the freeze with a newer client. :(
>
> If you notice the problem quickly enough, you should be able to
> reconnect everything by rebooting the MDS — although if the MDS hasn't
> failed the client then things shouldn't be blocking, so actually that
> probably won't help you.

Yes, the mds was completely ok, and after the hard reboot of the client, the client had access to the cephfs again with exactly the same mds service on the cluster side (no restart etc).

>> 2. Can I avoid these freezes with ceph-fuse instead of the kernel
>> cephfs module? But in this case, the cephfs performance will be
>> worse. Am I wrong?
>
> No, ceph-fuse will suffer the same blockage, although obviously in
> userspace it's a bit easier to clean up.

Yes, I suppose that after "kill" commands I would be able to remount the cephfs without any reboot etc. Is that right?

> Depending on your workload it
> will be slightly faster to a lot slower. Though you'll also get
> updates faster/more easily. ;)

Yes, I imagine that with ceph-fuse I have a completely up-to-date cephfs client (in user space), whereas with the kernel cephfs client I have just the version available in the current kernel of my client node (3.16 in my case).

>> 3. Is there a parameter in ceph.conf to tell the mds to be more patient
>> before closing the "stale session" of a client?
>
> Yes. You'll need to increase the "mds session timeout" value on the
> MDS; it currently defaults to 60 seconds. You can increase that to
> whatever value you like. The tradeoff here is that if you have a
> client die, anything it had "capabilities" on (for read/write access)
> will be unavailable for anybody who's doing something that might
> conflict with those capabilities.

Ok, thanks for the warning, it seems logical.

> If you've got a new enough MDS (Hammer, probably, but you can check)

Yes, I use Hammer.

> then you can use the admin socket to boot specific sessions, so it may
> suit you to set very large timeouts and manually zap any client which
> actually goes away badly (rather than getting disconnected by the
> network).

Ok, I see. According to the online documentation, the way to close a cephfs client session is:

    ceph daemon mds.$id session ls       # to get the $session_id and the $address
    ceph osd blacklist add $address
    ceph osd dump                        # to get the $epoch
    ceph daemon mds.$id osdmap barrier $epoch
    ceph daemon mds.$id session evict $session_id

Is that correct? With the commands above, could I reproduce the client freeze in my testing cluster? I'll try, because it's convenient to be able to reproduce the problem just with command lines (without really stopping the network on the client, etc). I would like to test whether, with ceph-fuse, I can easily restore the situation of my client.

>> I'm in a testing period and a hard reboot of my cephfs clients would
>> be quite annoying for me. Thanks in advance for your help.
>
> Yeah. Unfortunately there's a basic tradeoff in strictly-consistent
> (aka POSIX) network filesystems here: if the network goes away, you
> can't be consistent any more because the disconnected client can make
> conflicting changes.
> And you can't tell exactly when the network disappeared.

And could it be conceivable one day (for instance with an option) to be able to change the behavior of cephfs to be *not* strictly consistent, like NFS for instance? It seems to me it could improve the performance of cephfs, and cephfs could be more flexible concerning short network failures (not really sure about this second point). Ok, it's just a remark from a simple and unqualified ceph user ;) but it seems to me that NFS isn't strictly consistent and generally this is not a problem in many use cases. Am I wrong?

> So while we hope to make this less painful in the future, the network
> dying that badly is a failure case that you need to be aware of,
> meaning that the client might have conflicting information. If it
> *does* have conflicting info, the best we can do about it is be
> polite, return a bunch of error codes, and unmount gracefully. We'll
> get there eventually but it's a lot of work.

Yes, I can imagine the amount of work... Thanks a lot Greg for your answer. ;)

--
François Lafont
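(Not from the thread: a minimal sketch of how the "mds session timeout" suggestion above could be applied, assuming a Hammer-era cluster; mds.a and the value 300 are only illustrations.)

    # in ceph.conf on the MDS host, then restart the mds:
    [mds]
        mds session timeout = 300

~# ceph tell mds.a injectargs '--mds_session_timeout 300'   # or change it at runtime (seconds)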
Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
John Spray wrote:

> Greg's response is pretty comprehensive, but for completeness I'll add that
> the specific case of shutdown blocking is http://tracker.ceph.com/issues/9477

Yes indeed, during the freeze, "INFO: task sync:3132 blocked for more than 120 seconds..." was exactly the message I saw in the VNC console of the client (it was an OpenStack VM).

--
François Lafont
Re: [ceph-users] How to backup hundreds or thousands of TB
Hi,

Wido den Hollander wrote:

> Aren't snapshots something that should protect you against removal? IF
> snapshots work properly in CephFS you could create a snapshot every hour.

Are you talking about the .snap/ directory in a cephfs directory? If yes, does it work well? Because, with Hammer, if I want to enable this feature:

~# ceph mds set allow_new_snaps true
Error EPERM: Snapshots are unstable and will probably break your FS!
Set to --yes-i-really-mean-it if you are sure you want to enable them

I have never tried with the --yes-i-really-mean-it option. The warning is not very encouraging. ;)

> With the recursive statistics [0] of CephFS you could "easily" backup
> all your data to a different Ceph system or anything not Ceph.

What is the link between this (very interesting) recursive statistics feature and backups? I'm not sure I understand. Can you explain? Maybe you check whether the size of a directory has changed?

> I've done this with a ~700TB CephFS cluster and that is still working
> properly.
>
> Wido
>
> [0]:
> http://blog.widodh.nl/2015/04/playing-with-cephfs-recursive-statistics/

Thanks Wido for this very interesting (and very simple) feature. But does it work well? Because I use Hammer on Ubuntu Trusty cluster nodes, and on an Ubuntu Trusty client with a 3.16 kernel and cephfs mounted with the kernel module client, I see this:

~# mount | grep cephfs    # /mnt is my mounted cephfs
10.0.2.150,10.0.2.151,10.0.2.152:/ on /mnt type ceph (noacl,name=cephfs,key=client.cephfs)

~# ls -lah /mnt/dir1/
total 0
drwxr-xr-x 1 root root  96M May 12 21:06 .
drwxr-xr-x 1 root root 103M May 17 23:56 ..
drwxr-xr-x 1 root root  96M May 12 21:06 8
drwxr-xr-x 1 root root 4.0M May 17 23:57 test

As you can see:

/mnt/dir1/8/    => 96M
/mnt/dir1/test/ => 4.0M

But:

/mnt/dir1/ (ie .) => 96M

I should have:

size("/mnt/dir1/") = size("/mnt/dir1/8/") + size("/mnt/dir1/test/")

and this is not the case. Is that normal?

--
François Lafont
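(Not from the thread: for readers wondering how the recursive statistics are exposed, a minimal sketch using the CephFS virtual extended attributes, assuming a cephfs mount at /mnt; the directory name is only an example. The rctime value is what a backup script could compare against its last run to find changed subtrees.)

~# getfattr -d -m 'ceph.dir.*' /mnt/dir1   # dump all recursive stats of a directory
~# getfattr -n ceph.dir.rbytes /mnt/dir1   # total bytes stored under the tree
~# getfattr -n ceph.dir.rctime /mnt/dir1   # most recent change time anywhere under the tree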
[ceph-users] PG scrubbing taking a long time
Hello everyone.

I'm seeing something interesting: I have a PG that has been doing a deep scrub for 3 days. Other PGs start scrubbing and finish within a minute or two, but this PG just will not finish scrubbing at all.

Any ideas on how I can kick the scrub or nudge it into finishing?

Thanks.

===
Tu Holmes
tu.hol...@gmail.com
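(Not an answer from the thread: a minimal sketch of how one might inspect and re-kick a stuck deep scrub; pg 3.5f and osd.12 are placeholders, and the restart syntax assumes upstart on Ubuntu Trusty.)

~# ceph health detail                  # which pgs are flagged, and for how long
~# ceph pg dump | grep scrubbing       # find pgs stuck in a (deep-)scrubbing state
~# ceph pg map 3.5f                    # acting set and primary osd for the stuck pg
~# ceph pg deep-scrub 3.5f             # re-issue the deep scrub
# if it stays stuck, restarting the primary osd usually aborts the hung scrub:
~# restart ceph-osd id=12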