[ceph-users] Behavior of ceph-fuse when network is down
Hi all,

To observe what happens to a ceph-fuse mount when the network is down, we blocked the network connections to all three monitors with iptables. If we restore the network immediately (within minutes), the blocked I/O requests complete and everything goes back to normal.

But if we keep the network blocked long enough, say twenty minutes, ceph-fuse does not recover. The ceph-fuse process is still there, but it can no longer handle I/O operations; df or ls hang indefinitely.

What is the retry policy of ceph-fuse? Is it normal for ceph-fuse to hang after the network blocking? If so, how can I make it return to normal after the network is recovered? If it is not normal, what might be the cause? How can I help debug this?

Thanks.
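P.S. For anyone who wants to reproduce this, the blocking can be done with iptables rules along the following lines. This is only a sketch, not our exact rules: it assumes the monitors listen on the default port 6789 and uses placeholder addresses 10.0.0.1-3.

# Block all client traffic to the three monitors (placeholder addresses):
iptables -A OUTPUT -p tcp -d 10.0.0.1 --dport 6789 -j DROP
iptables -A OUTPUT -p tcp -d 10.0.0.2 --dport 6789 -j DROP
iptables -A OUTPUT -p tcp -d 10.0.0.3 --dport 6789 -j DROP

# Restore connectivity later by deleting the same rules:
iptables -D OUTPUT -p tcp -d 10.0.0.1 --dport 6789 -j DROP
iptables -D OUTPUT -p tcp -d 10.0.0.2 --dport 6789 -j DROP
iptables -D OUTPUT -p tcp -d 10.0.0.3 --dport 6789 -j DROP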
Re: [ceph-users] Behavior of ceph-fuse when network is down
On Fri, Nov 24, 2017 at 4:59 PM, Zhang Qiang wrote:
> [...]
> What is the retry policy of ceph-fuse? Is it normal for ceph-fuse to hang
> after the network blocking? If so, how can I make it return to normal after
> the network is recovered? If it is not normal, what might be the cause?
> How can I help debug this?

You can use the 'kick_stale_sessions' ASOK command to make ceph-fuse reconnect, or set the 'client_reconnect_stale' config option to true. In addition, you need to set the MDS config option 'mds_session_blacklist_on_timeout' to false.
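For example, something along these lines (a sketch only: the admin socket path below is just an illustration, check under /var/run/ceph/ for the client's actual .asok file):

# Ask a stuck ceph-fuse to re-establish its stale MDS sessions:
ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok kick_stale_sessions

And in ceph.conf, roughly:

[client]
    client_reconnect_stale = true

[mds]
    mds_session_blacklist_on_timeout = false

With 'client_reconnect_stale' enabled the client should reconnect on its own once the network is back, so the ASOK command is only needed for the manual case.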
Re: [ceph-users] Sharing Bluestore WAL
On 23/11/17 17:19, meike.talb...@women-at-work.org wrote:
> Hello,
>
> in our present Ceph cluster we used to have 12 HDD OSDs per host.
> All OSDs shared a common SSD for journaling. The SSD was used as root
> device and the 12 journals were files in the /usr/share directory, like this:
>
> OSD 1 - data /dev/sda - journal /usr/share/sda
> OSD 2 - data /dev/sdb - journal /usr/share/sdb
> ...
>
> We now want to migrate to Bluestore and continue to use this approach.
> I tried to use "ceph-deploy osd prepare test04:sdc --bluestore --block-db
> /var/local/sdc-block --block-wal /var/local/sdc-wal" to set up an OSD,
> which essentially works.
>
> However, I'm wondering whether this is correct at all, and how I can make
> sure that sdc-block and sdc-wal do not fill up the SSD disk. Is there any
> option to limit the file size, and what is the recommended value for such
> an option?
>
> Thank you
>
> Meike

The maximum size of the WAL depends on cluster configuration values, but it will always be relatively small. There is no maximum DB size, nor, as it stands, good estimates for how large a DB may realistically grow. The expected behaviour is that if the DB outgrows its device it spills over onto the data device. I don't believe there is any option that would let you effectively limit the size of the files if you're using flat files to back your devices.

Using files for your DB/WAL is not recommended practice - you have the space problems you mention, and you also take a performance hit by sticking a filesystem layer in the middle of things. Realistically, you should partition your SSD and provide entire partitions as the devices on which to store your OSD DBs. There is no point in specifying the WAL as a separate device unless you're doing something advanced; it will be stored alongside the DB on the DB device if not otherwise specified, and since you're putting them on the same device anyway you gain nothing by splitting them. With everything partitioned off correctly, you don't have to worry about Ceph data encroaching on your root FS space.

I would also worry that unless that one SSD is very large, 12 HDDs : 1 SSD could be overdoing it. Filestore journals sustained a lot of writing but didn't need to be very large, comparatively; a Bluestore database with WAL is much lighter on I/O but needs considerably more standing space, since it permanently stores metadata rather than just journalling writes. If you can only spare a few GB of space for each DB, you will probably outgrow that very quickly and won't see much actual benefit from using the SSD.

Rich
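P.S. A sketch of what that partitioning could look like. Everything below is illustrative: the SSD device name (/dev/sdm) and the per-OSD DB size (30 GB) are made up, and you would repeat it once per HDD.

# Carve one DB partition per OSD out of the shared SSD, e.g. with sgdisk:
sgdisk --new=1:0:+30G /dev/sdm
sgdisk --new=2:0:+30G /dev/sdm
# ... one partition per HDD

# Then hand the whole partition to ceph-deploy instead of a file, and leave
# out --block-wal so the WAL lives inside the DB partition:
ceph-deploy osd prepare test04:sdc --bluestore --block-db /dev/sdm1
ceph-deploy osd prepare test04:sdd --bluestore --block-db /dev/sdm2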
[ceph-users] "failed to open ino"
Hi all,

with our Ceph Luminous CephFS, we're plagued with "failed to open ino" messages. These don't seem to affect daily business (in terms of "file access"). (There's a backup performance issue that may eventually be related, but I'll report on that in a different thread.)

Our Ceph currently is at v12.2.1 (git.1507910930.aea79b8b7a), on OpenSUSE Leap 42.3. Three Ceph nodes, 12 HDD OSDs, two SSD OSDs, status "HEALTH_OK". We have a single CephFS and two MDS (active/standby); the metadata pool is on the SSD OSDs, the content is on the HDD OSDs (all FileStore). That CephFS is mounted by several clients (via kernel cephfs support, mostly kernel version 4.4.76) and via NFS (kernel nfsd on a kernel-mounted CephFS).

In the log of the active MDS, we currently see the following two inodes reported over and over again, about every 30 seconds:

--- cut here ---
2017-11-24 18:24:16.496397 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 18:24:16.497037 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 18:24:16.500645 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 18:24:16.501218 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 18:24:46.506210 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 18:24:46.506926 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 18:24:46.510354 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 18:24:46.510891 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
--- cut here ---

There were other reported inodes with other errors, too ("err -5/0", for instance), but the root cause seems to be the same (see below).

The 0x10001e4d6a1 inode was first mentioned in the MDS log as follows, and then every 30 seconds as "failed to open". 0x10001e45e1d just appeared at the same time, "out of the blue":

--- cut here ---
2017-11-23 19:12:28.440107 7f3586cbd700 0 mds.0.bal replicating dir [dir 0x10001e4d6a1 /some/path/in/our/cephfs/ [2,head] auth pv=3473 v=3471 cv=2963/2963 ap=1+6+6 state=1610612738|complete f(v0 m2017-11-23 19:12:25.005299 15=4+11) n(v74 rc2017-11-23 19:12:28.429258 b137317605 5935=4875+1060)/n(v74 rc2017-11-23 19:12:28.337259 b139723963 5969=4898+1071) hs=15+23,ss=0+0 dirty=30 | child=1 dirty=1 authpin=1 0x5649e811e9c0] pop 10223.1 .. rdp 373 adj 0
2017-11-23 19:15:55.015347 7f3580cb1700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-23 19:15:55.016056 7f3580cb1700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
--- cut here ---

According to a message from this ML, "replicating dir" is supposed to indicate that the directory is hot and being replicated to another active MDS to spread the load. But there is no other active MDS, as we have only two in total (and Ceph properly reports "mds: cephfs-1/1/1 up {0=server1=up:active}, 1 up:standby-replay").

From what I took from the logs, all error messages seem to be related to that same path "/some/path/in/our/cephfs/" all the time, which gets deleted and recreated every evening - thus changing ino numbers. But what can it be that makes that same path, buried far down in some cephfs hierarchy, cause these "failed to open ino" messages? It's just one of a lot of directories, and not a particularly populated one (about 40 files and directories).
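If it helps, I can raise the MDS debug level and dump the MDS cache to see what still references those inodes, roughly like this (just a sketch; 'server1' is our active MDS and the output path is arbitrary):

# Temporarily raise MDS logging via the admin socket on the MDS host:
ceph daemon mds.server1 config set debug_mds 10

# Dump the MDS cache and look for the stale inode numbers:
ceph daemon mds.server1 dump cache /tmp/mds-cache.txt
grep -E '10001e45e1d|10001e4d6a1' /tmp/mds-cache.txt

# Reset logging to the default afterwards:
ceph daemon mds.server1 config set debug_mds 1/5

The cache dump should show whether those inodes are still in cache and which dentries or client sessions still pin them.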
Another possibly interesting side fact: messages for a previous inode sometimes stop when our evening/night job ends (no more active accesses to that directory), and sometimes they continue throughout the day. I've even noticed them crossing the directory recreation time (messages for the old inode don't stop when the old directory is deleted, except after a new "replicating dir" message appears). Today I deleted the directory (including its parent) during the day, but the messages wouldn't stop. The messages also survive switching to the other MDS.

And by now the next daily instance of our job has started, recreating the directories (that I had deleted throughout the day). The "failed to open ino" messages for the old ino numbers still appear every 30 seconds; I've not yet seen any "replicating dir" message for that area of the cephfs tree. I have seen a few for other areas of the cephfs tree, but no other ino numbers were reported as "failed to open" - only the two from above:

--- cut here ---
2017-11-24 19:22:13.50 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 19:22:13.000770 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24 19:22:13.003918 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e45e1d err -22/0
2017-11-24 19:22:13.004469 7fa308cf0700 0 mds.0.cache failed to open ino 0x10001e4d6a1 err -22/-22
2017-11-24
Re: [ceph-users] Behavior of ceph-fuse when network is down
Thanks! I'll check it out.

On 2017-11-24 17:58, "Yan, Zheng" wrote:
> [...]
> You can use the 'kick_stale_sessions' ASOK command to make ceph-fuse
> reconnect, or set the 'client_reconnect_stale' config option to true.
> In addition, you need to set the MDS config option
> 'mds_session_blacklist_on_timeout' to false.