We’ve been hard at work on CephFS over the last year since Firefly was 
released, and with Hammer coming out it seemed like a good time to go over some 
of the big developments users will find interesting. Much of this is cribbed 
from John’s talk at the Vault conference
(http://events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf)
and from the release notes (http://ceph.com/docs/master/release-notes/).
===========================================================================
New Filesystem features & improvements:

ceph-fuse has gained support for fcntl and flock locking. (Yan, Zheng) This has
been supported in the kernel client for a while, but nobody had done the work
to implement the tracking structures and wire them up in userspace.
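
As a quick illustration, locking now works with the standard tools and
syscalls against a ceph-fuse mount; for instance, with the flock(1) utility
(the mount point and file below are just examples):

  # Hold an exclusive lock on a file in a ceph-fuse mount while a command
  # runs; a second invocation against the same file will block until the
  # first one releases the lock.
  flock --exclusive /mnt/cephfs/shared.lock -c 'echo "got the lock"; sleep 10'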

ceph-fuse has gained support for soft quotas, enforced on the client side.
(Yunchuan Wen) The Ubuntu Kylin guys worked on this for quite a while and we
thank them for their work and their patience. You can now set a soft quota
on a directory and ceph-fuse will enforce it as you’d expect.
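
The quota itself is set via extended attributes on the directory. A minimal
sketch (the mount point, paths, and sizes are arbitrary, and I’m going from
memory on the exact attribute names, so check the docs for your release):

  # Limit a directory tree to ~10 GB and 10,000 files; setting a value of 0
  # removes the limit again.
  setfattr -n ceph.quota.max_bytes -v 10000000000 /mnt/cephfs/projects
  setfattr -n ceph.quota.max_files -v 10000 /mnt/cephfs/projects

  # Read the current setting back.
  getfattr -n ceph.quota.max_bytes /mnt/cephfs/projects

Keep in mind that enforcement is client-side, so only clients that understand
quotas (ceph-fuse here) will actually respect the limits.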

Hadoop support has been generally improved and updated. (Noah Watkins, Huamin 
Chen) It now works against the Hadoop 2.0 API, the tests we run in our lab are more
sophisticated, and it’s a lot friendlier to install with Maven and other Java 
tools. Noah’s still doing work on this to make it as turnkey as possible, but 
soon you’ll just need to drop a single JAR on the system (this will include the 
libcephfs stuff, so you don’t even need to worry about those packages and 
compatibility!) and change a few config options.
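
If you want to try it before that lands, the setup is roughly: put the
cephfs-hadoop bindings (and their libcephfs dependency) on Hadoop’s classpath,
then point Hadoop at the filesystem in core-site.xml. The property names below
are a sketch from memory and may differ between plugin versions, so
double-check against the plugin’s own documentation:

  <!-- core-site.xml fragment (illustrative; verify the property names) -->
  <property>
    <name>fs.defaultFS</name>
    <value>ceph://your-mon-host:6789/</value>
  </property>
  <property>
    <name>fs.ceph.impl</name>
    <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
  </property>
  <property>
    <name>ceph.conf.file</name>
    <value>/etc/ceph/ceph.conf</value>
  </property>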

ceph-fuse and CephFS as a whole now handle a full cluster much better. If you
run out of space at the RADOS layer you will get ENOSPC errors in the client
(instead of it retrying indefinitely), and these errors (and others) are now
propagated out to fsync() and close() calls.
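
The practical upshot is that a writer against a full cluster fails promptly
instead of hanging. For example, something like this (the path is made up)
will now report the error through the final fsync rather than retrying
forever:

  # conv=fsync makes dd call fsync() before exiting, so data that can’t be
  # written because the cluster is full surfaces as an ENOSPC error instead
  # of the client silently retrying.
  dd if=/dev/zero of=/mnt/cephfs/filler bs=4M conv=fsync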

We are now much more consistent in our handling of timestamps. Previously we
attempted to take the time from whichever process was responsible for making a
change, which could be either a client or the MDS. That was troublesome if
their clocks weren’t synchronized (made worse by trying not to let time move
backwards), and some applications which relied on mtime and ctime values as
versions (Hadoop and rsync both did this in certain configurations) were
unhappy. We now use a timestamp provided by the client for all operations,
which has proven much more stable.

Certain internal data structures are now much more scalable on a per-client 
level. We had issues when certain “MDSTables” got too large, but John Spray 
sorted them out.

The reconnect phase, when an MDS is restarted or dies and the clients have to 
connect to a different daemon, has been made much faster in the typical case. 
(Yan, Zheng)

===========================================================================
Administrator features & improvements:

The MDS has gained an OpTracker, with functionality similar to that in the OSD. 
You can dump in-flight requests and notably slow ones from the recent past. The 
changes to enable this also made working with many code paths a lot easier.
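
As with the OSD, this is exposed through the admin socket. Something along
these lines should work (the daemon name "mds.a" is just a placeholder for
whatever your MDS is called):

  # Requests currently in flight, and notably slow ones from the recent past.
  ceph daemon mds.a dump_ops_in_flight
  ceph daemon mds.a dump_historic_ops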

We’ve changed how you create and manage CephFS file systems in a cluster. (John 
Spray) The “data” and “metadata” pools are no longer created by default, and 
the management is done via monitor commands that start with “ceph fs” (e.g., 
“ceph fs new”). These have been designed with future extensions in mind, but 
for now they mostly replicate existing features with more consistency and 
improved repeatability/idempotency.
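
Setting up a filesystem from scratch now looks something like this (the pool
names and PG counts here are just examples):

  # Create the pools yourself, then tie them together into a filesystem.
  ceph osd pool create cephfs_metadata 64
  ceph osd pool create cephfs_data 64
  ceph fs new cephfs cephfs_metadata cephfs_data

  # List the filesystems the cluster knows about.
  ceph fs ls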

The MDS now reports a variety of health metrics to the monitor, joining the
existing OSD and monitor health reports. These include information about
misbehaving clients and about internal MDS data structures. (John Spray)

The MDS admin socket now includes a bunch of new commands. You can examine and
evict client sessions, plus perform filesystem repair operations (see below).
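
For example (the daemon name and client ID are placeholders, and the exact
argument form for eviction has varied a little between releases):

  # Show connected clients, including the metadata described just below.
  ceph daemon mds.a session ls

  # Forcibly remove a client session using the ID from the listing above.
  ceph daemon mds.a session evict 4305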

The MDS now gathers metadata from the clients about who they are and shares 
that with users via a variety of helpful interfaces and warning messages. (John 
Spray)

===========================================================================
Recovery tools

We have a new MDS journal format and a new cephfs-journal-tool. (John Spray)
This eliminates the days of needing to hex-edit a journal dump in order to let
your MDS start back up: you can inspect the journal state (human-readable or
JSON, great for our testing!) and make changes on a per-event level. It also
includes the ability to scan through hopelessly broken journals and parse out
whatever data is available for flushing to the backing RADOS objects.

Similarly, there’s a cephfs-table-tool for working with the SessionTable,
InoTable, and SnapTable. (John Spray)
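
A taste of what that looks like in practice, run against a stopped MDS rank
(the file names and rank number are just examples; the exact subcommands are
documented with the tools themselves):

  # Check the journal’s integrity, and keep a backup before touching anything.
  cephfs-journal-tool journal inspect
  cephfs-journal-tool journal export /tmp/journal-backup.bin

  # Dump events as JSON for inspection, or salvage whatever is recoverable
  # from a damaged journal back into the metadata pool.
  cephfs-journal-tool event get json --path /tmp/events.json
  cephfs-journal-tool event recover_dentries summary

  # Reset a per-rank table (here, rank 0’s session table).
  cephfs-table-tool 0 reset session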

We’ve added new “scrub_path” and “flush_path” commands to the admin socket.
These are fairly limited right now, but they will check that directories and
files are self-consistent. They’re building blocks for the “forward scrub” and
fsck features that I’ve been working on, and they come with a lot of
code-level work to enable those.
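
Usage is simple; something like the following (again, the daemon name and
path are placeholders):

  # Flush a directory’s journaled metadata out to the backing RADOS objects,
  # then check that the directory and its files are self-consistent.
  ceph daemon mds.a flush_path /some/directory
  ceph daemon mds.a scrub_path /some/directory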

===========================================================================
Performance improvements

Both the kernel and userspace clients are a lot more efficient with some of
their “capability” and directory-content handling. This lets them serve far
more out of local cache, far more often, than they could previously. This is
particularly noticeable in workloads where a single client “owned” a
directory but another client periodically peeked in on it.
There are also a bunch of further improvements in this area that have gone in
since Hammer and will be released in Infernalis. ;)

The code in the MDS that handles the journaling has been split into a separate 
thread. (Yan, Zheng) This has increased maximum throughput a fair bit and is 
the first major improvement enabled by John’s work to start breaking down the 
big MDS lock. (We still have a big MDS lock, but in addition to the journal it 
no longer covers the Objecter. Setting up the interfaces to make that 
manageable should make future lock sharding and changes a lot simpler than they 
would have been previously.)

===========================================================================
Developer & test improvements

In addition to a slightly expanded set of black-box tests, we now test
specific filesystem behaviors to make sure everything works as expected in
particular scenarios (failure and otherwise). This is largely thanks to John,
but we’re doing more with it in general as we add features that can be tested
this way.

As alluded to in previous sections, we’ve done a lot of work that makes the MDS 
codebase a lot easier to work with. Interfaces, if not exactly bright and 
shining, are a lot cleaner than they used to be. Locking is a lot more explicit 
and easier to reason about in many places. There are fewer special paths for 
specific kinds of operations, and a lot more shared paths that everything goes 
through — which means we have more invariants we can assume on every operation.

===========================================================================
Notable bug reductions

Although we continue to leave snapshots disabled by default and don’t recommend 
multi-MDS systems, both of these have been *dramatically* improved by Zheng’s 
hard work. Our multimds suite now passes almost all of the existing tests, 
whereas it previously failed most of them 
(http://pulpito.ceph.com/?suite=multimds). Our snapshot tests now pass
reliably, using snapshots is no longer a shortcut to breaking your system, and
snapshot bugs are less likely to leave your entire filesystem inaccessible.
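
If you do want to experiment with either of these on a throwaway cluster,
they’re behind explicit switches. The exact commands have moved around a bit
between releases, but in the Hammer era it’s roughly the following (please
don’t do this on data you care about):

  # Allow more than one active MDS (still not recommended).
  ceph mds set max_mds 2

  # Allow snapshot creation; the scary flag is required on purpose.
  ceph mds set allow_new_snaps true --yes-i-really-mean-it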


There’s a lot more I haven’t discussed above, such as how the entire stack is
a lot more tolerant of failures elsewhere than it used to be, so bugs are less
likely to make your entire filesystem inaccessible. But those are some of the
biggest features and improvements that users are likely to notice or might have 
been waiting on before they decided to test it out. It’s nice to reflect 
occasionally — I knew we were getting a lot done, but this list is much longer 
than I’d initially thought it would be!
-Greg