In the style of a discussion over a beverage: after far too little
sleep, I recently found myself pondering a design for implementing user
quotas on ZFS.
It is probably nothing new, but I would be curious what you experts
think of the feasibility of such a system, and whether it would
realistically work.
I'm not suggesting that someone should do the work, or even that I will;
this is just in the interest of chatting about it.
Feel free to ridicule me as required! :)
Thoughts:
Here at work we would like to have user quotas based on uid (and
presumably gid) so we can fully replace the NetApps we run. Current
ZFS quotas are not good enough for our situation. We simply cannot
mount 500,000 file-systems on all the NFS clients. Nor do all the
servers we run support mirror-mounts. Nor does the automounter see
newly created directories without a full remount.
Current UFS-style user quotas are very exact, to the byte even. We do
not need this precision. If a user has 50MB of quota and manages to
reach 51MB of usage, that is acceptable to us, especially since they
have to drop back under 50MB before they can write new data anyway.
Instead of having complicated code in the kernel layer, slowing down the
file-system with locking and semaphores (and perhaps avoiding learning
in-depth ZFS code?), I was wondering if a simpler setup could be
designed that would still be acceptable. I will use the word
'acceptable' a lot. Sorry.
My thoughts are that the ZFS file-system would simply write a
'transaction log' to a pipe. By transaction log I mean uid, gid and
'byte count changed'. And by pipe I don't necessarily mean pipe(2); it
could be a fifo, pipe or socket, but currently I'm thinking
'/dev/quota' style.
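For concreteness, the log record could be as small as three fields. A
minimal sketch, assuming a hypothetical layout (these names are
illustrative, not an existing ZFS interface):

```c
#include <stdint.h>

/* Hypothetical record format for the '/dev/quota' transaction log.
 * One record per completed write/truncate/unlink: who owned the
 * file, and by how many bytes their usage changed. */
struct quota_rec {
	uint32_t qr_uid;   /* owner uid of the file that changed */
	uint32_t qr_gid;   /* owner gid of the file that changed */
	int64_t  qr_delta; /* bytes added (>0) or freed (<0)     */
};

/* Apply one record to a running per-uid byte count. */
static inline int64_t
qr_apply(int64_t usage, const struct quota_rec *r)
{
	return usage + r->qr_delta;
}
```

A fixed-size record also keeps the kernel-side producer trivial: it
never has to parse anything, only append.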
User-land would then have a daemon; whether it is one daemon per
file-system or just one daemon overall does not matter. This process
would open '/dev/quota' and constantly drain the transaction log
entries, taking the uid,gid entries and updating the byte counts in its
database. How we store this database is up to us, but since it is in
user-land it has more flexibility, and it is not as critical for it to
be fast as it would be in the kernel.
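The daemon's inner loop could be sketched like this, assuming the same
hypothetical (uid, gid, byte delta) record and using a plain array as a
stand-in for whatever database is chosen:

```c
#include <stddef.h>
#include <stdint.h>

/* Same hypothetical record as written by the kernel side. */
struct quota_rec {
	uint32_t qr_uid;
	uint32_t qr_gid;
	int64_t  qr_delta;
};

#define TOY_MAX_UID 65536

/* Stand-in for the daemon's usage database: per-uid byte counts.
 * A real daemon would use a persistent, indexed store instead. */
static int64_t toy_usage[TOY_MAX_UID];

/* Drain a batch of records read off '/dev/quota' into the database. */
void
quota_apply_batch(const struct quota_rec *recs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (recs[i].qr_uid < TOY_MAX_UID)
			toy_usage[recs[i].qr_uid] += recs[i].qr_delta;
}
```

Batching is deliberate: the daemon can read() a few kilobytes of
records at a time and fold them into the database in one pass, which
also makes it cheap to catch up after a restart.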
The daemon process can also grow in number of threads as demand increases.
Once a user's quota reaches the limit (note here that /the/ call to
write() that goes over the limit will succeed, and probably a couple
more after it; this is acceptable), the process will "blacklist" the
uid in the kernel. Future calls to creat/open(O_CREAT)/write/(insert
list of calls) will be denied. Naturally, calls to unlink/read etc.
should still succeed.
If the uid goes back under the limit, the blacklisting is removed.
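The blacklist itself could be tiny; a bitmap keyed by uid is enough to
sketch both the daemon-side policy and the check the kernel would do on
the write path (again, hypothetical names, not real ZFS code):

```c
#include <stdint.h>

#define BL_MAX_UID 65536

/* One bit per uid: set = writes denied. */
static uint8_t bl_map[BL_MAX_UID / 8];

void bl_set(uint32_t uid)   { bl_map[uid / 8] |=  (uint8_t)(1u << (uid % 8)); }
void bl_clear(uint32_t uid) { bl_map[uid / 8] &= (uint8_t)~(1u << (uid % 8)); }

/* What the kernel would consult in creat/open(O_CREAT)/write. */
int  bl_check(uint32_t uid) { return (bl_map[uid / 8] >> (uid % 8)) & 1; }

/* Daemon-side policy, called after each usage update: over the
 * limit -> blacklist; back under -> lift the blacklist. */
void
bl_enforce(uint32_t uid, int64_t usage, int64_t limit)
{
	if (limit > 0 && usage > limit)
		bl_set(uid);
	else
		bl_clear(uid);
}
```

The kernel-side check is then a single bit test per call, which is the
kind of small, cheap hook the whole design is after.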
If the user-land process crashes or dies, for whatever reason, the
buffer of the pipe will grow in the kernel. If the daemon is restarted
sufficiently quickly, all is well; it merely needs to catch up. If the
pipe ever does get full and items have to be discarded, a full scan of
the file-system will be required. Since even with UFS quotas we
occasionally need to run 'quotacheck', this too seems acceptable (if
undesirable).
If you have no daemon process running at all, you have no quotas at all.
But the same can be said about quite a few daemons; the administrators
need to adjust accordingly.
I can see a complication with doing a rescan: how could it be done
efficiently? I don't know if there is a neat way to make this happen
internally to ZFS, but from a user-land-only point of view, perhaps a
snapshot could be created (synchronised with the /dev/quota pipe
reading?) and a scan started on the snapshot, while the kernel log is
still being processed. Once the scan is complete, merge the two sets.
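The merge step reduces to simple arithmetic: the scan yields each uid's
byte count as of the snapshot, and the log drained during the scan
yields the deltas since. A sketch, assuming per-uid totals from both
sides in uid-indexed arrays:

```c
#include <stddef.h>
#include <stdint.h>

/* Combine a full-scan result (usage as of the snapshot) with the
 * deltas logged while the scan was running.  Both arrays are
 * indexed by uid; 'out' may alias 'scanned'. */
void
quota_merge(const int64_t *scanned, const int64_t *deltas_since,
            int64_t *out, size_t n_uids)
{
	for (size_t i = 0; i < n_uids; i++)
		out[i] = scanned[i] + deltas_since[i];
}
```

The synchronisation point matters: the snapshot must be taken at a
known position in the pipe stream, so that every record is counted on
exactly one side of the merge.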
Advantages are that only small hooks are required in ZFS: the byte
updates, and the blacklist with checks for being blacklisted.
Disadvantages are the loss of precision, and possibly slower
rescans? Sanity?
But I do not really know the internals of ZFS, so I might be completely
wrong, and everyone is laughing already.
Discuss?
Lund
--
Jorgen Lundman | <lund...@lundman.net>
Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
Japan | +81 (0)3 -3375-1767 (home)
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss