In general, you may need to keep data in one dataset if it is somehow
related (e.g. the backup of a specific machine or program, a user's home,
etc.) and if you plan to manage it in a consistent manner. For example,
CIFS shares cannot be nested, so for a unitary share (like "distribs")
you would probably want one dataset. Also, you can only have hardlinks
within one FS dataset, so if you maintain different views into a
distribution set (e.g. sorted by vendor or sorted by software type) and
you do it with hardlinks, you need one dataset as well. If you often
move (link and unlink) files around, e.g. from an "incoming" directory
to final storage, you may or may not want that "incoming" in the same
dataset; this depends on other considerations too.

You want to split datasets when you need them to have different features
and perhaps different uses: to have them as separate shares, to enforce
separate quotas and reservations, perhaps to delegate administration to
particular OS users (e.g. let a user manage snapshots of his own homedir)
and/or local zones. Don't forget about individual dataset properties
(e.g. you may want compression for source code files but not for a
multimedia collection), snapshots and clones, etc. (A short command
sketch of these knobs follows after the quoted points below.)

> 2. space management (we have wasted space in some pools while others
>    are starved)

Well, that's a reason to decrease the number of pools, but not datasets ;)

> 3. tool speed
>
> I do not have good numbers for time to do some of these operations
> as we are down to under 200 datasets (1/3 of the way through the
> migration to the new layout). I do have log entries that point to
> about a minute to complete a `zfs list` operation.
>
> > Would I run into any problems when snapshots are taken (almost)
> > simultaneously from multiple filesystems at once?
>
> Our logs show snapshot creation time at 2 seconds or less, but we
> do not try to do them all at once, we walk the list of datasets and
> process (snapshot and replicate) each in turn.
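For illustration, the per-dataset knobs I mentioned at the top boil down
to simple commands. This is just a sketch - the "pond/src" and
"pond/media" datasets and the user "alice" are made-up examples:

Per-dataset compression, e.g. for source code but not for multimedia:

  # zfs set compression=on pond/src
  # zfs set compression=off pond/media

A separate quota and reservation for one user's home dataset:

  # zfs set quota=20G pond/export/home/alice
  # zfs set reservation=5G pond/export/home/alice

Delegating snapshot administration of that home to its owner, so the
user can create and remove his own snapshots:

  # zfs allow alice snapshot,destroy,mount pond/export/home/alice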
I can partially relate to those numbers. We have a Thumper system running
OpenSolaris SXCE snv_177, with a separate dataset for each user's home
directory, for the backups of each individual remote machine, for each VM
image, each local zone, etc. - in particular, to have a separate snapshot
history and the possibility to clone just what we need.

Its relatively large number of filesystems (about 350) is or is not a
problem, depending on the tool used. For example, a typical import of the
main pool may take up to 8 minutes when in safe mode, but many of the
delays seem to be related to attempts to share_nfs and share_cifs while
the network is down ;)

Auto-snapshots are on, and listing them does indeed take rather long:

[root@thumper ~]# time zfs list -tall -r pond | wc -l
   56528

real    0m18.146s
user    0m7.360s
sys     0m10.084s

[root@thumper ~]# time zfs list -tvolume -r pond | wc -l
       5

real    0m0.096s
user    0m0.025s
sys     0m0.073s

[root@thumper ~]# time zfs list -tfilesystem -r pond | wc -l
     353

real    0m0.123s
user    0m0.052s
sys     0m0.073s

Some operations, like listing the filesystems, SEEM slow due to the
terminal, but in fact are rather quick:

[root@thumper ~]# time df -k | wc -l
     363

real    0m2.104s
user    0m0.094s
sys     0m0.183s

However, low-level system programs may have problems with many FSes; one
known troublemaker is LiveUpgrade. Jens Elkner published a wonderful set
of patches for Solaris 10 and OpenSolaris to limit LU's interest to just
the filesystems that the admin knows to be relevant for the OS upgrade
(they also fix the mount order and other known bugs of that LU software
release):

* http://iws.cs.uni-magdeburg.de/~elkner/luc/lutrouble.html

True, 10000 FSes is not something I have seen myself, so some tools
(especially legacy ones) may break at the sheer number of mountpoints :)

One of my own tricks for cleaning up snapshots, e.g. to quickly relieve
pool space starvation, is to use parallel "zfs destroy" invocations like
this (note the ampersand):

# zfs list -t snapshot -r pond/export/home/user | \
    grep @zfs-auto-snap | awk '{print $1}' | \
    while read Z ; do zfs destroy "$Z" & done

This may spawn several thousand processes (if called for the root
dataset), but they often complete in just 1-2 minutes instead of the
hours a one-by-one series of calls would take; I guess this is because
many ZFS metadata operations are requested within a small timeframe and
get coalesced into a few big writes.
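If spawning several thousand background processes at once is a concern,
the same trick can be throttled. This is just an untested sketch (the
batch size of 100 is arbitrary), pausing for the current batch of
destroys to finish before launching the next one:

# zfs list -H -o name -t snapshot -r pond/export/home/user | \
    grep @zfs-auto-snap | \
    while read Z ; do
        zfs destroy "$Z" &                  # still destroy in parallel...
        N=`expr ${N:-0} + 1`
        [ `expr $N % 100` -eq 0 ] && wait   # ...but wait after every 100 jobs
    done

The last partial batch is simply left to finish on its own; the point is
only to keep the number of simultaneous "zfs destroy" processes bounded.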