On 2012-10-23 20:06, Jim Klimov wrote:
2012-10-23 19:53, Robin Axelsson wrote:
That sounds like a good point, unless you first scan for hard links and
avoid touching the files and their hard links in the shell script, I guess.

I guess the idea about reading into memory and writing back into the same file (or "cat $SRC > /var/tmp/$SRC && cat /var/tmp/$SRC > $SRC"
to be on the safer side) should take care of hardlinks, since the
inode would stay the same. You should of course ensure that nobody
uses the file in question (i.e. databases are down, etc). You can
also keep track of "rebalanced" inode numbers to avoid processing
hardlinked files more than once.
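
A minimal sketch of the rewrite-in-place idea above, with inode tracking so that hardlinked files are processed only once. The function names and the scratch directory are made up for illustration, and this assumes nobody else has the files open while it runs:

```python
import os
import shutil
import stat
import tempfile


def rewrite_in_place(path, scratch_dir):
    """Copy `path` to a scratch file, then write the bytes back into the
    same file, so the inode (and every hard link to it) stays the same."""
    fd, scratch = tempfile.mkstemp(dir=scratch_dir)
    os.close(fd)
    try:
        shutil.copyfile(path, scratch)            # like: cat $SRC > scratch
        with open(scratch, "rb") as src, open(path, "r+b") as dst:
            shutil.copyfileobj(src, dst)          # like: cat scratch > $SRC
            dst.truncate()                        # length is unchanged; be explicit
            dst.flush()
            os.fsync(dst.fileno())
    finally:
        os.unlink(scratch)


def rebalance_tree(root, scratch_dir):
    """Rewrite each regular file under `root` once per inode.
    Returns the number of files actually rewritten."""
    seen = set()          # (device, inode) pairs already processed
    rewritten = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue                          # skip symlinks, devices, etc.
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue                          # another name for a file already done
            seen.add(key)
            rewrite_in_place(path, scratch_dir)
            rewritten += 1
    return rewritten
```

The scratch directory is best kept outside the tree being walked and needs room for the largest file. Note that modification times change with this approach.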

ZFS send/recv should also take care of these things, and with
sufficient space in the pool to ensure "even" writes (i.e. just
after expansion with new VDEVs) it can be done within the pool if
you don't have a spare one. Then you can ensure all needed "local"
dataset properties are transferred, remove the old dataset and
rename the new copy to its name (likewise for hierarchies of
datasets).
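
The send/recv procedure above could be laid out roughly as follows. This is only a dry-run sketch that builds the command lines as strings and runs nothing; the dataset name `tank/data`, the snapshot name and the `.new` suffix are made-up examples, and the destroy step is irreversible, so the plan should be reviewed by hand first:

```python
def plan_send_recv(dataset, snap="rebalance", suffix=".new"):
    """Return the zfs command lines (as strings) for rewriting `dataset`
    inside its own pool via send/receive. Nothing is executed here."""
    copy = dataset + suffix
    return [
        # Recursive snapshot of the dataset and its children.
        f"zfs snapshot -r {dataset}@{snap}",
        # Replication stream (-R carries properties and child datasets);
        # -u keeps the received copy unmounted so mountpoints do not clash.
        f"zfs send -R {dataset}@{snap} | zfs receive -u {copy}",
        # Compare locally set properties on the original and the copy.
        f"zfs get -r -s local all {dataset}",
        f"zfs get -r -s local all {copy}",
        # Only after verifying the copy: drop the original, rename the copy.
        f"zfs destroy -r {dataset}",
        f"zfs rename {copy} {dataset}",
    ]


if __name__ == "__main__":
    for line in plan_send_recv("tank/data"):   # hypothetical dataset name
        print(line)
```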

But if I do send/receive to the same pool I will need to have enough free space in it to fit at least two copies of the dataset I want to reallocate.

But I heard that a pool that is almost full has some performance
issues, especially when you try to delete files from that pool. But
maybe this becomes a non-issue once the pool is expanded by another vdev.

This issue may remain - basically, when a pool is nearly full (YMMV,
empirically over 80-90% for pools with many write-delete cycles,
but there were reports of even 60% full being a problem), its block
allocation may look like good cheese with many tiny holes. Walking
the free space to find a hole big enough to write a new block takes
time, hence the slowdown. When you expand the pool with a new vdev,
the old full cheesy one does not go away, and writes that the ZFS
pipeline intended to put there would still lag (and may now time out and
may get to another vdev, as someone else mentioned in this thread).
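
To illustrate the "cheese with many tiny holes" effect, here is a toy first-fit search over a free-slot map. It is not how ZFS actually allocates (ZFS works with metaslabs and space maps); the slot map and function names are invented purely to show why walking fragmented free space costs more probes:

```python
def first_fit(free_map, size):
    """Find the first run of `size` consecutive free slots.
    `free_map` is a list of booleans (True = free).
    Returns (start_index, slots_examined), or (None, slots_examined)."""
    run = 0
    for i, is_free in enumerate(free_map):
        run = run + 1 if is_free else 0
        if run == size:
            return i - size + 1, i + 1
    return None, len(free_map)


def swiss_cheese(n_slots, hole_every):
    """A nearly full region with a single free slot every `hole_every` slots."""
    return [i % hole_every == 0 for i in range(n_slots)]


if __name__ == "__main__":
    old_vdev = swiss_cheese(1000, 10) + [True] * 8   # 90% full, one 8-slot gap at the end
    new_vdev = [True] * 1000                         # freshly added, empty
    print(first_fit(old_vdev, 8))   # (1000, 1008): walks the whole cheesy region first
    print(first_fit(new_vdev, 8))   # (0, 8): fits immediately
```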


It seems like what zfs is missing here is a good defrag tool.


To answer your other letters,

> But if I have two raidz3 vdevs, is there any way to create an
> isolation/separation between them so that if one of them fails, only the
> data that is stored within that vdev will be lost and all data that
> happen to be stored in the other can be recovered? And yet let them both
> be accessible from the same path?
>
> The only thing that needs to be sorted out is where the files should go
> when you write to that path and avoid splitting such that one half of
> the file goes to one vdev and another goes to the other vdev. Maybe
> there is some disk or i/o scheduler that can handle such operations?

You can't do that. A pool is one whole (you also can't remove vdevs
from it and you can't change or reduce raidzN groups' redundancy -
maybe that will change after the long-awaited BPR = block-pointer
rewriter is implemented by some kind samaritan), and as soon as it
is set up or expanded all writes go striped to all components and
all top-level components are required not-failed to import the pool
and use it.

It would be interesting to know how you would convert a raidz2 stripe to, say, a raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra parity drive by converting it to a raidz3 pool. I imagine that would be like creating a raidz1 pool on top of the leaf vdevs that constitute the raidz2 pool plus the new leaf vdev, which results in an additional parity drive. It doesn't sound too difficult to do that. Actually, this way you could even get raidz4 or raidz5 pools. The question, though, is how things would pan out performance-wise; I would imagine that a 55 drive raidz25 pool is really taxing on the CPU.

Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a no-brainer; you just remove one drive from the pool and force zpool to accept the new state as "normal".

But expanding a raidz pool with additional storage while preserving the parity structure sounds a little bit trickier. I don't think I have the knowledge to write a BPR, although I'm reading Solaris Internals right now ;)

> I can't see how a dataset can span over several zpools as you usually
> create it with mypool/datasetname (in the case of a file system
> dataset). But I can see several datasets in one pool though (e.g.
> mypool/dataset1, mypool/dataset2 ...). So the relationship I see is pool
> *onto* dataset.

It can't. A dataset is contained in one pool. Many datasets can
be contained in one pool and share the free space, dedup table and
maybe some other resources. Datasets contained in different pools
are unrelated.

> But if I have two separate pools with separate names, say mypool1 and
> mypool2 I could create a zfs file system dataset with the same name in
> each of these pools and then give these two datasets the same
> "mountpoint" property couldn't I? Then they would be forced to be
> mounted to the same path.

One at a time - yes. Both at once (in a useful manner) - no.
If the mountpoint is not empty, zfs refuses to mount the dataset.
Even if you force it to (using an overlay mount, -O), the last mounted
dataset's filesystem will be all you'd see.

You can however mount other datasets into logical "subdirectories"
of the dataset you need to "expand", but those subs must be empty
or nonexistent in your currently existing "parent" dataset. Also
the new "children" are separate filesystems, so it is up to you
to move data into them if you need to free up the existing dataset,
and in particular remember that inodes of different filesystems
are unrelated, so hardlinks will break for those files that would
be forced to split from one inode in the source filesystem to
several inodes (i.e. some pathnames in the source FS and some in
the child) - like for any other FS boundary crossings.
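
Before moving a subdirectory into a child dataset, one could check which hardlink groups would be split by the new filesystem boundary. A rough sketch, assuming the tree is still one filesystem; the function name is invented for illustration, and links whose other names live outside `root` are not detected:

```python
import os
import stat
from collections import defaultdict


def links_split_by(root, subdir):
    """Return groups of paths under `root` that share one inode but lie on
    both sides of `subdir` (the directory that would become a child dataset)."""
    root = os.path.abspath(root)
    subdir = os.path.abspath(subdir)
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Only regular files with more than one name are of interest.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    split = []
    for paths in groups.values():
        inside = [p for p in paths if os.path.commonpath([p, subdir]) == subdir]
        # Some names inside the future child and some outside: link would break.
        if inside and len(inside) < len(paths):
            split.append(sorted(paths))
    return sorted(split)
```

Any group this reports would end up as separate copies (separate inodes) after the move.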

* Can several datasets be mounted to the same mount point, i.e. can multiple "file system"-datasets be mounted so that they (the root of them) are all accessed from exactly the same (POSIX) path and subdirectories with coinciding names will be merged? The purpose of this would be to seamlessly expand storage capacity this way just like when adding vdevs to a pool.

What you describe here is known as unionfs in Linux, among others.
I think there were RFEs or otherwise expressed desires to make that
in Solaris and later illumos (I did campaign for that some time ago),
but AFAIK this was not yet done by anyone.

YES, UnionFS-like functionality is what I was talking about. It seems like it has been abandoned in favor of AuFS in the Linux and the BSD world. It seems to have functions that are a little overkill to use with zfs, such as copy-on-write. Perhaps a simpler implementation of it would be more suitable for zfs.

Perhaps a similar functionality can be established through an abstraction layer behind network shares.

In Windows this functionality is called 'disk pooling', btw.

* If that's the case, how will the data be distributed/allocated over the datasets if I copy a data file to that path?
N/A.


HTH,
//Jim

_______________________________________________
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
