Re: Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning

2013-07-31 Thread Mike Snitzer
On Mon, Jul 29 2013 at  2:49pm -0400,
Daniel P. Berrange  wrote:

> On Mon, Jul 29, 2013 at 02:38:23PM -0400, Ric Wheeler wrote:
> > On 07/29/2013 10:18 AM, Daniel P. Berrange wrote:
> > >On Mon, Jul 29, 2013 at 08:01:23AM -0600, Chris Murphy wrote:
> > >>On Jul 29, 2013, at 6:30 AM, "Daniel P. Berrange"  wrote:
> > >>
> > >>>Yep, we need to be able to report free space on filesystems, so that
> > >>>apps provisioning virtual machines can get an idea of how much storage
> > >>>they can provide to VMs without risk of overcommitting.
> > >>>
> > >>>I agree that we really want the kernel, or at least a reusable shared
> > >>>library, to provide some kind of interface to determine this, rather
> > >>>than requiring every userspace app which cares to re-invent the wheel.
> > >>What does it mean for an app to use stat to get free space, and then
> > >>proceed to create too big a VM image in a directory that has a quota
> > >>set? I still think apps are asking an inappropriate/unqualified question
> > >>by asking for volume free space, instead of what's available to them for
> > >>a specified path.
> > > From an API POV, libvirt doesn't need/care about the free space on the
> > >volume underlying the filesystem. We actually only care about the free
> > >space in a given directory that we're using for disk images. It just
> > >happens that we implement this using statvfs() currently. So when I
> > >ask for an API above, don't take this to mean I want a statvfs() that
> > >knows about sparse volumes. An API or syscall that provides free space
> > >for individual directories is fine with me.
> > >
> >
> > Just another note, it is never safe to assume that storage under any
> > file system is yours for the taking.
> > 
> > If application A does a stat or statvfs() call, sees 1GB of space
> > left and then does a write, we could easily lose that race to any
> > other application.
> 
> This race doesn't matter from libvirt's POV. It is just providing a
> mechanism via its API. It is up to the management application using
> libvirt to make use of the mechanism to provide a usage policy.
> Their usage scenario may well enable them to make certain assumptions
> about the storage that you could not otherwise make in a race-free
> manner.
> 
> In addition, even in more general-purpose usage scenarios, it does
> not necessarily matter if there is a race, because there can be a
> second line of defence. For example, KVM can be set to pause the VM
> upon ENOSPC errors, giving the management application or administrator
> the chance to expand the capacity of the underlying storage and then
> unpause the guest. In that case checking the free space is mostly just
> a sanity check which serves to avoid hitting the pause-on-ENOSPC
> scenario too frequently.

Running out of free space _should_ be extremely rare.  A properly
configured dm-thin pool will have adequate free space, with an
appropriate low water mark that gives admins ample time to extend it
(even if a human has to do it).  And lvm2 has support for autoextending
the thin-pool using free space in the parent volume group.
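
For reference, the lvm2 autoextend behavior mentioned here is controlled from the activation section of lvm.conf; a sketch (the threshold/percent values are illustrative, not recommendations):

```
# /etc/lvm/lvm.conf (excerpt)
activation {
    # Autoextend a thin pool once it is 70% full...
    thin_pool_autoextend_threshold = 70
    # ...growing it by 20% each time, using free space
    # from the parent volume group.
    thin_pool_autoextend_percent = 20
}
```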

But I'm just talking about the not-really-chicken solution of leaning on
a properly configured system (either by admins in a data center or by
Fedora developers with sane defaults).

As an aside, this extra free space checking that KVM is doing is really
broken by design (polling sucks -- especially if this polling is
happening in the host for each guest).  It would be much better to
leverage something like lvm2 with a custom dmeventd plugin that fires
when it receives the low-watermark and/or -ENOSPC event.

Thinly provisioned volumes offer the prospect of doing away with this
polling -- as such, proper dm-thin integration has been on the virt
roadmap for a while.  It just never seems to happen.

Mike
-- 
devel mailing list
devel@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/devel
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct

2013-07-31 Thread Mike Snitzer
On Mon, Jul 29 2013 at  2:48pm -0400,
Eric Sandeen  wrote:

> On 7/27/13 11:56 AM, Lennart Poettering wrote:
> > On Fri, 26.07.13 22:13, Miloslav Trmač (mitr at volny.cz) wrote:
> > 
> >> Hello all,
> >> with thin provisioning available, the total and free space values
> >> reported by a filesystem do not necessarily mean that that much space
> >> is _actually_ available (the actual backing storage may be smaller, or
> >> shared with other filesystems).
> >>
> >> If your package reports disk space usage to users, and bases this on
> >> filesystem free space, please consider whether it might need to take
> >> LVM thin provisioning into account.
> >>
> >> The same applies if your package automatically allocates a certain
> >> proportion of the total or available space.
> >>
> >> A quick way to check whether your package is likely to be affected is
> >> to look for statfs() or statvfs() calls in C, or the equivalent in
> >> your higher-level library / programming language.
> > 
> > Well, I am pretty sure the burden must be on the file systems to report
> > a useful estimated free-blocks value in statfs()/statvfs(). Exporting
> > that problem to userspace and expecting userspace to work around it is
> > just wrong. In fact, this would be quite an API breakage if applications
> > cannot rely on the returned value being at least a rough estimate of how
> > much data can be stored on disk.
> > 
> > journald will scale how much of /var/log/journal it will use based on
> > the file system size and free level. It will also modulate the
> > per-service rate-limit levels based on the amount of free disk space. If
> > you break the API of statfs()/statvfs(), then you will end up breaking
> > this and all programs like it.
> 
> Any program needs to be prepared for ENOSPC; as Ric mentioned elsewhere,
> until you successfully write to it, it's not yours! :)  (Ok, thinp
> running out of space won't generate ENOSPC today, either, but you see
> my general point...)
> 
> And how much space are we really talking about here?  If you're running
> thin-provisioning on thin margins, especially w/o some way to automatically 
> hot-add storage, you're probably doing it wrong.
> 
> (And if journald sees "100T free" and decides it can use 50T of that,
> it's doing it wrong, too) ;)
> 
> The truth is somewhere in the middle, but quibbling over whether this
> app or that can claim a bit of space behind a thin-provisioned volume
> probably isn't useful.

Right, so picking up on what we've discussed: we could add the ability to
have fallocate() propagate to the underlying storage via a new
REQ_RESERVE bio (if the storage opts in, which dm-thinp could).  This bio
would be the reciprocal of discard -- thus enabling the caller to
efficiently reserve space in the underlying storage (e.g. the
dm-thin-pool).  So volumes or apps (e.g. journald) that _expect_ to have
fully-provisioned space from thinp could have it.

This would also allow for a hybrid setup where the thin-pool is
configured to use a smaller block size, to benefit taking many snapshots
-- but then allows select apps and/or volumes to reserve contiguous
space from the thin-pool.  It obviously also offers the other
traditional fallocate() benefits (reserving large contiguous space for
performance, etc).

I'll draft an RFC patch or two for LKML... it may take some time for me
to get to it, but I can make it a higher priority if others have serious
interest.

> The admin definitely needs tools to see the state of thinly provisioned
> storage, but that's the admin's job to worry about, not the app's, IMHO.

Yeah, in a data center the admin really should be all over these thinp
concerns, making them a non-issue.  But on the desktop the Fedora
developers need to provide sane policy/defaults.

Mike

2013-07-31 Thread Mike Snitzer
On Wed, Jul 31 2013 at  5:52am -0400,
Zdenek Kabelac  wrote:

> Dne 31.7.2013 10:39, Florian Weimer napsal(a):
> >On 07/29/2013 08:38 PM, Ric Wheeler wrote:
> >
> >>If application A does a stat or statvfs() call, sees 1GB of space left
> >>and then does a write, we could easily lose that race to any other
> >>application.
> >>
> >>If you want to reserve space, you need to grab the space yourself
> >>(always works with a large "write()" but preallocation can also help
> >>without dm-thin).
> >
> >In order to have it work "always", you'll have to write unpredictable data.
> >If you write just zeros, the reservation isn't guaranteed if the file system
> >supports compression.
> >
> >I'm pretty sure we want a crass layering violation for this one (probably a
> >new mode flag for fallocate), to ensure proper storage reservation for things
> >like VM images.
> 
> 
> If someone wants to use preallocation - thus always allocating the
> whole space - then there is no reason to use thinly provisioned
> devices, unless they want to use the snapshot feature (for which
> something like the creation of a fully provisioned device could
> probably be introduced) - but then you end up with the same problem
> once you start to use snapshots.
> 
> For me this rather looks like a misuse of thin provisioning.
> 
> ThinP should be configured in a way that the admin is able to extend
> the pool to honour the promised space if really needed. It's not a
> good idea to provision 1EB if you have at most a 1TB disk and then
> expect you will have no problems when someone fallocate()s 500TB.

fallocate() doesn't allow you to reserve more physical space than you
have (or are allowed via quota).
 
> I.e. if someone is using an iSCSI disc array with 'hw' thin
> provisioning support, there is no SCSI command to provision space -
> it's the admin's work to ensure there is enough disc space to keep up
> with user demands.
> 
> Maybe - just an idea - there could be a kernel bit-flag somewhere
> which might tell whether the device used by the fs is 'fully
> provisioned' or 'thin provisioned' (something like
> rotational/non-rotational).  But there is no way to return information
> about free disc space - since it's a highly subjective value and,
> moreover, very expensive to calculate.

If things like journald _need_ a sysfs flag that denotes that the
volume being written to is thinp, then I'd like to understand what they
would do differently knowing that info.  Would they conditionally call
fallocate() -- assuming dm-thinp grows REQ_RESERVE support like I
mentioned in my previous post?

I see little value in exposing whether some portion of the storage stack
is thin or not.  What is an app to do with that info?  It'd have to do
things like: 1) determine the block device the FS is layered on, then 2)
look up the sysfs file for that device... but a filesystem can span
multiple devices -- sometimes it does, sometimes it doesn't.  It is just
a rat's nest.

Thinly provisioned storage isn't exactly a new concept.  But the
Linux-provided target obviously engages other parts of the OS to
properly support it (at a minimum the volume manager and the
installer).  If the fallocate()-triggered REQ_RESERVE passdown to the
underlying storage provides a reasonable stopgap, we can really explore
it -- at least we'd be piggybacking on an established interface that
returns success or failure.

Mike

2013-07-31 Thread Mike Snitzer
On Wed, Jul 31 2013 at  1:08pm -0400,
Chris Murphy  wrote:

> 
> On Jul 31, 2013, at 8:32 AM, Mike Snitzer  wrote:
> 
> > But on the desktop the fedora
> > developers need to provide sane policy/defaults.
> 
> Right. And the concern I have (other than a blatant bug) is the F20
> feature for the installer to create thinp LVs; and to do that the
> installer needs to know what sane default parameters are. I think
> perhaps determining those defaults is non-obvious because of my
> experience in this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=984236

Hmm, certainly a strange one.  But some bugs can be.  Did you ever look
to see if CONFIG_DM_DEBUG_BLOCK_STACK_TRACING is enabled?  It wouldn't
explain any dmeventd memleak issues, but it could help explain slowness
associated with mkfs.btrfs on top of thinp.  Anyway, to be continued in
the BZ...

> If I'm going to use thinP mostly for snapshots, then that suggests a
> smaller chunk size at thin pool creation time; whereas if I have no
> need for snapshots but a greater need for provisioning then a larger
> chunk size is better. And asking usage context in the installer, I
> think is a problem.

It is certainly less than ideal, but we haven't come up with an
alternative yet.  As Zdenek mentioned in comment#13 of the BZ you
referenced, what we're looking to do is establish default profiles for
at least these 2 use-cases you mentioned.  lvm2 has recently grown
profile support.  We just need to come to terms with what constitutes
sufficiently small and sufficiently large thinp block sizes.

We're doing work to zero in on the best defaults... so ultimately this
is still up in the air.

But my current thinking for these 2 profiles is:
* for performance, use data device's optimal_io_size if > 64K.
  - this will yield a thinp block_size that is a full stripe on RAID[56]
* for snapshots, use data device's minimum_io_size if > 64K.

If/when we have the kernel REQ_RESERVE support to prealloc space in the
thin-pool, it _could_ be that we make the snapshots profile the default;
and anything that wanted more performance could use fallocate().  But it
is a slippery slope, because many apps could overcompensate and always
use fallocate()... we really don't want that.  So some form of quota
might need to be enforced at the cgroup level (once a cgroup's
reservation quota is exceeded, fallocate()'s REQ_RESERVE bio passdown
will be skipped).  Grafting cgroup-based policy into DM is a whole other
can of worms, but doable.

Open to other ideas...

Mike

2013-07-31 Thread Mike Snitzer
On Wed, Jul 31 2013 at  2:38pm -0400,
Eric Sandeen  wrote:

> On 7/31/13 12:08 PM, Chris Murphy wrote:
> > 
> > On Jul 31, 2013, at 8:32 AM, Mike Snitzer 
> > wrote:
> > 
> >> But on the desktop the fedora developers need to provide sane
> >> policy/defaults.
> > 
> > Right. And the concern I have (other than a blatant bug) is the F20
> > feature for the installer to create thinp LVs; and to do that the
> > installer needs to know what sane default parameters are. I think
> > perhaps determining those defaults is non-obvious because of my
> > experience in this bug: 
> > https://bugzilla.redhat.com/show_bug.cgi?id=984236
> > 
> > If I'm going to use thinP mostly for snapshots, then that suggests a
> > smaller chunk size at thin pool creation time; whereas if I have no
> > need for snapshots but a greater need for provisioning then a larger
> > chunk size is better. And asking usage context in the installer, I
> > think is a problem.
> 
> Quite some time ago I had asked whether we could get the allocation-tracking
> snapshot niceties from dm-thinp, without actually needing it to be thin.
> 
> i.e. if you only want the efficient snapshots, a way to fully-provision
> a "thinp" device.  I'm still not sure if this is possible...?

TBD, we could add a "sandeen_makes_thinp_his_bitch" param and, if
specified (likely for the entire pool, but we'll see), it would mean thin
volumes allocating from the pool would have their logical address space
reserved to be completely contiguous on creation (with all thin blocks
flagged in metadata as RESERVED).

The actual thin block allocation (zeroing of blocks on first write if
configured, etc.) transitions the metadata's block from RESERVED to
PROVISIONED.  It is not yet clear to me that the DM thinp code can be
easily adapted to make the thin block allocation two-staged.

But it would seem to be a prereq for dm-thinp's REQ_RESERVE support.
I'll check with Joe (cc'd) and come back with his dose of reality ;)

> I guess I'm pretty nervous about offering actual thin-provisioned
> storage to "average" Fedora users.  I'm having nightmares about the "bug"
> reports already, just based on the likelihood of most users
> misunderstanding the feature and its requirements & expected behavior...

Heh, you shouldn't be nervous.  You can just punt said bugs over the
fence, right? ;)

Mike
-- 
devel mailing list
devel@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/devel
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct

2013-08-01 Thread Mike Snitzer
On Wed, Jul 31 2013 at  5:53pm -0400,
Chris Murphy  wrote:

> 
> On Jul 31, 2013, at 12:38 PM, Eric Sandeen  wrote:
> 
> > 
> > i.e. if you only want the efficient snapshots, a way to fully-provision
> > a "thinp" device.  I'm still not sure if this is possible…?
> 
> […]
> 
> > 
> > I guess I'm pretty nervous about offering actual thin provisioned
> > storage to "average" Fedora users.  I'm having nightmares about the "bug"
> > reports already, just based on the likelihood of most users misunderstanding
> > the feature and its requirements & expected behavior…
> 
> So possibly the installer should be conservative about how thin the
> provisioning is;

We (David Lehman, myself, and others on our respective teams) decided
some months ago that any thin LVs that anaconda establishes will _not_
oversubscribe the thin-pool.

And in fact a reserve of free space will be kept in the thin-pool as
well as the parent VG.

> otherwise I'm imagining inadequately provisioned thinp LV, while also
> using the rollback feature [1].

Can you elaborate?  Rollback with LVM thin provisioning doesn't require
any additional space in the pool.  It is a simple matter of swapping the
internal device_ids that the thin-pool uses as an index to access the
corresponding thin volumes.  This is done when activating the thin
volumes.

LVM2's support for thinp snapshot merge (aka rollback) is still pending,
but RFC patches have been published via this BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=957881
 
> [1] https://fedoraproject.org/wiki/Changes/Rollback

The Rollback project authors have been having periodic concalls with
David Lehman, myself and others.  So we are relatively coordinated ;)

Mike