Re: Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning
On Mon, Jul 29 2013 at 2:49pm -0400,
Daniel P. Berrange wrote:

> On Mon, Jul 29, 2013 at 02:38:23PM -0400, Ric Wheeler wrote:
> > On 07/29/2013 10:18 AM, Daniel P. Berrange wrote:
> > > On Mon, Jul 29, 2013 at 08:01:23AM -0600, Chris Murphy wrote:
> > > > On Jul 29, 2013, at 6:30 AM, "Daniel P. Berrange" <berrange at redhat.com> wrote:
> > > > >
> > > > > Yep, we need to be able to report free space on filesystems, so that
> > > > > apps provisioning virtual machines can get an idea of how much storage
> > > > > they can provide to VMs without risk of overcommitting.
> > > > >
> > > > > I agree that we really want the kernel, or at least a reusable shared
> > > > > library, to provide some kind of interface to determine this, rather
> > > > > than requiring every userspace app which cares to re-invent the wheel.
> > > >
> > > > What does it mean for an app to use stat to get free space, and then
> > > > proceed to create too big a VM image in a directory that has a quota
> > > > set? I still think apps are asking an inappropriate/unqualified question
> > > > by asking for volume free space, instead of what's available to them for
> > > > a specified path.
> > >
> > > From an API POV, libvirt doesn't need/care about the free space on the
> > > volume underlying the filesystem. We actually only care about the free
> > > space in a given directory that we're using for disk images. It just
> > > happens that we implement this using statvfs() currently. So when I
> > > ask for an API above, don't take this to mean I want a statvfs() that
> > > knows about sparse volumes. An API or syscall that provides free space
> > > for individual directories is fine with me.
> >
> > Just another note: it is never safe to assume that storage under any
> > file system is yours for the taking.
> >
> > If application A does a stat or statvfs() call, sees 1GB of space
> > left and then does a write, we could easily lose that race to any
> > other application.
>
> This race doesn't matter from libvirt's POV. It is just providing a
> mechanism via its API. It is up to the management application using
> libvirt to make use of the mechanism to provide a usage policy.
> Their usage scenario may well enable them to make certain assumptions
> about the storage that you could not otherwise make in a race-free
> manner.
>
> In addition, even in more general-purpose usage scenarios, it does
> not necessarily matter if there is a race, because there can be a
> second line of defence. For example, KVM can be set to pause the VM
> upon ENOSPC errors, giving the management application or administrator
> the chance to expand the capacity of the underlying storage and then
> unpause the guest. In that case, checking the free space is mostly just
> a sanity check which serves to avoid hitting the pause-on-ENOSPC
> scenario too frequently.

Running out of free space _should_ be extremely rare. A properly
configured dm-thin pool will have adequate free space and an appropriate
low water mark, giving admins ample time to extend it (even if a human
has to do it). And lvm2 has support to autoextend the thin-pool using
free space in the parent volume group. But I'm just talking about the
not-really-chicken solution of leaning on a properly configured system
(either by admins in a data center or by Fedora developers with sane
defaults).

As an aside, this extra free space checking that KVM is doing is really
broken by design (polling sucks -- especially if this polling is
happening in the host for each guest). It would be much better to
leverage something like lvm2 with a custom dmeventd plugin that fires
when it receives the low-watermark and/or -ENOSPC event. Thinly
provisioned volumes offer the prospect of doing away with this polling --
as such, proper dm-thin integration has been on the virt roadmap for a
while. It just never seems to happen.

Mike
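The autoextend behavior mentioned above is driven by a couple of
lvm.conf knobs; a minimal sketch, with illustrative values rather than
recommended defaults (dmeventd monitors the pool and grows it with free
extents from the parent VG once the threshold is crossed):

    # /etc/lvm/lvm.conf (illustrative values, not recommendations)
    activation {
        # let dmeventd monitor thin pools when they are activated
        monitoring = 1

        # once a pool is 80% full, autoextend it...
        thin_pool_autoextend_threshold = 80

        # ...by 20% of its current size, using free space
        # in the parent volume group
        thin_pool_autoextend_percent = 20
    }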
Re: Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning
On Mon, Jul 29 2013 at 2:48pm -0400,
Eric Sandeen wrote:

> On 7/27/13 11:56 AM, Lennart Poettering wrote:
> > On Fri, 26.07.13 22:13, Miloslav Trmač (mitr at volny.cz) wrote:
> > >
> > > Hello all,
> > > with thin provisioning available, the total and free space values
> > > reported by a filesystem do not necessarily mean that that much space
> > > is _actually_ available (the actual backing storage may be smaller, or
> > > shared with other filesystems).
> > >
> > > If your package reports disk space usage to users, and bases this on
> > > filesystem free space, please consider whether it might need to take
> > > LVM thin provisioning into account.
> > >
> > > The same applies if your package automatically allocates a certain
> > > proportion of the total or available space.
> > >
> > > A quick way to check whether your package is likely to be affected is
> > > to look for statfs() or statvfs() calls in C, or the equivalent in
> > > your higher-level library / programming language.
> >
> > Well, I am pretty sure the burden must be on the file systems to report
> > a useful estimate of the free-blocks value in statfs()/statvfs().
> > Exporting that problem to userspace and expecting userspace to work
> > around it is just wrong. In fact, it would be quite an API breakage if
> > applications cannot rely on the value returned being at least a rough
> > estimate of how much data can be stored on disk.
> >
> > journald will scale how much of /var/log/journal it uses based on the
> > file system's size and free level. It will also modulate the
> > per-service rate-limit levels based on the amount of free disk space.
> > If you break the API of statfs()/statvfs(), then you will end up
> > breaking this and all programs like it.
>
> Any program needs to be prepared for ENOSPC; as Ric mentioned elsewhere,
> until you successfully write to it, it's not yours! :) (Ok, thinp
> running out of space won't generate ENOSPC today, either, but you see
> my general point...)
>
> And how much space are we really talking about here? If you're running
> thin provisioning on thin margins, especially w/o some way to
> automatically hot-add storage, you're probably doing it wrong.
>
> (And if journald sees "100T free" and decides it can use 50T of that,
> it's doing it wrong, too) ;)
>
> The truth is somewhere in the middle, but quibbling over whether this
> app or that can claim a bit of space behind a thin-provisioned volume
> probably isn't useful.

Right, so picking up on what we've discussed: add the ability to have
fallocate() propagate to the underlying storage via a new REQ_RESERVE
bio (if the storage opts in, which dm-thinp could). This bio would be
the reciprocal of discard -- enabling the caller to efficiently reserve
space in the underlying storage (e.g. the dm-thin pool). So volumes or
apps (e.g. journald) that _expect_ to have fully provisioned space from
thinp could have it.

This would also allow for a hybrid setup where the thin-pool is
configured with a smaller block size to benefit from taking many
snapshots -- while still allowing select apps and/or volumes to reserve
contiguous space from the thin-pool. It obviously also offers the other
traditional fallocate() benefits (reserving large contiguous space for
performance, etc).

I'll draft an RFC patch or 2 for LKML... it may take some time for me to
get to it, but I can make it a higher priority if others have serious
interest.

> The admin definitely needs tools to see the state of thinly provisioned
> storage, but that's the admin's job to worry about, not the app's, IMHO.

Yeah, in a data center the admin really should be all over these thinp
concerns, making them a non-issue. But on the desktop the Fedora
developers need to provide sane policy/defaults.

Mike
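To make the distinction above concrete, here is a minimal C sketch
(the directory and image name are hypothetical): statvfs() only yields
a racy estimate -- one that thinp can overstate -- whereas fallocate()
actually claims blocks from the filesystem and fails cleanly if they
aren't there. Today that reservation stops at the filesystem; pushing
it down into the thin-pool is exactly what the proposed REQ_RESERVE bio
would add.

    /* reserve_img.c -- sketch: advisory free-space check, then a real
     * reservation.
     * Build: cc -D_FILE_OFFSET_BITS=64 -o reserve_img reserve_img.c */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    int main(void)
    {
        const char *dir = "/var/lib/libvirt/images";    /* hypothetical */
        const off_t want = 10LL * 1024 * 1024 * 1024;   /* a 10G image */
        struct statvfs vfs;

        /* Step 1: advisory only -- racy, and on a thin volume it reflects
         * the filesystem's size, not what the pool can actually back. */
        if (statvfs(dir, &vfs) != 0) {
            perror("statvfs");
            return 1;
        }
        printf("advisory free space: %llu bytes\n",
               (unsigned long long)vfs.f_bavail * vfs.f_frsize);

        /* Step 2: the real reservation -- succeeds or fails up front.
         * Note this reserves blocks in the filesystem only; nothing is
         * propagated to an underlying dm-thin pool today. */
        int fd = open("/var/lib/libvirt/images/guest.img",
                      O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (posix_fallocate(fd, 0, want) != 0) {
            fprintf(stderr, "could not reserve %lld bytes\n",
                    (long long)want);
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }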
Re: Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning
On Wed, Jul 31 2013 at 5:52am -0400,
Zdenek Kabelac wrote:

> On 31.7.2013 10:39, Florian Weimer wrote:
> > On 07/29/2013 08:38 PM, Ric Wheeler wrote:
> > >
> > > If application A does a stat or statvfs() call, sees 1GB of space left
> > > and then does a write, we could easily lose that race to any other
> > > application.
> > >
> > > If you want to reserve space, you need to grab the space yourself
> > > (always works with a large write(), but preallocation can also help
> > > without dm-thin).
> >
> > In order to have it work "always", you'll have to write unpredictable
> > data. If you write just zeros, the reservation isn't guaranteed if the
> > file system supports compression.
> >
> > I'm pretty sure we want a crass layering violation for this one
> > (probably a new mode flag for fallocate), to ensure proper storage
> > reservation for things like VM images.
>
> If someone wants to use preallocation -- thus always allocating the
> whole space -- then there is no reason to use provisioned devices,
> unless they want to use the snapshot feature (for which there could
> probably be introduced something like creation of a fully provisioned
> device) -- but then you end up with the same problem once you start to
> use snapshots.
>
> For me this rather looks like misuse of thin provisioning.
>
> ThinP should be configured in a way that the admin is able to extend
> the pool to honour promised space if really needed. It's not a good
> idea to provision 1EB if you have at most just a 1TB disk and then
> expect to have no problems when someone fallocate()s 500TB.

fallocate() doesn't allow you to reserve more physical space than you
have (or are allowed via quota).

> I.e. if someone is using an iSCSI disc array with 'hw' thin
> provisioning support, there is no SCSI command to provision space --
> it's the admin's work to ensure there is enough disc space to keep up
> with user demands.
>
> Maybe -- just an idea -- there could be a kernel bit-flag somewhere
> which might tell whether the device used by the fs is 'fully
> provisioned' or 'thin provisioned' (something like
> rotational/non-rotational). But there is no way to return information
> about free disc space -- since it's a highly subjective value and
> moreover very expensive to calculate.

If things like journald _need_ a sysfs flag that denotes that the volume
they are writing to is thinp, then I'd like to understand what they'd do
differently knowing that info. Would they conditionally call
fallocate() -- assuming dm-thinp grows REQ_RESERVE support like I
mentioned in my previous post?

I see little value in exposing whether some portion of the storage stack
is thin or not. What is an app to do with that info? It'd have to do
things like: 1) determine the block device the FS is layered on, and
2) look up the sysfs file for that device... but a filesystem can span
multiple devices -- sometimes it does, sometimes it doesn't. It is just
a rat's nest.

Thinly provisioned storage isn't exactly a new concept. But the
Linux-provided target obviously engages other parts of the OS to
properly support it (at a minimum the volume manager and the installer).
If the fallocate()-triggered REQ_RESERVE passdown to the underlying
storage provides a reasonable stopgap, we should really explore it -- at
least we'd be piggybacking on an established interface that returns
success or failure.

Mike
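To illustrate the rat's nest: even the "easy" first step -- finding the
block device under a filesystem path -- looks something like the
hypothetical sketch below, and it already breaks down for btrfs spanning
several devices, stacked dm/md, NFS, tmpfs, and so on.

    /* whichdev.c -- sketch: map a path to a single backing block device.
     * Works only for simple single-device filesystems; that's the point. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        char link[64], target[256];
        ssize_t n;

        if (argc != 2 || stat(argv[1], &st) != 0) {
            fprintf(stderr, "usage: whichdev <path>\n");
            return 1;
        }

        /* st_dev identifies the device holding the inode -- for simple,
         * single-device filesystems.  btrfs/NFS/tmpfs report values with
         * no useful /sys/dev/block entry. */
        snprintf(link, sizeof(link), "/sys/dev/block/%u:%u",
                 major(st.st_dev), minor(st.st_dev));

        n = readlink(link, target, sizeof(target) - 1);
        if (n < 0) {
            fprintf(stderr, "%s: no block device behind %u:%u\n",
                    argv[1], major(st.st_dev), minor(st.st_dev));
            return 1;
        }
        target[n] = '\0';

        /* For dm devices this resolves to something like
         * ../../devices/virtual/block/dm-3 -- and the app would *still*
         * have to walk the dm stack to learn whether a thin-pool sits
         * anywhere underneath. */
        printf("%s -> %s\n", argv[1], target);
        return 0;
    }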
Re: Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning
On Wed, Jul 31 2013 at 1:08pm -0400,
Chris Murphy wrote:

> On Jul 31, 2013, at 8:32 AM, Mike Snitzer wrote:
> >
> > But on the desktop the Fedora developers need to provide sane
> > policy/defaults.
>
> Right. And the concern I have (other than a blatant bug) is the F20
> feature for the installer to create thinp LVs; and to do that the
> installer needs to know what sane default parameters are. I think
> perhaps determining those defaults is non-obvious, because of my
> experience in this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=984236

Hmm, certainly a strange one. But some bugs can be. Did you ever look to
see if CONFIG_DM_DEBUG_BLOCK_STACK_TRACING is enabled? It wouldn't
explain any dmeventd memleak issues, but it could help explain the
slowness associated with mkfs.btrfs on top of thinp. Anyway, to be
continued in the BZ...

> If I'm going to use thinp mostly for snapshots, then that suggests a
> smaller chunk size at thin-pool creation time; whereas if I have no
> need for snapshots but a greater need for provisioning, then a larger
> chunk size is better. And asking for usage context in the installer is,
> I think, a problem.

It is certainly less than ideal, but we haven't come up with an
alternative yet. As Zdenek mentioned in comment#13 of the BZ you
referenced, what we're looking to do is establish default profiles for
at least these 2 use-cases you mentioned; lvm2 has recently grown
profile support. We just need to come to terms with what constitutes
sufficiently small and sufficiently large thinp block sizes. We're doing
work to zero in on the best defaults... so ultimately this is still up
in the air. But my current thinking for these 2 profiles is:

* for performance, use the data device's optimal_io_size if > 64K.
  - this will yield a thinp block_size that is a full stripe on RAID[56]
* for snapshots, use the data device's minimum_io_size if > 64K.

If/when we have the kernel REQ_RESERVE support to prealloc space in the
thin-pool, it _could_ be that we make the snapshots profile the default;
anything that wanted more performance could then use fallocate(). But it
is a slippery slope, because many apps could overcompensate and always
use fallocate()... we really don't want that. So some form of quota
might need to be enforced at the cgroup level (once a cgroup's
reservation quota is exceeded, fallocate()'s REQ_RESERVE bio passdown
would be skipped). Grafting cgroup-based policy into DM is a whole other
can of worms, but doable.

Open to other ideas...

Mike
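The queue limits that heuristic keys off of are already exported in
sysfs, so a rough sketch of the selection logic might look like the
following (the device name is hardcoded for illustration; the eventual
lvm2 profile code will surely differ).

    /* chunk_hint.c -- sketch of the profile heuristic above: pick a
     * thinp chunk size from the data device's exported queue limits. */
    #include <stdio.h>

    /* Read one integer from a sysfs queue-limit file, e.g.
     * /sys/block/sda/queue/optimal_io_size (bytes; 0 if unreported). */
    static long read_limit(const char *dev, const char *limit)
    {
        char path[128];
        long val = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, limit);
        if ((f = fopen(path, "r")) == NULL)
            return 0;
        if (fscanf(f, "%ld", &val) != 1)
            val = 0;
        fclose(f);
        return val;
    }

    int main(void)
    {
        const char *dev = "sda";      /* hypothetical data device */
        long min_io = read_limit(dev, "minimum_io_size");
        long opt_io = read_limit(dev, "optimal_io_size");
        long floor = 64 * 1024;       /* the 64K floor from the profiles */

        /* performance profile: full-stripe writes on RAID[56] */
        long perf = (opt_io > floor) ? opt_io : floor;
        /* snapshot profile: smaller chunks = finer-grained sharing */
        long snap = (min_io > floor) ? min_io : floor;

        printf("performance profile chunk size: %ldK\n", perf / 1024);
        printf("snapshot profile chunk size:    %ldK\n", snap / 1024);
        return 0;
    }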
Re: Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning
On Wed, Jul 31 2013 at 2:38pm -0400,
Eric Sandeen wrote:

> On 7/31/13 12:08 PM, Chris Murphy wrote:
> > On Jul 31, 2013, at 8:32 AM, Mike Snitzer wrote:
> > >
> > > But on the desktop the Fedora developers need to provide sane
> > > policy/defaults.
> >
> > Right. And the concern I have (other than a blatant bug) is the F20
> > feature for the installer to create thinp LVs; and to do that the
> > installer needs to know what sane default parameters are. I think
> > perhaps determining those defaults is non-obvious, because of my
> > experience in this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=984236
> >
> > If I'm going to use thinp mostly for snapshots, then that suggests a
> > smaller chunk size at thin-pool creation time; whereas if I have no
> > need for snapshots but a greater need for provisioning, then a larger
> > chunk size is better. And asking for usage context in the installer
> > is, I think, a problem.
>
> Quite some time ago I had asked whether we could get the
> allocation-tracking snapshot niceties from dm-thinp without actually
> needing it to be thin.
>
> i.e. if you only want the efficient snapshots, a way to fully provision
> a "thinp" device. I'm still not sure if this is possible...?

TBD. We could add a "sandeen_makes_thinp_his_bitch" param, and if
specified (likely for the entire pool, but we'll see) it would mean thin
volumes allocating from the pool would have their logical address space
reserved to be completely contiguous on creation (with all thin blocks
flagged in the metadata as RESERVED). The actual thin-block allocation
(zeroing of blocks on first write if configured, etc.) transitions the
metadata's block from RESERVED to PROVISIONED.

It isn't yet clear to me that the DM thinp code can be easily adapted to
make thin-block allocation 2-staged. But it would seem to be a prereq
for dm-thinp's REQ_RESERVE support. I'll check with Joe (cc'd) and come
back with his dose of reality ;)

> I guess I'm pretty nervous about offering actual thin-provisioned
> storage to "average" Fedora users. I'm having nightmares about the
> "bug" reports already, just based on the likelihood of most users
> misunderstanding the feature and its requirements & expected
> behavior...

Heh, you shouldn't be nervous. You can just punt said bugs over the
fence, right? ;)

Mike
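Purely to illustrate the two-staged allocation being floated here -- a
toy model of the proposed block states, not dm-thin code:

    /* Toy model of the proposed 2-staged allocation -- NOT dm-thin
     * code, just the state transitions described above. */
    #include <stdio.h>

    enum block_state {
        UNMAPPED,     /* no pool space associated with this thin block */
        RESERVED,     /* space claimed up front (e.g. via REQ_RESERVE),
                         but never written */
        PROVISIONED,  /* first write done: zeroed (if configured), mapped */
    };

    /* The first write to a block completes whatever allocation was (or
     * wasn't) staged earlier. */
    static enum block_state first_write(enum block_state s)
    {
        switch (s) {
        case UNMAPPED:   /* classic thinp: allocate + provision in one step */
        case RESERVED:   /* proposed: space already claimed, just provision */
        case PROVISIONED:
            break;
        }
        return PROVISIONED;
    }

    int main(void)
    {
        /* a block whose space was reserved at volume-creation time... */
        enum block_state s = RESERVED;
        /* ...is only fully provisioned by its first write */
        s = first_write(s);
        printf("after first write: %s\n",
               s == PROVISIONED ? "PROVISIONED" : "still pending");
        return 0;
    }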
Re: Does your application depend on, or report, free disk space? Re: F20 Self Contained Change: OS Installer Support for LVM Thin Provisioning
On Wed, Jul 31 2013 at 5:53pm -0400,
Chris Murphy wrote:

> On Jul 31, 2013, at 12:38 PM, Eric Sandeen wrote:
> >
> > i.e. if you only want the efficient snapshots, a way to fully
> > provision a "thinp" device. I'm still not sure if this is possible…?
> >
> > […]
> >
> > I guess I'm pretty nervous about offering actual thin-provisioned
> > storage to "average" Fedora users. I'm having nightmares about the
> > "bug" reports already, just based on the likelihood of most users
> > misunderstanding the feature and its requirements & expected
> > behavior…
>
> So possibly the installer should be conservative about how thin the
> provisioning is;

We (David Lehman, myself, and others on our respective teams) already
decided some months ago that any thin LVs that anaconda establishes will
_not_ oversubscribe the thin-pool. And in fact a reserve of free space
will be kept in the thin-pool as well as in the parent VG.

> otherwise I'm imagining an inadequately provisioned thinp LV, while
> also using the rollback feature [1].

Can you elaborate? Rollback with LVM thin provisioning doesn't require
any additional space in the pool. It is a simple matter of swapping the
internal device_ids that the thin-pool uses as an index to access the
corresponding thin volumes; this is done when activating the thin
volumes.

lvm2's support for thinp snapshot merge (aka rollback) is still pending,
but RFC patches have been published via this BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=957881

> [1] https://fedoraproject.org/wiki/Changes/Rollback

The Rollback project authors have been having periodic concalls with
David Lehman, myself, and others. So we are relatively coordinated ;)

Mike