On Mon, 15 Jan 2018 11:49:31 +0800 Antonios Motakis <antonios.mota...@huawei.com> wrote:
> On 13-Jan-18 00:14, Greg Kurz wrote:
> > On Fri, 12 Jan 2018 19:32:10 +0800
> > Antonios Motakis <antonios.mota...@huawei.com> wrote:
> >
> >> Hello all,
> >>
> >
> > Hi Antonios,
> >
> > I see you have attached a patch to this email... this really isn't the
> > preferred way to do things since it prevents commenting on the patch (at
> > least with my mail client). The appropriate way would have been to send
> > the patch with a cover letter, using git-send-email for example.
>
> I apologize for attaching the patch, I should have known better!
>

np :)

> >
> >> We have found an issue in the 9p implementation of QEMU, with how qid
> >> paths are generated, which can cause qid path collisions and several
> >> issues caused by them. In our use case (running containers under VMs)
> >> these have proven to be critical.
> >>
> >
> > Ouch...
> >
> >> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the
> >> inode number of the file as input. According to the 9p spec, the path
> >> should be able to uniquely identify a file: distinct files should not
> >> share a path value.
> >>
> >> The current implementation that defines qid.path = inode nr works fine as
> >> long as there are no files from multiple partitions visible under the 9p
> >> share. In that case, distinct files from different devices are allowed to
> >> have the same inode number. So with multiple partitions, we have a very
> >> high probability of qid path collisions.
> >>
> >> How to demonstrate the issue:
> >> 1) Prepare a problematic share:
> >>    - mount one partition under share/p1/ with some files inside
> >>    - mount another one *with identical contents* under share/p2/
> >>    - confirm that both partitions have files with the same inode nr, size, etc.
> >> 2) Demonstrate breakage:
> >>    - start a VM with a virtio-9p pointing to the share
> >>    - mount the 9p share with FSCACHE on
> >>    - keep share/p1/file open
> >>    - open and write to share/p2/file
> >>
> >> What happens is that the guest considers share/p1/file and share/p2/file
> >> to be the same file, and since we are using the cache it will not reopen
> >> it. We intended to write to partition 2, but we just wrote to partition 1.
> >> This is just one example of how the guest may rely on qid paths being
> >> unique.
> >>
> >> In the use case of containers, where we commonly have a few containers
> >> per VM, all based on similar images, these kinds of qid path collisions
> >> are very common and they seem to cause all kinds of funny behavior
> >> (sometimes very subtle).
> >>
> >> To avoid this situation, the device id of a file also needs to be taken
> >> as input for generating a qid path. Unfortunately, the size of inode nr
> >> + device id together would be 96 bits, while we have only 64 bits for the
> >> qid path, so we can't just append them and call it a day :(
> >>
> >> We have thought of a few approaches, but we would definitely like to hear
> >> what the upstream maintainers and community think:
> >>
> >> * Full fix: Change the 9p protocol
> >>
> >> We would need to support a longer qid path, based on a virtio feature
> >> flag. This would take reworking of host and guest parts of virtio-9p, so
> >> both QEMU and Linux for most users.
> >>
> >
> > I agree on a longer qid path, but we shouldn't tie it to a virtio flag
> > since 9p is transport agnostic. And it happens to be used with a variety
> > of transports. QEMU has both virtio-9p and a Xen backend, for example.
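
As a side note, the collision described above is easy to reproduce from the
host side. Below is a minimal, self-contained sketch (not QEMU code; the
helper qid_path_from_inode() and the file paths are only illustrative) that
mimics a qid path derived from the inode number alone and shows two distinct
files from different devices ending up with the same value:

/*
 * qid_collision_demo.c - not QEMU code; a self-contained illustration of
 * why deriving the 64-bit qid path from the inode number alone collides.
 *
 * Build:  cc -o qid_collision_demo qid_collision_demo.c
 * Usage:  ./qid_collision_demo share/p1/file share/p2/file
 *         (any two files with equal inode numbers on different filesystems)
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical stand-in for the mapping described above: the qid path is
 * taken from st_ino only, so st_dev is ignored. */
static uint64_t qid_path_from_inode(const struct stat *st)
{
    return (uint64_t)st->st_ino;
}

int main(int argc, char **argv)
{
    struct stat st1, st2;

    if (argc != 3 || stat(argv[1], &st1) != 0 || stat(argv[2], &st2) != 0) {
        fprintf(stderr, "usage: %s <file1> <file2>\n", argv[0]);
        return 1;
    }

    uint64_t q1 = qid_path_from_inode(&st1);
    uint64_t q2 = qid_path_from_inode(&st2);

    printf("%s: dev=%llu ino=%llu qid.path=%llu\n", argv[1],
           (unsigned long long)st1.st_dev, (unsigned long long)st1.st_ino,
           (unsigned long long)q1);
    printf("%s: dev=%llu ino=%llu qid.path=%llu\n", argv[2],
           (unsigned long long)st2.st_dev, (unsigned long long)st2.st_ino,
           (unsigned long long)q2);

    /* Distinct files (different device and/or inode) sharing a qid path is
     * exactly the collision the guest gets confused by. */
    if (q1 == q2 && (st1.st_dev != st2.st_dev || st1.st_ino != st2.st_ino)) {
        printf("collision: two distinct files map to the same qid path\n");
    }
    return 0;
}

If the two partitions were prepared as in the scenario above, the two stat()
calls report the same inode on different devices, so the sketch prints the
same qid path for both files.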
> >
> >> * Fallback and/or interim solutions
> >>
> >> A virtio feature flag may be refused by the guest, so we think we still
> >> need to make collisions less likely even with 64 bit paths. E.g.
> >
> > In all cases, we would need a fallback solution to support current
> > guest setups. Also, there are several 9p server implementations out
> > there (ganesha, diod, kvmtool) that are currently used with Linux
> > clients... it will take some time to get everyone in sync :-\
> >
> >> 1. XOR the device id with the inode nr to produce the qid path (we
> >> attach a proof of concept patch)
> >
> > Hmm... this would still allow collisions. Not good for fallback.
> >
> >> 2. 64 bit hash of device id and inode nr
> >
> > Same here.
> >
> >> 3. other ideas, such as allocating new qid paths and keeping a lookup
> >> table of some sort (possibly too expensive)
> >>
> >
> > This would be acceptable for fallback.
>
> Maybe we can use the QEMU hash table
> (https://github.com/qemu/qemu/blob/master/util/qht.c), but I wonder if it
> scales to millions of entries. Do you think it is appropriate in this case?
>

I don't know QHT, hence Cc'ing Emilio for insights.

> I was thinking about how to implement something like this without having to
> maintain millions of entries... One option we could consider is to split the
> bits into a directly-mapped part and a lookup part. For example:
>
> Inode =
> [ first 10 bits ] + [ lowest 54 bits ]
>
> A hash table maps [ inode first 10 bits ] + [ device id ] => [ 10 bit prefix ]
> The prefix is uniquely allocated for each input.
>
> Qid path =
> [ 10 bit prefix ] + [ inode lowest 54 bits ]
>
> Since inodes are not completely random, and we usually have a handful of
> device IDs, we get a much smaller number of entries to track in the hash
> table.
>
> So what this would give:
> (1) Would be faster and take less memory than mapping the full
>     inode_nr, dev_id tuple to unique QID paths
> (2) Guaranteed not to run out of bits when inode numbers stay within the
>     lowest 54 bits and we have fewer than 1024 devices.
> (3) When we go beyond this limit, there is a chance we run out of bits to
>     allocate new QID paths, but we can detect this and refuse to serve the
>     offending files instead of allowing a collision.
>
> We could tweak the prefix size to match the scenarios that we consider more
> likely, but I think somewhere around 10-16 bits sounds reasonable enough.
> What do you think?
>

I think that should work. Please send a patchset :)
(A rough sketch of the allocation logic is appended at the end of this mail.)

> >
> >> With our proof of concept patch, the issues caused by qid path collisions
> >> go away, so it can be seen as an interim solution of sorts. However, the
> >> chance of collisions is not eliminated; we are just replacing the current
> >> strategy, which is almost guaranteed to cause collisions in certain use
> >> cases, with one that makes them rarer. We think that a virtio feature
> >> flag for longer qid paths is the only way to eliminate these issues
> >> completely.
> >>
> >> This is the extent to which we were able to analyze the issue from our
> >> side; we would like to hear if there are better ideas, or which approach
> >> would be more appropriate for upstream.
> >>
> >> Best regards,
> >>
> >
> > Cheers,
> >
> > --
> > Greg
> >
>
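
For illustration, here is a rough, self-contained sketch of the prefix-mapping
idea discussed above. This is not QEMU code: a simple linear-scan array stands
in for a real hash table (QHT, GHashTable, ...), and the names and the exact
split (10-bit prefix, 54-bit inode suffix) are just the example values from
the mail.

/*
 * qid_prefix_map.c - rough sketch of the proposed qid path scheme:
 *   qid.path = [ 10-bit allocated prefix ][ 54 low bits of the inode ]
 * where a prefix is allocated per (device id, top 10 inode bits) pair.
 * All names here are illustrative, not QEMU code.
 */
#include <stdio.h>
#include <stdint.h>

#define PREFIX_BITS   10
#define SUFFIX_BITS   54
#define MAX_PREFIXES  (1 << PREFIX_BITS)                  /* 1024 prefixes */
#define SUFFIX_MASK   ((UINT64_C(1) << SUFFIX_BITS) - 1)

struct prefix_entry {
    uint64_t dev;        /* device id of the exported file */
    uint64_t ino_top;    /* bits of the inode above the lowest 54 */
    uint16_t prefix;     /* allocated 10-bit qid prefix */
    int      used;
};

static struct prefix_entry table[MAX_PREFIXES];
static unsigned next_prefix;

/* Returns 0 and fills *qid_path on success, -1 when the prefix space is
 * exhausted (the server would then refuse to serve the file rather than
 * risk handing out a colliding qid path). */
static int qid_path_for(uint64_t dev, uint64_t ino, uint64_t *qid_path)
{
    uint64_t ino_top = ino >> SUFFIX_BITS;
    unsigned i;

    /* Reuse an already allocated prefix for this (dev, ino_top) pair. */
    for (i = 0; i < next_prefix; i++) {
        if (table[i].used && table[i].dev == dev && table[i].ino_top == ino_top) {
            *qid_path = ((uint64_t)table[i].prefix << SUFFIX_BITS) | (ino & SUFFIX_MASK);
            return 0;
        }
    }
    if (next_prefix >= MAX_PREFIXES) {
        return -1; /* out of prefixes: detectable, no silent collision */
    }
    table[next_prefix] = (struct prefix_entry){ dev, ino_top, (uint16_t)next_prefix, 1 };
    *qid_path = ((uint64_t)next_prefix << SUFFIX_BITS) | (ino & SUFFIX_MASK);
    next_prefix++;
    return 0;
}

int main(void)
{
    uint64_t a, b;

    /* Same inode number on two different devices: with qid.path = inode
     * these would collide; here they get distinct prefixes and therefore
     * distinct qid paths. */
    if (qid_path_for(/* dev */ 1, /* ino */ 42, &a) == 0 &&
        qid_path_for(/* dev */ 2, /* ino */ 42, &b) == 0) {
        printf("dev 1 ino 42 -> qid path 0x%llx\n", (unsigned long long)a);
        printf("dev 2 ino 42 -> qid path 0x%llx\n", (unsigned long long)b);
    }
    return 0;
}

The important property is the error path: when the prefix space is exhausted,
the server can detect it and refuse to serve the offending file instead of
silently allowing a collision, as point (3) above describes. In QEMU proper
the table would presumably be per-export and use a real hash table, but the
allocation logic would stay the same.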