On Fri, May 15, 2015 at 6:55 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
>
> See upthread.  It might be doable (provided that we turn ->i_mutex into
> rwsem, to keep the exclusion with directory _modifiers_), but it'll need
> a really non-trivial code review of a bunch of filesystems, especially ones
> that want to play with the list of children like ceph does.  And things
> like sillyrename and dcache-populating readdir instances, albeit not as scary
> as ceph.  And then there's lustre...

Yup. I don't think it's viable if we can't do it gradually, and leave
filesystems with the option to basically keep the existing locking.
Because most won't care that deeply anyway, and some have complications
like ceph.

But we might be able to do *some* changes that wouldn't be that
noticeable. For example, something like

 - phase 1:

   Turn i_mutex into an rwsem, change all users to take it for writing.

   This part should be pretty much a semantic no-op.

 - phase 2:

   For filesystems that say that they are ok with it, make lookup_slow()
   (and *only* lookup_slow for now) instead take the rwsem for reading,
   but in addition to that, take a hashed mutex.

By "hashed mutex", I mean having a smallish table of mutexes (say,
1024), and just creating a hash based on the name-hash and the parent
pointer. That way we can avoid all the issues with adding a new lock to
the dentry itself, or having to allocate a new child dentry just for
the lock. It *could* cause some cross-directory serialization due to
hash collisions, but that shouldn't be noticeable if the hash is of a
reasonable size and quality.

That would allow lookups (and _only_ lookups) to happen in parallel,
but the hashed mutex would mean that you'd serialize the "same name in
same directory" case. And we'd require filesystems to say "I can
support this concurrent lookup model".

There might be a "phase 3" and so on where we could expand this to
slightly more than just lookup_slow(), but I suspect that even doing it
*just* there would already catch the bulk of issues. And requiring
filesystems to sign up for it means that we can ignore any ugly cases.

I dunno. The above _sounds_ fairly safe and easy because of how it
limits the impact. But I might be missing something.

                 Linus