On Tue, Feb 17, 2015 at 3:44 AM, Ben Reser <b...@reser.org> wrote:

> I think a potentially bigger problem might be what to do about cache
> state. Cache reads are obviously not a problem. Theoretically, a cache
> write shouldn't happen until after a repository modification so that if
> we fail the repository write then we shouldn't end up with stale cache
> data. The cache design to be thread safe should be the same as what we
> need for this. The only thing I'm fuzzy on is if it's possible to have
> an assertion in the middle of a cache write that creates invalid state.
> This sort of scenario is the type of thing that using something like
> memcached for caching would illuminate already but I don't think very
> many people use that support and instead almost all of our caches are
> using per process caches, that might have such problems.
>
> Solving this would be a lot harder than simply auditing the clearing of
> errors, it'd be a lot of careful review of cache writes and you have to
> ensure this behavior gets maintained forever (the error clearing
> requires that too but it's a fairly simple rule to follow).
I just went through the cache_membuffer code looking for failure modes
and this is what I found. Other caches may behave differently.

* The assert()s within the cache code itself should not be in effect in
  a production (NDEBUG) environment.

* If they are in effect and get triggered, the cache metadata is
  corrupt. A call to the new svn_cache__membuffer_clear will reset all
  metadata.

* A failure in the APR locking code may leave the cache in an
  inaccessible state. That state may be transient or permanent. Failing
  to return the write lock will cause an infinite wait for all readers.
  Clearing the cache will not work either, as it also needs to acquire
  the write lock first.

* Reads and writes use (de-)serialization callbacks that might fail with
  svn_error_t* != 0. Locks will still be released in an orderly manner.

* svn_cache__set serializes the data into a temporary buffer first and
  only then copies it into the cache. Callback failures will not corrupt
  the cache.

* svn_cache__set_partial modifies the data in place. If the callback
  fails, the entry gets removed from the cache.

* Only svn_cache__get_partial and svn_cache__set_partial execute their
  callbacks while holding a cache lock. Those callbacks may fail but
  must never abort().

* Some partial getters call into other svn libs which might use
  assert() / abort(). On systems that use the rwlock, aborting one of
  those will leave (segments of) the cache read-only but otherwise
  functional.

* There is only one partial setter ATM, and that one uses neither
  assert() / abort() nor SVN_ERR_ASSERT.

IOW, binaries that have assert() enabled may leave the cache in an
unusable state if an assertion is triggered within the cache context.
IIUC, there is no way to test for that context, so we must assume that
the cache is dead. svn_error_t* returns, OTOH, should be transient and
will not affect cache consistency and usability. Therefore,
SVN_ERR_ASSERT should behave like a normal error rather than abort,
i.e. svn_error_malfunction needs to return an svn_error_t.

Finally, svn_error_malfunction _cannot_ clear the cache itself because
that may require recursive locks. Instead, the cache clear should happen
in the cache error handler (svn_cache__set_error_handler); see the
sketch in the P.S. below.

> This ultimately is something we have to solve, if we want to use a
> central server model in the future in order to have a central cache,
> we can't be aborting. So while I think doing this right might be
> slightly painful, long term it'll pay off.

Again, only speaking for cache_membuffer here, there are multiple
options. One would be clearing the cache. Another is adding another key
prefix such that new sessions will use a separate key space (see the
P.P.S. below). And there are more. It all depends on how requests /
sessions that have not yet failed are supposed to be treated (best
effort, forcibly abort, ...).

Another consideration would be the handling of repeated failures. A
central server could decide to "degrade" service on a "frequently
failing" repository by only using minimal, session-local caches for it.
We should have that on our radar once we start development on a central
server.

-- 
Stefan^2.
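
P.S.: To make the error-handler idea concrete, here is a minimal sketch.
It assumes the private svn_cache__set_error_handler API from
private/svn_cache.h and the new svn_cache__membuffer_clear; the handler
and baton names are made up for illustration. The point is only that the
clear runs after the failing cache operation has released its locks,
i.e. outside svn_error_malfunction.

/* Sketch only: reset the membuffer from the cache error handler instead
 * of from within svn_error_malfunction().  The names
 * clear_on_error_baton_t and clear_cache_on_error are hypothetical. */

#include "private/svn_cache.h"

typedef struct clear_on_error_baton_t
{
  /* The membuffer backing this cache front-end; reset on malfunction. */
  svn_membuffer_t *membuffer;
} clear_on_error_baton_t;

/* Implements svn_cache__error_handler_t.  Called after the failing
   cache operation has released its locks, so clearing cannot dead-lock
   on the cache's own write lock. */
static svn_error_t *
clear_cache_on_error(svn_error_t *err,
                     void *baton,
                     apr_pool_t *pool)
{
  clear_on_error_baton_t *b = baton;

  /* Only assertion-style failures invalidate the cache contents;
     ordinary svn_error_t* returns are transient (see above). */
  if (err && err->apr_err == SVN_ERR_MALFUNCTION)
    {
      svn_error_t *clear_err = svn_cache__membuffer_clear(b->membuffer);

      /* Keep the original error; chain the clear failure, if any. */
      return svn_error_compose_create(err, clear_err);
    }

  return err;
}

It would get installed per cache front-end with something like
svn_cache__set_error_handler(cache, clear_cache_on_error, baton, pool).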
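
P.P.S.: And the "separate key space" option as a sketch, independent of
how membuffer actually builds its prefixed keys internally (all names
below are made up). Bumping a generation counter makes every existing
entry unreachable without taking the write lock, which matters precisely
when the locking state itself may be what is broken; the price is that
the stale entries keep occupying capacity until they get evicted.

/* Sketch only: invalidate the logical cache by switching to a fresh key
 * prefix instead of clearing the physical buffer. */

#include <apr_atomic.h>
#include <apr_strings.h>

/* Bumped whenever the cache contents become suspect. */
static volatile apr_uint32_t cache_generation = 0;

/* O(1), lock-free invalidation: old entries remain in memory but can
   never be looked up again and will eventually be evicted. */
static void
invalidate_key_space(void)
{
  apr_atomic_inc32(&cache_generation);
}

/* Build the effective cache key, "<generation>:<user key>", in POOL. */
static const char *
make_cache_key(const char *user_key,
               apr_pool_t *pool)
{
  return apr_psprintf(pool, "%u:%s",
                      (unsigned int)apr_atomic_read32(&cache_generation),
                      user_key);
}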