On Sat, Apr 12, 2014 at 8:35 PM, Daniel Shahaf <d...@daniel.shahaf.name> wrote:
> Stefan Fuhrmann wrote on Fri, Apr 11, 2014 at 22:01:53 +0200:
> > Now, if there were 50 requests for the *same* reconstructed data:
> >
> >   file_t repo_files[50][20]
> >   for k in 0..49 parallel_do
> >     for i in 0..19 : repo_files[k][i].open("revs/$i")
> >     result[k] = ""
> >     for i in 0..19 : result[k].combine(repo_files[k][i].read())
> >
> > Caches don't help if they don't contain the data:
> >
> >   for k in 0..49 parallel_do
> >     result[k] = cache.lookup()   // fails for all at the same time
> >     if result[k].missing then
> >       // moved sub-loops to sub-function for clarity
> >       result[k] = reconstruct(repo_files[k])
> >       cache.add(result[k])
> >
> > There are two major problems with that:
> >
> > (1) We process the same data 50 times while once would suffice.
> >     SVN-internal caches did not help; however, the OS may only
> >     have read the data once and then fed us from its disk cache.
>
> In other words: the problem is not eviction out of the disk cache but
> having to recreate the same fulltext many times.

Yes. It's about what has to happen before anything gets *into* the
cache(s), and about not wasting effort doing it. The principle applies
to other data as well (revprop packs, manifest / index files etc.).

> > (2) We keep 1000 files open. On some systems, this may cause
> >     resource shortages.
>
> It sounds like (2) is an independent problem: we open files in advance
> of actually needing them. (i.e., something the branch doesn't/won't
> change)

Yes and no. It is quite hard to change our file usage pattern as we
have to keep files open in the event of a background 'svnadmin pack',
and the branch won't change that. However, having only one request
opening a bunch of files, instead of many requests doing the same,
will avoid the resource issues in many scenarios.

> > The ideal solution / control flow would look like this:
> >
> >   for k = 0 do
> >     result[k] = reconstruct(repo_files[k])
> >     cache.add(result[k])
> >
> >   for k in 1..49 parallel_do
> >     result[k] = cache.lookup()
>
> Just to be clear, result[k] is some fulltext, right? (of a file or of a
> directory)

Yes, or a similar cacheable object.

> > Since we don't (can't?) coordinate requests on a global level, this
> > is what we do on the thunder branch:
>
> I'm not sure why you say we can't do global coordination. We can use
> memcached for the fulltext cache, can't we?

What I was trying to say is that we can't / won't order the requests in
the sense that one becomes the dedicated "scout thread" and everyone
else would then be explicitly fed from cache only. Instead, any random
one may request a given piece of data first, and all others wanting the
same data will wait for a short period. Then they check whether they
can get the data from cache and read it from disk if they can't.

So, all requests are still very much independent of one another, our
caching does not need to be perfect, and there are no timing
constraints (cached data might eventually get evicted etc.).
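To make that request flow concrete, here is a minimal, runnable Python
sketch of the idea. It is illustration only: the names (cache,
in_flight, reconstruct, SLEEP_SECONDS) are made up and none of this is
the actual branch code.

    import threading
    import time

    SLEEP_SECONDS = 0.25   # upper bound on the extra delay for "followers"
    cache = {}             # stand-in for the SVN-internal caches
    in_flight = set()      # data locations currently being reconstructed
    in_flight_lock = threading.Lock()

    def reconstruct(key):
        time.sleep(0.1)    # pretend this is the expensive reconstruction
        return "fulltext of %s" % key

    def get_fulltext(key):
        value = cache.get(key)
        if value is not None:          # hot path: no coordination, no delay
            return value

        with in_flight_lock:
            first = key not in in_flight
            if first:
                in_flight.add(key)     # we get to do the actual read

        if not first:
            # Another request is already reconstructing this data.  Wait a
            # short period, then try the cache again; fall back to doing
            # the work ourselves if it is still not there.
            time.sleep(SLEEP_SECONDS)
            value = cache.get(key)
            if value is not None:
                return value

        value = reconstruct(key)
        cache[key] = value
        if first:
            with in_flight_lock:
                in_flight.discard(key) # only after the cache is populated
        return value

With 50 threads calling get_fulltext() on the same key, typically only
one of them ends up in reconstruct(); the others either hit the cache
directly or after the short sleep, and a miss after the sleep simply
degrades to today's behaviour.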
> The cache is a singleton,
> so the cache's getter is a chokepoint that all threads will go through,
> so couldn't we put the entire thunder logic there?
>
> For example:
>
>   svn_error_t *
>   svn_cache__get_and_if_absent_populate(void **value,
>                                          /* svn_boolean_t *found, */
>                                          svn_cache__t *cache,
>                                          const void *key,
>                                          svn_error_t *(*factory)(void **value,
>                                                                  void *baton,
>                                                                  apr_pool_t *scratch_pool),
>                                          void *factory_baton,
>                                          apr_pool_t *result_pool,
>                                          apr_pool_t *scratch_pool);
>
> where FACTORY is called if KEY is not in cache and either (a) the
> current thread is the first in the thunder, or (b) the current thread is
> not first in the thunder, but it has slept for X seconds and the cache
> is unpopulated.
>
> Basically, keep in libsvn_fs only the determination of _which_ cache
> keys are going to be hot paths, and the implementation of the "thunder"
> considerations (let's maybe sleep to give another thread a chance) in
> svn_cache__*().

Well, that would be one option. I feel +/-0 on the cache interface
change; I'll think about it. Right now, the branch uses two wrapper
functions around the cache getters to achieve the same functionality
(e.g. svn_fs_fs__thundered_cache_get).

> >   for k in 0..49 parallel_do
> >     result[k] = cache.lookup()
> >     if result[k].missing then
> >
> >       token = thunder.coordinate(data_location)
> >       if token.another_got_to_read then    // all but the first
> >         result[k] = cache.lookup()
> >         if result[k].ok : jump done        // >90% hit rate
> >
> >       result[k] = reconstruct(repo_files[k])
> >       cache.add(result[k])
> >
> >       thunder.completed(token)
> >     done
> >
> > So, there is no penalty on the hot path, i.e. when data can be found
> > in the respective cache. The coordinating instance is also conceptually
> > simple (keep a list of all accesses in flight) and the delay for the
> > first thread is negligible. Concurrent threads reading the same location
> > will be blocked until the initial thread has completed its access. That
> > minimizes the code churn on the calling side.
> >
> > A timeout prevents rogue threads from blocking the whole system. Also,
> > entries that timed out will be removed from the access list. The rogue
> > thread would have to start another relevant data access (and be the
> > first) to block other threads for another time.
>
> Hmm. The way I imagined it, the non-first threads just do:
>
>   if (I_am_first):
>       value = compute_value_of(key)
>       cache[key] = value
>       return value
>   else:
>       time.sleep(SLEEP_LENGTH)
>       try:
>           return cache[key]
>       except KeyError:
>           value = compute_value_of(key)
>           cache[key] = value
>           return value
>
> That way, no thread ever calls sleep() more than once, so threads will
> never be delayed by more than SLEEP_LENGTH.
>
> (This should address Ivan's second concern.)

Well, this is roughly equivalent to what I implemented initially. The
difference was that I would use a timed wait instead of sleep, so that
the waiting threads would resume a.s.a.p. Testing showed that all this,
combined with the various mutex ops around it, created a 20% overhead
and some hard-to-interpret CPU load stats.

So, I changed it to simply using sleep() in r1586964 and removed all
the bits that weren't necessary anymore. Now things seem to scale
nicely in many scenarios, with extra clients increasing the processing
time only slightly.
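Since the coordinating instance keeps coming up: purely as an
illustration, here is a toy Python model of such an in-flight access
list with a timeout, adapted to the plain-sleep variant. The names
coordinate() / completed() follow the pseudo-code above; everything
else (Thunder, TIMEOUT_SECONDS) is made up and not the actual branch
code.

    import threading
    import time

    TIMEOUT_SECONDS = 2.0   # a stuck "first" thread stops blocking others

    class Thunder(object):
        """Toy model of the list of data accesses currently in flight."""

        def __init__(self):
            self._lock = threading.Lock()
            self._in_flight = {}      # data location -> start timestamp

        def coordinate(self, location):
            """Return a (location, another_got_to_read) token."""
            now = time.time()
            with self._lock:
                started = self._in_flight.get(location)
                if started is not None and now - started < TIMEOUT_SECONDS:
                    return (location, True)   # someone else reads the data
                # We are first (or the previous entry timed out): register.
                self._in_flight[location] = now
                return (location, False)

        def completed(self, token):
            location, another_got_to_read = token
            if another_got_to_read:
                return                        # only the reader unregisters
            with self._lock:
                self._in_flight.pop(location, None)

The calling side then looks much like the sketch further up: cache
lookup, coordinate(), a short sleep plus a second cache lookup for
everyone who was not first, and reconstruct() plus completed()
otherwise. In this toy model, entries that outlive TIMEOUT_SECONDS are
simply taken over by the next caller, so a rogue thread can delay
others at most once per access.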
> > My plan is to test the code on my server setup at home to get a more
> > nuanced picture of what the performance and scalability impact is.
> > On my SSD macbook, I get 70% less total CPU cost, 50% higher throughput
> > and smooth handling of 200 concurrent c/o (vs running out of file handles
> > at 100 clients) over ra_svn.
>
> FWIW, what I wonder about is how well the change scales down (to small
> setups) and how the code is going to look with the feature added.

It's all implemented and ready for review now.

So far, I haven't seen any measurable impact on single-request
scenarios. It is clear that small, concurrent requests may see an extra
delay due to the sleep. But since we sleep only if I/O (or at least OS
interaction) is necessary, and only for 250ms max, the added
user-visible delay will be short and, in most scenarios, infrequent.

> > Should these trends be confirmed for "real" HW, networks and ra_serf,
> > I'd love to see this code in 1.9 - after due review and feedback.
>
> Having it in 1.9 if it's ready in time would be great. That said, we're
> releasing an alpha tomorrow, and I'm honestly not sure when it would be
> too late to add this. (That is, I'm not sure when we enter feature
> freeze. We don't allow any new features after an RC; I'm not sure how
> long before an RC a feature such as this one would have to be merged.)

Well, I think the code is relatively simple and robust, and its
interaction with the existing code is also clear. I've been testing the
code in the scenarios that sparked me into attempting this solution,
and things look promising. But it still needs broader testing, and I
want to be able to fully explain all the data that I see. In the end, I
need to be able to justify releasing it in 1.9; otherwise, it would
probably go into 1.10.

I'll continue work on that once I come back after Easter. Until then,
review and feedback would be very much appreciated.

-- Stefan^2.