Julian Foad <julianf...@apache.org> writes:

> Conclusions:
> ------------
>
> It is certainly possible that we could modify "update" and the other
> "online" operations, at least, and the previously "offline" operations
> too if we want, to make them fetch pristines at point-of-use in this way.
>
> Such modifications are not trivial.  There is the need to run additional
> RA requests in between the existing ones, perhaps needing an additional
> RA session to be established in parallel, or taking care with inserting
> RA requests into an existing session.
I think that this part has a lot of protocol constraints and hidden
complexity.  And things could probably get even more complex for merge
and diff.

Consider a bulk update report over HTTP, which is just a single response
that has to be consumed in a streaming fashion.  There is no request
multiplexing, and fetching data through a separate connection is going
to limit the maximum size of a pristine file that can be downloaded
without hitting a timeout on the original connection.  Assuming the
default HTTP timeout of httpd 2.4.x (60 seconds) and a 100 MB/s data
transfer rate, the limit for a pristine's size is going to be around
6 GB.

This kind of problem probably isn't limited to this specific example and
protocol, considering, for example, that an update editor drive transfers
control at certain points (e.g., during a merge) and thus cannot keep
reading the response.

When I was working on the proof-of-concept, encountering these issues
stopped me from considering the approach of fetching pristines at the
point of access to be practically feasible.  That also led to the
alternative approach initially implemented on the `pristines-on-demand`
branch.

Going slightly off-topic, I tend to think that, even despite its
drawbacks, the current approach on the branch should work reasonably
well for the MVP.  To elaborate on that, let me first share a few
assumptions and thoughts I had in mind at that point in time:

1) Let's assume a high-throughput connection to the server, 1 Gbps or
   better.  With slower networks, working with large files is going to
   be problematic just by itself, so that might be considered out of
   scope.

2) Let's assume that a working copy contains a large number of blob-like
   files.  In other words, there are thousands of 10-100 MB files, as
   opposed to one or two 50 GB files.

3) Let's assume that in the common case, only a small fraction of these
   files are modified in the working copy.
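For clarity, here is a quick sketch of where the ~6 GB figure comes
from (the 60-second timeout is httpd's documented default; the 100 MB/s
rate is an assumption, not a measurement):

```python
# Back-of-envelope check of the pristine size limit described above.

TIMEOUT_SECONDS = 60      # httpd 2.4.x default "Timeout" directive
TRANSFER_RATE_MB_S = 100  # assumed sustained rate on a ~1 Gbps link

# While a pristine is being fetched over a separate connection, the
# original connection sits idle, so the largest pristine that can be
# fetched before the original connection times out is roughly:
limit_mb = TIMEOUT_SECONDS * TRANSFER_RATE_MB_S
print(f"~{limit_mb / 1000:g} GB")  # prints "~6 GB"
```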
Then for a working copy with 1,000 files of 100 MB each and 10 modified
files:

A) Every checkout saves 100 GB of disk space; that's pretty significant
   for a typical solid-state drive.

B) Hydrating won't transfer more than 1 GB of data, or about 10 seconds
   under an optimistic assumption.

C) The more uncommon case of 100 modified files is going to result in
   10 GB of transferred data and about two minutes of time; I think
   that's still pretty reasonable for an uncommon case.

So while the approach used in the proof-of-concept might be non-ideal,
I tend to think it should work reasonably well in a variety of use
cases.  I also think that this approach could even be releasable in the
form of an MVP, accompanied by a UI option to select between the two
states during checkout (all pristines / pristines-on-demand) that is
persisted in the working copy.

A small final note is that I could be missing some details or other
cases, but in the meantime I felt like sharing the thoughts I had while
working on the proof-of-concept.

Thanks,
Evgeny Kotkov
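P.S.  For convenience, a small script that reproduces the
back-of-envelope numbers above (the file counts, sizes, and the
~100 MB/s rate are the assumptions from the example, not measurements):

```python
# Figures for the example scenario: a working copy with 1,000 files of
# 100 MB each, on an assumed ~1 Gbps (~100 MB/s) connection.

FILE_COUNT = 1000
FILE_SIZE_MB = 100
RATE_MB_S = 100  # optimistic sustained transfer rate

# A) Disk space saved per checkout by not storing pristines:
saved_gb = FILE_COUNT * FILE_SIZE_MB / 1000       # 100 GB

# B) Hydrating 10 modified files:
data_10_gb = 10 * FILE_SIZE_MB / 1000             # 1 GB
time_10_s = 10 * FILE_SIZE_MB / RATE_MB_S         # 10 seconds

# C) The uncommon case of 100 modified files:
data_100_gb = 100 * FILE_SIZE_MB / 1000           # 10 GB
time_100_s = 100 * FILE_SIZE_MB / RATE_MB_S       # 100 s, ~2 minutes

print(saved_gb, data_10_gb, time_10_s, data_100_gb, time_100_s)
```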