Stefan Fuhrmann <[email protected]> writes: > Yes. This is exactly why we can only use it when we have reasonable control > over the stream's usage, i.e. we can use it in our CL tools because all the > code that will be run is under our control. But we cannot make e.g. > svn_stream_for_stdin() use it by default.
[...] > The best solution seems to be to allow for explicit resource management as > we do with other potentially "expensive" objects. r1700305 implements that. I have several concerns about these changes (r1698359 and r1700305): 1) The new public API — svn_stream_wrap_buffered_read() — is dangerous to use, and wrong usage causes unbounded memory consumption in seemingly harmless situations. A svn_stream_mark() mark placed on this stream causes all data that's read afterwards to be stored in memory, even if the data was processed and immediately discarded. Lifetime of the buffered data is bound to the lifetime of the pool where the mark object lives. Hence, avoiding memory usage issues now requires either precise pool management or explicit control of where and when the svn_stream_mark_t objects are added and removed. For example, prior to r1700305, svnadmin load-revprops was actually causing unbounded memory usage. Executing this command with a dump containing large revisions (e.g., .iso files) was quickly consuming all the available RAM on the machine. This behavior was caused by a svn_stream_mark_t allocated in a long-living pool that was preventing deallocation of the buffered data. As of r1700305, the problem is mitigated by calling svn_stream_remove_mark() within the svn_stream_readline() implementation. However, the new stream is generic, and every potential user of the stream API should be aware of these caveats, because she could in fact be working with the new stream. Place a mark somewhere in the middle of what svn_repos_parse_dumpstream3() does — and you could be already consuming unbounded amounts of memory. As another example, the pools are not being cleaned up / destroyed during error flow. Say, we successfully place a stream mark, but receive an error afterwards, and, coincidentally, the caller thinks that it's okay to keep reading from the stream after this specific error. Again, the memory usage is potentially unbounded. 2) The new API is backward incompatible. The problem is not only about how we handle this in our codebase. Existing API users have no idea about the fact that having svn_stream_mark_t objects could be "expensive", and about the existence of svn_stream_remove_mark(). Calling svn_stream_mark() can have a severe impact on the behavior of the stream, and can trigger memory usage issues if used improperly. The new stream is a generic svn_stream_t, but instances of this stream cannot cross the API boundaries, because if they do, the users would be required to adapt their code in order to avoid the possibility of unbounded memory consumption. 3) The behavior of pool-based reference counting is unpredictable. The new svn_stream_wrap_buffered_read() stream performs reference counting on a per-pool basis, and its behavior depends of whether the corresponding pools were cleared / destroyed, or not. This behavior is unpredictable and can lead to hard-to-diagnose problems that only happen under non-default circumstances or with certain data patterns. Furthermore, we have a history of dealing with unbounded memory usage that happened due to pool-based refcounting in the first-level FSFS DAG cache (CVE-2015-0202 [1]). As far as I recall, at some point a long-living pool was causing positive reference count and preventing deallocation of the cached objects. As I see it, the new stream does the same sort of reference counting for pools holding svn_stream_mark_t objects, and, similarly, if one of those objects is allocated in a long-living pool, it prevents the deallocation of the data buffered by the stream. Given the above, I am -1 on this optimization for svnadmin load-revprops that was implemented in r1698359 and r1700305. Please revert these changes. As for the problem itself, if the way we currently process the input during svnadmin load and load-revprops is causing a noticeable overhead, I think that we should introduce -F (--file) option to both of these commands: svnadmin load /path/to/repos -F (--file) /path/to/dump svnadmin load-revprops /path/to/repos -F (--file) /path/to/dump As long as file streams support both svn_stream_seek() and svn_stream_mark(), this should avoid byte-by-byte processing of the input and get rid of the associated overhead. [1] https://subversion.apache.org/security/CVE-2015-0202-advisory.txt Regards, Evgeny Kotkov

