Bert: Sorry for getting emotionally worked up in my last response. I know you're only expressing your concern about maintaining speed, and that's fine.
- Julian ----- Original Message ----- > From: Julian Foad <julianf...@btopenworld.com> > To: Bert Huijben <b...@qqmail.nl> > Cc: 'Paul Burba' <ptbu...@gmail.com>; 'Subversion Development' > <dev@subversion.apache.org> > Sent: Thursday, 13 September 2012, 18:27 > Subject: Re: [RFC] Inheritable Properties Branch -- caching props in WC DB > > Bert Huijben wrote: > >>> From: Julian Foad [mailto:julianf...@btopenworld.com] >>> >>> This branch caches properties of certain repository nodes, in a new > place >>> in the WC DB. >>> >>> I haven't examined it at all yet. The Wiki page says: "The > cache will be >>> stored in a new BLOB column in the NODES table called > 'inherited_props'. >>> If the node is the base node of a WC root then this column is > populated >>> with a serialized skel of the node's inherited properties. In all > other >>> cases it is NULL." >>> >>> Another new special column in the NODES table... the twenty-third >>> column. Do we have to? Is it not complex enough already? >>> >>> I want to make a radical suggestion. >>> >>> What would it take to be able to share some data instead of creating a >>> special-purpose column just for this cache? Well, it would take more > work >>> initially, but hear me out, as I have a hunch it may make less work in > the >>> end. >>> >>> I think we would do well to factor out the storage of all data about a >>> repository node-rev, moving that into a separate table that is >>> *referenced* from the NODES table. That new table would be indexed by >>> node-rev (that is: repos_id, rev, >>> relpath) and have a column for the properties and perhaps other > columns >>> for other attributes of the node-rev. >> >> That would be two additional table lookups per retrieval inside Sqlite3, >> [...] Have you investigated the performance costs of this proposal? > > No. > >> I heard similar suggestions about properties before, but additional table >> lookups are certainly not free in Sqlite. They are not expensive by itself, >> but every extra db transaction is certainly a performance hit. >> >>> What we need to know here (for inherited props) is the properties set > on >>> each ancestor of ever WC "root node" (which is defined > elsewhere and >>> includes switched nodes etc.) That's not a complex requirement. > Instead >>> of creating a new column dedicated to storing the union of all the > props >>> inherited from every ancestor node, how about we fetch and > concatenate, >>> on request, the node-rev table rows for all the pathwise ancestors of > the >>> desired node. The cache maintenance then becomes ensuring that a set > of >>> node-rev table rows are populated (populated in the standard way for > such >>> rows), and that those rows are removed when no longer referenced. >>> >>> But this particular use is not the only reason for factoring out this > data >>> into a separate table. There is: >>> >>> - For inherited props, this use. >>> >>> - For inherited props, if the current model of BASE node inheritance >>> remains, we're also going to need to cache the inherited props for > every >>> mixed-rev node (that is, every node whose rev != its parent-in-BASE >>> rev). (I'll write separately about why we need this for off-line >>> operation.) >> >> I think this case was explicitly ignored. >> >> This would require that operations like commit start updating parts of the >> working copy with information that is not available locally. > > Yes. That's not a problem, is it? > >> Normally the commit shouldn't be able to introduce invalid inherited > state >> (or the commit would fail). > > Pardon? I don't understand what you're saying there. > >> And an explicit update of a subtree already adds the necessary information >> to the root (if the current state of the code still matches what I read >> before) >> >>> - We already have (I think) the case where a local WC-to-WC copy or > move >>> of a tree results in poulating another set of tree nodes in the NODES > table >>> with an identical copy of all the data. Whenever we can reduce the > amount >>> of such copying, if the additional complexity is not too great, that > can >>> make it easier to achieve both correctness and efficiency. >> >> When we copy from BASE to another layer we don't have to copy the > inherited >> properties. The inheritance in the new location will be based on the new >> url, not the original location. So the current queries already handle this >> case by not copying the data from the BASE layer. > > I'm talking there about regular, explicit properties, not inherited > properties. > > >>> - For conflicts, it's useful to store the "theirs" and > "old base" >>> versions of the conflicted node. Its kind, its properties, the > reference >>> to its (pristine) text, and perhaps other attributes. >> >> On trunk everything is prepared to store conflicts in a skel, what has this >> to do with the inherited property storage? > > It has nothing to do with inherited properties except that inherited > properties > and conflicts could both refer to this table of cached node-revs. > >> In format 30 we already have the urls and properties of the conflicted > nodes >> (not the pristine text yet, but with the url we can retrieve that or we can >> implement storing the sha1 references in the separate columns as initially >> planned) > > If we go this way, then I would suggest we migrate the conflict storage skel > to > use (repos_id, rev, relpath) references to node-rev rows instead of storing > the > props and other bits of metadata explicitly. > > >>> - I think it would make the WC *much* more maintainable. At the > moment, >>> there doesn't appear to be any codified connection between those > eight >>> columns in NODES. There doesn't even seem to be any comment in > the doc >>> strings to indicate that. It's almost impossible to analyze the > huge >>> query statements to check that they're all updated together. >>> >>> Therefore I propose a WC development that adds a new table (call it >>> NODEREV?), indexed by repos/rev/relpath, with eight further columns > that >>> will be moved out of the current NODES table: >>> >>> repos_id, revision, repos_path -- the index columns >>> kind >>> properties >>> checksum >>> symlink_target >>> changed_revision >>> changed_date >>> changed_author >>> translated_size >> >> (translated_size or really recorded_size shouldn't be in this list as > it can >> have a different value for the same file if it is checked out multiple >> times) > > Oh, how can that be? Oh, you mean the documentation is wrong? Well, fancy > that. > > >>> Then we add an indirection from the NODES table via the index columns >>> whenever we access those columns. And some means of controlling >>> creation and deletion of those rows, perhaps using a ref-count. >> >> Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries > and >> even more for cascading updates to provider better maintainability, while a >> typical checkout working copy only contains nodes once? >> >> I did a lot of performance work to get at the overall 15% speedup we have >> since 1.7. Do you have any ideas on how much of that we will lose after > this >> suggested refactoring? > > No, I have no idea. Guess... guess... 0.1%? > > Seriously, please let me put out a design refactoring proposal without > beating > me down with the "slowness" argument. I know speed is important, but > at the moment I'm concerned WC maintenance is soon going to die. To me it > looks just as bad (in terms of undocumented complexity) as WC-1.0. It's a > good effort, don't get me wrong -- I applaud the success of the WC rewrite > and the work you've put in to make it fast as well as correct. That's > all great, but I occasionally look at the code and the (non-)documentation > and > close the file again and leave it to someone else. > > If the idea is rubbish because it has unfeasible performance, then fine, we > won't do it. I'm sure you know much more than I do about WC DB > performance, so maybe you're right. > > I *know* you and others are concerned about "performance" (I put it in > > quotes because it's referring to just one kind of performance: that is, > speed of performing a given operation). Sure that kind of performance is > important, I don't deny it, and I don't want to needlessly reduce it > either. > > But how about recognizing that there's some value to my suggestion and > seeking a way to keep the code-organization value without hitting the > performance penalty, instead of just hitting me with "doing that will add > some run-time cost" implying that's so wrong it doesn'y bear > thinking about. > > Maybe we could make a tiny but constructive start by adding a note in the > documentation that those columns all belong "together" in the sense of > being attributes of a node-rev, and maybe use some C preprocessor macros to > start handling them all together where it makes sense, instead of leaving > developers to try to remember which eight of the twenty-three arguments to a > SELECT statement or a 'read_info' call should all be null or all be > non-null together. > > >> I'm no fan of adding yet another layer in the working copy to just >> theoretically improve maintainability. That layer will also require >> maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases >> we are slowly moving to a decentralized DVCS and maybe the storage should >> then also move to such a model. > > Sorry, I'm not following. > >>> I acknowledge that such a development is far from trivial, and I > don't >>> expect Paul to do it as part of this work. I just want to get the >>> proposal out there and see if there is any consensus for working > toward >>> something like it. >>> >>> In terms of the inheritable props branch, I wonder if there's any > way >>> we can move a bit closer to this proposal without going the whole hog, >>> but I can't think of anything. >> >> I don't think this proposal, the inherited properties and/or the >> conflicts-skels proposals are directly related. >> >> >> And before we look at this or similar refactorings I think we should > measure >> the impact on some 'simple' case like 'svn status', that > would >> have at least >> an additional lookup for every layer of every node for the foreign key >> handling. >> >> Bert > > - Julian >