Re: [RFC] Inheritable Properties Branch -- caching props in WC DB

Julian Foad Thu, 13 Sep 2012 15:50:27 -0700

Bert: Sorry for getting emotionally worked up in my last response.  I know 
you're only expressing your concern about maintaining speed, and that's fine.


- Julian



----- Original Message -----
> From: Julian Foad <julianf...@btopenworld.com>
> To: Bert Huijben <b...@qqmail.nl>
> Cc: 'Paul Burba' <ptbu...@gmail.com>; 'Subversion Development' 
> <dev@subversion.apache.org>
> Sent: Thursday, 13 September 2012, 18:27
> Subject: Re: [RFC] Inheritable Properties Branch -- caching props in WC DB
> 
> Bert Huijben wrote:
> 
>>>   From: Julian Foad [mailto:julianf...@btopenworld.com]
>>> 
>>>   This branch caches properties of certain repository nodes, in a new 
> place
>>>  in  the WC DB.
>>> 
>>>   I haven't examined it at all yet.  The Wiki page says: "The 
> cache will be
>>>  stored  in a new BLOB column in the NODES table called 
> 'inherited_props'.  
>>>  If the  node is the base node of a WC root then this column is 
> populated
>>>  with a  serialized skel of the node's inherited properties.  In all 
> other
>>>  cases  it is  NULL."
>>> 
>>>   Another new special column in the NODES table... the twenty-third
>>>   column.  Do we have to?  Is it not complex enough already?
>>> 
>>>   I want to make a radical suggestion.
>>> 
>>>   What would it take to be able to share some data instead of creating a
>>>   special-purpose column just for this cache?  Well, it would take more 
> work
>>>   initially, but hear me out, as I have a hunch it may make less work in 
> the
>>>  end.
>>> 
>>>   I think we would do well to factor out the storage of all data about a
>>>   repository node-rev, moving that into a separate table that is
>>>  *referenced*  from the NODES table.  That new table would be indexed by
>>>  node-rev (that  is: repos_id, rev,
>>>   relpath) and have a column for the properties and perhaps other 
> columns
>>>   for other attributes of the node-rev.
>> 
>>  That would be two additional table lookups per retrieval inside Sqlite3,
>>  [...] Have you investigated the performance costs of this proposal?
> 
> No.
> 
>>  I heard similar suggestions about properties before, but additional table
>>  lookups are certainly not free in Sqlite. They are not expensive by itself,
>>  but every extra db transaction is certainly a performance hit.
>> 
>>>   What we need to know here (for inherited props) is the properties set 
> on
>>>   each ancestor of ever WC "root node" (which is defined 
> elsewhere and
>>>   includes switched nodes etc.)  That's not a complex requirement.  
> Instead
>>>  of  creating a new column dedicated to storing the union of all the 
> props
>>>   inherited from every ancestor node, how about we fetch and 
> concatenate,
>>>   on request, the node-rev table rows for all the pathwise ancestors of 
> the
>>>   desired node.  The cache maintenance then becomes ensuring that a set 
> of
>>>   node-rev table rows are populated (populated in the standard way for 
> such
>>>   rows), and that those rows are removed when no longer referenced.
>>> 
>>>   But this particular use is not the only reason for factoring out this 
> data
>>>  into a  separate table.  There is:
>>> 
>>>     - For inherited props, this use.
>>> 
>>>     - For inherited props, if the current model of BASE node inheritance
>>>   remains, we're also going to need to cache the inherited props for 
> every
>>>   mixed-rev node (that is, every node whose rev != its parent-in-BASE
>>>   rev).  (I'll write separately about why we need this for off-line
>>>  operation.)
>> 
>>  I think this case was explicitly ignored.
>> 
>>  This would require that operations like commit start updating parts of the
>>  working copy with information that is not available locally.
> 
> Yes.  That's not a problem, is it?
> 
>>  Normally the commit shouldn't be able to introduce invalid inherited 
> state
>>  (or the commit would fail).
> 
> Pardon?  I don't understand what you're saying there.
> 
>>  And an explicit update of a subtree already adds the necessary information
>>  to the root (if the current state of the code still matches what I read
>>  before)
>> 
>>>     - We already have (I think) the case where a local WC-to-WC copy or 
> move
>>>   of a tree results in poulating another set of tree nodes in the NODES 
> table
>>>   with an identical copy of all the data.  Whenever we can reduce the 
> amount
>>>   of such copying, if the additional complexity is not too great, that 
> can
>>>  make it  easier to achieve both correctness and efficiency.
>> 
>>  When we copy from BASE to another layer we don't have to copy the 
> inherited
>>  properties. The inheritance in the new location will be based on the new
>>  url, not the original location. So the current queries already handle this
>>  case by not copying the data from the BASE layer.
> 
> I'm talking there about regular, explicit properties, not inherited 
> properties.
> 
> 
>>>     - For conflicts, it's useful to store the "theirs" and 
> "old base"
>>>   versions of the conflicted node.  Its kind, its properties, the 
> reference
>>>  to its  (pristine) text, and perhaps other attributes.
>> 
>>  On trunk everything is prepared to store conflicts in a skel, what has this
>>  to do with the inherited property storage?
> 
> It has nothing to do with inherited properties except that inherited 
> properties 
> and conflicts could both refer to this table of cached node-revs.
> 
>>  In format 30 we already have the urls and properties of the conflicted 
> nodes
>>  (not the pristine text yet, but with the url we can retrieve that or we can
>>  implement storing the sha1 references in the separate columns as initially
>>  planned)
> 
> If we go this way, then I would suggest we migrate the conflict storage skel 
> to 
> use (repos_id, rev, relpath) references to node-rev rows instead of storing 
> the 
> props and other bits of metadata explicitly.
> 
> 
>>>     - I think it would make the WC *much* more maintainable.  At the 
> moment,
>>>   there doesn't appear to be any codified connection between those 
> eight
>>>   columns in NODES.  There doesn't even seem to be any comment in 
> the doc
>>>   strings to indicate that.  It's almost impossible to analyze the 
> huge
>>>  query  statements to check that they're all updated together.
>>> 
>>>   Therefore I propose a WC development that adds a new table (call it
>>>   NODEREV?), indexed by repos/rev/relpath, with eight further columns 
> that
>>>   will be moved out of the current NODES table:
>>> 
>>>     repos_id, revision, repos_path -- the index columns
>>>     kind
>>>     properties
>>>     checksum
>>>     symlink_target
>>>     changed_revision
>>>     changed_date
>>>     changed_author
>>>     translated_size
>> 
>>  (translated_size or really recorded_size shouldn't be in this list as 
> it can
>>  have a different value for the same file if it is checked out multiple
>>  times)
> 
> Oh, how can that be?  Oh, you mean the documentation is wrong?  Well, fancy 
> that.
> 
> 
>>>   Then we add an indirection from the NODES table via the index columns
>>>   whenever we access those columns.  And some means of controlling
>>>   creation and deletion of those rows, perhaps using a ref-count.
>> 
>>  Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries 
> and
>>  even more for cascading updates to provider better maintainability, while a
>>  typical checkout working copy only contains nodes once?
>> 
>>  I did a lot of performance work to get at the overall 15% speedup we have
>>  since 1.7. Do you have any ideas on how much of that we will lose after 
> this
>>  suggested refactoring?
> 
> No, I have no idea.  Guess... guess... 0.1%?
> 
> Seriously, please let me put out a design refactoring proposal without 
> beating 
> me down with the "slowness" argument.  I know speed is important, but 
> at the moment I'm concerned WC maintenance is soon going to die.  To me it 
> looks just as bad (in terms of undocumented complexity) as WC-1.0.  It's a 
> good effort, don't get me wrong -- I applaud the success of the WC rewrite 
> and the work you've put in to make it fast as well as correct.  That's 
> all great, but I occasionally look at the code and the (non-)documentation 
> and 
> close the file again and leave it to someone else.
> 
> If the idea is rubbish because it has unfeasible performance, then fine, we 
> won't do it.  I'm sure you know much more than I do about WC DB 
> performance, so maybe you're right.
> 
> I *know* you and others are concerned about "performance" (I put it in 
> 
> quotes because it's referring to just one kind of performance: that is, 
> speed of performing a given operation).  Sure that kind of performance is 
> important, I don't deny it, and I don't want to needlessly reduce it 
> either.
> 
> But how about recognizing that there's some value to my suggestion and 
> seeking a way to keep the code-organization value without hitting the 
> performance penalty, instead of just hitting me with "doing that will add 
> some run-time cost" implying that's so wrong it doesn'y bear 
> thinking about.
> 
> Maybe we could make a tiny but  constructive start by adding a note in the 
> documentation that those columns all belong "together" in the sense of 
> being attributes of a node-rev, and maybe use some C preprocessor macros to 
> start handling them all together where it makes sense, instead of leaving 
> developers to try to remember which eight of the twenty-three arguments to a 
> SELECT statement or a 'read_info' call should all be null or all be 
> non-null together.
> 
> 
>>  I'm no fan of adding yet another layer in the working copy to just
>>  theoretically improve maintainability. That layer will also require
>>  maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases
>>  we are slowly moving to a decentralized DVCS and maybe the storage should
>>  then also move to such a model.
> 
> Sorry, I'm not following.
> 
>>>   I acknowledge that such a development is far from trivial, and I 
> don't
>>>  expect  Paul to do it as part of this work.  I just want to get the
>>>  proposal out there  and see if there is any consensus for working 
> toward
>>>  something like it.
>>> 
>>>   In terms of the inheritable props branch, I wonder if there's any 
> way 
>>>  we can  move a bit closer to this proposal without going the whole hog,
>>>  but I can't  think of anything.
>> 
>>  I don't think this proposal, the inherited properties and/or the
>>  conflicts-skels proposals are directly related.
>> 
>> 
>>  And before we look at this or similar refactorings I think we should 
> measure
>>  the impact on some 'simple' case like 'svn status', that 
> would 
>>  have at least
>>  an additional lookup for every layer of every node for the foreign key
>>  handling.
>> 
>>      Bert
> 
> - Julian
>

Re: [RFC] Inheritable Properties Branch -- caching props in WC DB

Reply via email to