Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Daniel Shahaf Wed, 01 Dec 2010 21:45:15 -0800

[ finally getting back to this mail; having slept on it, etc. ]

Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100:
> On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d...@daniel.shahaf.name> 
> wrote:
> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
> >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d...@daniel.shahaf.name> 
> >> wrote:
> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
> >> >> I am now considering to abandon the tokens-approach, for the following 
> >> >> reasons:
> >> > ...
> >> >> So, unless someone can convince me otherwise, I'm probably going to
> >> >> stop with the token approach. Because of 2), I don't think it's worth
> >> >> it to spend the effort needed for 1), especially because the
> >> >> byte-based approach already works.
> >> >>
> >> >
> >> > In other words, you're saying that the token-based approach: (b) won't be
> >> > as fast as the bytes-based approach can be, and (a) requires much effort
> >> > to be spent on implementing the reverse reading of tokens?  (i.e.,
> >> > a backwards gets())
> >>
> >> Yes.
> >>
> >> The reverse reading is quite hard (in the diff_file.c case) because of
> >> the chunked reading of files. A line may be split by a "chunk
> >> boundary" (it may even be split in the middle of an eol sequence
> >> (between \r and \n), and it all still needs to be
> >> canonicalized/normalized correctly (that's why we'll also need a
> >> reverse normalization function). The current forward get_next_token
> >> does this very well, and the reverse should be analogous, but I just
> >> find it hard to reason about, and to get it implemented correctly. It
> >> will take me a lot of time, and with a high probability of introducing
> >> subtle bugs for special edge cases.
> >>
> >
> > OK, so a reverse get_next_token() could be tricky to implement.  But,
> > worse, won't having it mean that implementors of svn_diff_fns_t can't
> > have streamy access to their source?  Since they would be required to
> > provide sometimes a token from the start and sometimes a token from the
> > end...
> >
> > Well, it's not streamy in our usual sense of the word, but it's
> > "double-streamy" (no one requests the middle byte until either all
> > bytes before or all bytes after it were transmitted already)
> 
> Oooh, I hadn't considered other implementors besides the internal
> diff_memory.c and diff_file.c.


Just to clarify: diff_file.c and diff_memory.c are the only implementors
of svn_diff_fns_t in the core code, correct?

> You're right, that would impose
> additional constraints on other implementors. I don't know if being
> non-streamy (or less streamy anyway) would be problem ...
> 

We should have asked this before, but:

Do we know who are the other implementors / typical use-cases of
svn_diff_fns_t?

> In my last commit on the -tokens branch, I added a flag "open_at_end"
> to the datasource_open function (function of svn_diff_fns_t), so the
> datasource could be opened at the end. Also, other supporting
> functions were needed: token_pushback_prefix, token_pushback_suffix
> (to "push back" tokens that were read too far when scanning for
> prefix/suffix) and the dreaded datasource_get_previous_token. Anyway,
> the high-level idea was:
> 
> 1) Open datasources at the end.
> 
> 2) Scan for identical suffix (getting tokens in reverse).
> 
> 3) At first non-match: pushback suffix token, and note where suffix starts.
> 
> 4) Open datasources at the beginning.
> 
> 5) Scan for identical prefix (getting tokens normally, but without hash).
> 
> 6) At first non-match: pushback prefix token.
> 
> 7) Run the "old algorithm": getting tokens forward, but with hash. The
> get_next_token function should stop (return NULL for token) when it
> arrives at the suffix.
> 
> 
> Sidestep: I just now realized that I probably don't need to have the
> "reverse normalization algorithm" for implementing get_previous_token.
> The call to the normalization function in get_next_token is (I think)
> only needed to be able to calculate the hash. But since
> get_previous_token doesn't need to calculate hashes, I may be able to
> get by without normalization there. I'd only need to normalize inside
> token_compare, and I *think* I can just to that "forwardly", instead
> of backwards. Just thinking out loud here ...
> 

Is "normalization function" the thing that collapses newlines and
whitespaces if -x-w or -x--ignore-eol-style is in effect?  If so,
another purpose of calling that might be to make suffixes that are
not bytewise-identical, but only modulo-whitespace-identical, also
be lumped into the "identical suffix".

(Haven't dived into the code yet; the above is based on my understanding
of your high-level descriptions)

> So: that makes the token-approach again a little bit more possible.
> But do we want it? It requires a lot more from implementors of
> svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
> optimization to all implementors of svn_diff_fns_t ...
> 

Another option is to add the "read backwards" callbacks to the API, but
designate them as optional.  Then you only do the suffix half of the
optimization of both/all sources support the backwards callbacks.

(Is it a good idea?  Don't know.  But it's one option.)

> >> >> Any thoughts?
> >> >>
> >> >
> >> > -tokens/BRANCH-README mentions one of the advantages of the tokens
> >> > approach being that the comparison is done only after whitespace and
> >> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
> >> > effect).  IIRC, the -bytes approach doesn't currently take advantage of
> >> > these -x flags?
> >> >
> >> > What is the practical effect of the fact the -bytes approach doesn't
> >> > take advantage of these flags?  If a file (with a moderately long
> >> > history) has had all its newlines changed in rN, then I assume that your
> >> > -bytes optimizations will speed up all the diffs that 'blame' performs
> >> > on that file, except for the single diff between rN and its predecessor?
> >>
> >> Yes, I thought that would be an important advantage of the tokens
> >> approach, but as you suggest: the practical effect will most likely be
> >> limited. Indeed, only this single diff between rN and its predecessor
> >> (which may be 1 in 1000 diffs) will not benifit from
> >> prefix/suffix-optimization. Even if the rate of such changes is like
> >> 1/100th (let's say you change leading tabs to spaces, and vice versa,
> >> every 100 revisions), it will hardly be noticeable.
> >>
> >> The perfectionist in me still thinks: hey, you can also implement this
> >> normalization with the byte-based approach (in a streamy way). But
> >> that will probably be quite hard, because I'd have to take into
> >
> > We can add that additional optimization after the basic -bytes support
> > is working?  No need to implement all possible optimizations in one
> > shot (let alone those of them that might have little relative benefit).
> 
> Ok, that minor optimization can definitely wait.
> 
> 
> So the choice still boils down to:
> 1) A specific implementation inside diff_file.c (byte-based).
> 

Not a bad option, I think.  The improvement isn't as useful in
diff_memory.c, since that API isn't likely to be used for large
amounts of data anyway (e.g., trunk uses it only for propery diffs).

> 2) A generic implementation for all implementors of svn_diff_fns_t,
> which still requires some work (get_previous_token), and imposes more
> constraints on svn_diff_fns_t implementors ...
> 

Might also be a not bad option... :-)

> -- 
> Johan

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Reply via email to