[ finally getting back to this mail; having slept on it, etc. ] Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100: > On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d...@daniel.shahaf.name> > wrote: > > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100: > >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d...@daniel.shahaf.name> > >> wrote: > >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100: > >> >> I am now considering to abandon the tokens-approach, for the following > >> >> reasons: > >> > ... > >> >> So, unless someone can convince me otherwise, I'm probably going to > >> >> stop with the token approach. Because of 2), I don't think it's worth > >> >> it to spend the effort needed for 1), especially because the > >> >> byte-based approach already works. > >> >> > >> > > >> > In other words, you're saying that the token-based approach: (b) won't be > >> > as fast as the bytes-based approach can be, and (a) requires much effort > >> > to be spent on implementing the reverse reading of tokens? (i.e., > >> > a backwards gets()) > >> > >> Yes. > >> > >> The reverse reading is quite hard (in the diff_file.c case) because of > >> the chunked reading of files. A line may be split by a "chunk > >> boundary" (it may even be split in the middle of an eol sequence > >> (between \r and \n), and it all still needs to be > >> canonicalized/normalized correctly (that's why we'll also need a > >> reverse normalization function). The current forward get_next_token > >> does this very well, and the reverse should be analogous, but I just > >> find it hard to reason about, and to get it implemented correctly. It > >> will take me a lot of time, and with a high probability of introducing > >> subtle bugs for special edge cases. > >> > > > > OK, so a reverse get_next_token() could be tricky to implement. But, > > worse, won't having it mean that implementors of svn_diff_fns_t can't > > have streamy access to their source? Since they would be required to > > provide sometimes a token from the start and sometimes a token from the > > end... > > > > Well, it's not streamy in our usual sense of the word, but it's > > "double-streamy" (no one requests the middle byte until either all > > bytes before or all bytes after it were transmitted already) > > Oooh, I hadn't considered other implementors besides the internal > diff_memory.c and diff_file.c.
Just to clarify: diff_file.c and diff_memory.c are the only implementors of svn_diff_fns_t in the core code, correct? > You're right, that would impose > additional constraints on other implementors. I don't know if being > non-streamy (or less streamy anyway) would be problem ... > We should have asked this before, but: Do we know who are the other implementors / typical use-cases of svn_diff_fns_t? > In my last commit on the -tokens branch, I added a flag "open_at_end" > to the datasource_open function (function of svn_diff_fns_t), so the > datasource could be opened at the end. Also, other supporting > functions were needed: token_pushback_prefix, token_pushback_suffix > (to "push back" tokens that were read too far when scanning for > prefix/suffix) and the dreaded datasource_get_previous_token. Anyway, > the high-level idea was: > > 1) Open datasources at the end. > > 2) Scan for identical suffix (getting tokens in reverse). > > 3) At first non-match: pushback suffix token, and note where suffix starts. > > 4) Open datasources at the beginning. > > 5) Scan for identical prefix (getting tokens normally, but without hash). > > 6) At first non-match: pushback prefix token. > > 7) Run the "old algorithm": getting tokens forward, but with hash. The > get_next_token function should stop (return NULL for token) when it > arrives at the suffix. > > > Sidestep: I just now realized that I probably don't need to have the > "reverse normalization algorithm" for implementing get_previous_token. > The call to the normalization function in get_next_token is (I think) > only needed to be able to calculate the hash. But since > get_previous_token doesn't need to calculate hashes, I may be able to > get by without normalization there. I'd only need to normalize inside > token_compare, and I *think* I can just to that "forwardly", instead > of backwards. Just thinking out loud here ... > Is "normalization function" the thing that collapses newlines and whitespaces if -x-w or -x--ignore-eol-style is in effect? If so, another purpose of calling that might be to make suffixes that are not bytewise-identical, but only modulo-whitespace-identical, also be lumped into the "identical suffix". (Haven't dived into the code yet; the above is based on my understanding of your high-level descriptions) > So: that makes the token-approach again a little bit more possible. > But do we want it? It requires a lot more from implementors of > svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix > optimization to all implementors of svn_diff_fns_t ... > Another option is to add the "read backwards" callbacks to the API, but designate them as optional. Then you only do the suffix half of the optimization of both/all sources support the backwards callbacks. (Is it a good idea? Don't know. But it's one option.) > >> >> Any thoughts? > >> >> > >> > > >> > -tokens/BRANCH-README mentions one of the advantages of the tokens > >> > approach being that the comparison is done only after whitespace and > >> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in > >> > effect). IIRC, the -bytes approach doesn't currently take advantage of > >> > these -x flags? > >> > > >> > What is the practical effect of the fact the -bytes approach doesn't > >> > take advantage of these flags? If a file (with a moderately long > >> > history) has had all its newlines changed in rN, then I assume that your > >> > -bytes optimizations will speed up all the diffs that 'blame' performs > >> > on that file, except for the single diff between rN and its predecessor? > >> > >> Yes, I thought that would be an important advantage of the tokens > >> approach, but as you suggest: the practical effect will most likely be > >> limited. Indeed, only this single diff between rN and its predecessor > >> (which may be 1 in 1000 diffs) will not benifit from > >> prefix/suffix-optimization. Even if the rate of such changes is like > >> 1/100th (let's say you change leading tabs to spaces, and vice versa, > >> every 100 revisions), it will hardly be noticeable. > >> > >> The perfectionist in me still thinks: hey, you can also implement this > >> normalization with the byte-based approach (in a streamy way). But > >> that will probably be quite hard, because I'd have to take into > > > > We can add that additional optimization after the basic -bytes support > > is working? No need to implement all possible optimizations in one > > shot (let alone those of them that might have little relative benefit). > > Ok, that minor optimization can definitely wait. > > > So the choice still boils down to: > 1) A specific implementation inside diff_file.c (byte-based). > Not a bad option, I think. The improvement isn't as useful in diff_memory.c, since that API isn't likely to be used for large amounts of data anyway (e.g., trunk uses it only for propery diffs). > 2) A generic implementation for all implementors of svn_diff_fns_t, > which still requires some work (get_previous_token), and imposes more > constraints on svn_diff_fns_t implementors ... > Might also be a not bad option... :-) > -- > Johan