On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d...@daniel.shahaf.name> wrote: > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100: >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d...@daniel.shahaf.name> >> wrote: >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100: >> >> I am now considering to abandon the tokens-approach, for the following >> >> reasons: >> > ... >> >> So, unless someone can convince me otherwise, I'm probably going to >> >> stop with the token approach. Because of 2), I don't think it's worth >> >> it to spend the effort needed for 1), especially because the >> >> byte-based approach already works. >> >> >> > >> > In other words, you're saying that the token-based approach: (b) won't be >> > as fast as the bytes-based approach can be, and (a) requires much effort >> > to be spent on implementing the reverse reading of tokens? (i.e., >> > a backwards gets()) >> >> Yes. >> >> The reverse reading is quite hard (in the diff_file.c case) because of >> the chunked reading of files. A line may be split by a "chunk >> boundary" (it may even be split in the middle of an eol sequence >> (between \r and \n), and it all still needs to be >> canonicalized/normalized correctly (that's why we'll also need a >> reverse normalization function). The current forward get_next_token >> does this very well, and the reverse should be analogous, but I just >> find it hard to reason about, and to get it implemented correctly. It >> will take me a lot of time, and with a high probability of introducing >> subtle bugs for special edge cases. >> > > OK, so a reverse get_next_token() could be tricky to implement. But, > worse, won't having it mean that implementors of svn_diff_fns_t can't > have streamy access to their source? Since they would be required to > provide sometimes a token from the start and sometimes a token from the > end... > > Well, it's not streamy in our usual sense of the word, but it's > "double-streamy" (no one requests the middle byte until either all > bytes before or all bytes after it were transmitted already)
Oooh, I hadn't considered other implementors besides the internal diff_memory.c and diff_file.c. You're right, that would impose additional constraints on other implementors. I don't know if being non-streamy (or less streamy anyway) would be problem ... In my last commit on the -tokens branch, I added a flag "open_at_end" to the datasource_open function (function of svn_diff_fns_t), so the datasource could be opened at the end. Also, other supporting functions were needed: token_pushback_prefix, token_pushback_suffix (to "push back" tokens that were read too far when scanning for prefix/suffix) and the dreaded datasource_get_previous_token. Anyway, the high-level idea was: 1) Open datasources at the end. 2) Scan for identical suffix (getting tokens in reverse). 3) At first non-match: pushback suffix token, and note where suffix starts. 4) Open datasources at the beginning. 5) Scan for identical prefix (getting tokens normally, but without hash). 6) At first non-match: pushback prefix token. 7) Run the "old algorithm": getting tokens forward, but with hash. The get_next_token function should stop (return NULL for token) when it arrives at the suffix. Sidestep: I just now realized that I probably don't need to have the "reverse normalization algorithm" for implementing get_previous_token. The call to the normalization function in get_next_token is (I think) only needed to be able to calculate the hash. But since get_previous_token doesn't need to calculate hashes, I may be able to get by without normalization there. I'd only need to normalize inside token_compare, and I *think* I can just to that "forwardly", instead of backwards. Just thinking out loud here ... So: that makes the token-approach again a little bit more possible. But do we want it? It requires a lot more from implementors of svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix optimization to all implementors of svn_diff_fns_t ... >> >> Any thoughts? >> >> >> > >> > -tokens/BRANCH-README mentions one of the advantages of the tokens >> > approach being that the comparison is done only after whitespace and >> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in >> > effect). IIRC, the -bytes approach doesn't currently take advantage of >> > these -x flags? >> > >> > What is the practical effect of the fact the -bytes approach doesn't >> > take advantage of these flags? If a file (with a moderately long >> > history) has had all its newlines changed in rN, then I assume that your >> > -bytes optimizations will speed up all the diffs that 'blame' performs >> > on that file, except for the single diff between rN and its predecessor? >> >> Yes, I thought that would be an important advantage of the tokens >> approach, but as you suggest: the practical effect will most likely be >> limited. Indeed, only this single diff between rN and its predecessor >> (which may be 1 in 1000 diffs) will not benifit from >> prefix/suffix-optimization. Even if the rate of such changes is like >> 1/100th (let's say you change leading tabs to spaces, and vice versa, >> every 100 revisions), it will hardly be noticeable. >> >> The perfectionist in me still thinks: hey, you can also implement this >> normalization with the byte-based approach (in a streamy way). But >> that will probably be quite hard, because I'd have to take into > > We can add that additional optimization after the basic -bytes support > is working? No need to implement all possible optimizations in one > shot (let alone those of them that might have little relative benefit). Ok, that minor optimization can definitely wait. So the choice still boils down to: 1) A specific implementation inside diff_file.c (byte-based). 2) A generic implementation for all implementors of svn_diff_fns_t, which still requires some work (get_previous_token), and imposes more constraints on svn_diff_fns_t implementors ... -- Johan