> > but now i get to step into the git xdiff code and see !!

here's the call to the xdiff implementation:

887                     mustbe0(xdl_recs_cmp(&dd1, 0, dd1.nrec, &dd2, 0, 
dd2.nrec,
888                                 kvdf, kvdb, (xp.flags & XDF_NEED_MINIMAL) 
!= 0,
889                                 &xenv));

git added other implementations (patience, histogram) and patched different 
infrastructure on for those.
but i think this is the core xdiff implementation that uses the first-class 
infrastructure in the xdiff structures.

all the variables from xdf1 and xdf2 have been copied into dd1 and dd2:

878                     dd1.nrec = xe->xdf1.nreff;
879                     dd1.ha = xe->xdf1.ha;
880                     dd1.rchg = xe->xdf1.rchg;
881                     dd1.rindex = xe->xdf1.rindex;
882                     dd2.nrec = xe->xdf2.nreff;
883                     dd2.ha = xe->xdf2.ha;
884                     dd2.rchg = xe->xdf2.rchg;
885                     dd2.rindex = xe->xdf2.rindex;

So basically they are presenting the data to xdiff as if all the preprocessed 
lines never existed.

878                     dd1.nrec = xe->xdf1.nreff; // this is the number of 
lines in file 1
879                     dd1.ha = xe->xdf1.ha;      // this is the "hash" of 
every line. it's actually an index into a hash list, same thing
880                     dd1.rchg = xe->xdf1.rchg;  // this is the change vector 
-- the algorithm sets 1 to indicate the line does not match
881                     dd1.rindex = xe->xdf1.rindex; // this is the vector of 
pointers to line data including the actual content

The goal of the algorithm is to correctly fill rchg, which is its output.

In this buggy case, rchg already has some 1s in it (hummm i could just set it 
to 0s and see if there's a difference ... well if i do that i'd better not stop 
my debugging session in case it doesn't help!) maybe try that in a bit unsure

(gdb) s
xdl_recs_cmp (dd1=0x7ffff4b08880, off1=0, lim1=2, dd2=0x7ffff4b088c0, off2=0, 
lim2=1, kvdf=0x50c000000290, kvdb=0x50c0000002c0, need_min=1,
    xenv=0x7ffff4b08840) at 
/home/karl3/projects/zinc/third_party/xdiff/xdiffi.c:259
259             unsigned long const *ha1 = dd1->ha, *ha2 = dd2->ha;
(gdb) list
254      * (marking changed lines) is done in the two boundary reaching checks.
255      */
256     int xdl_recs_cmp(diffdata_t *dd1, long off1, long lim1,
257                      diffdata_t *dd2, long off2, long lim2,
258                      long *kvdf, long *kvdb, int need_min, xdalgoenv_t 
*xenv) {
259             unsigned long const *ha1 = dd1->ha, *ha2 = dd2->ha;
260
261             /*
262              * Shrink the box by walking through each diagonal snake (SW 
and NE).
263              */

This function is supported by xdiffi.c it's kind of the focus of the xdiff 
library. (previously i was working with xprepare.c which sets up all those 
structures).

i kind of don't expect it to be that long and complex, but i do expect to be 
highly confused by it.

let's list some more lines

261             /*
262              * Shrink the box by walking through each diagonal snake (SW 
and NE).
263              */
(gdb) list
264             for (; off1 < lim1 && off2 < lim2 && ha1[off1] == ha2[off2]; 
off1++, off2++);
265             for (; off1 < lim1 && off2 < lim2 && ha1[lim1 - 1] == ha2[lim2 
- 1]; lim1--, lim2--);
266
267             /*
268              * If one dimension is empty, then all records on the other one 
must
269              * be obviously changed.
270              */
271             if (off1 == lim1) {
272                     char *rchg2 = dd2->rchg;
273                     long *rindex2 = dd2->rindex;
(gdb) list
274
275                     for (; off2 < lim2; off2++)
276                             rchg2[rindex2[off2]] = 1;
277             } else if (off2 == lim2) {
278                     char *rchg1 = dd1->rchg;
279                     long *rindex1 = dd1->rindex;
280
281                     for (; off1 < lim1; off1++)
282                             rchg1[rindex1[off1]] = 1;
283             } else {
(gdb) list
284                     xdpsplit_t spl;
285                     spl.i1 = spl.i2 = 0;
286
287                     /*
288                      * Divide ...
289                      */
290                     if (xdl_split(ha1, off1, lim1, ha2, off2, lim2, kvdf, 
kvdb,
291                                   need_min, &spl, xenv) < 0) {
292
293                             return -1;
(gdb) list
294                     }
295
296                     /*
297                      * ... et Impera.
298                      */
299                     if (xdl_recs_cmp(dd1, off1, spl.i1, dd2, off2, spl.i2,
300                                      kvdf, kvdb, spl.min_lo, xenv) < 0 ||
301                         xdl_recs_cmp(dd1, spl.i1, lim1, dd2, spl.i2, lim2,
302                                      kvdf, kvdb, spl.min_hi, xenv) < 0) {
303
(gdb) list
304                             return -1;
305                     }
306             }
307
308             return 0;
309     }

see? not very long or complex-looking. it looks like its recursive!
I'm surprised to encounter a recursive function here, I remember when I wrote 
recursive functions before I knew about state stacks ... didn't learn too much 
new after that ... but i used to get a lot of stack overflows, surprised now to 
see recursive functions

261             /*
262              * Shrink the box by walking through each diagonal snake (SW 
and NE).
263              */
(gdb) list
264             for (; off1 < lim1 && off2 < lim2 && ha1[off1] == ha2[off2]; 
off1++, off2++);
265             for (; off1 < lim1 && off2 < lim2 && ha1[lim1 - 1] == ha2[lim2 
- 1]; lim1--, lim2--);

"snakes" are a term from myers first paper on LCS. i don't immediately recall 
what it refers to, and it's quite important for understanding the terminology 
of these algorithms (and this is likely that first one as it is still very 
popular) ... i think a snake is a set of matching lines that can be thought of 
as a diagonal path through a grid of potential line matchings, one file's line 
numbers on one axis, and one file's line numbers on another.

so in 264 and 265 above, it's looking for starting and ending lines that match, 
and increasing off and decreasing lim to narrow the problem domain

after these lines it then runs some quick checks for unchangedness, if either 
pair of offs and lims end up touching. we won't have that condition as the 
window is only partly filled. the ending lines will mismatch; the condition 
ha1[lim1 - 1] == ha2[lim2 - 1] will be false.

i guess i'd better step into that and verify i'm right!

Reply via email to