On Fri, 15 Apr 2005, C. Scott Ananian wrote:
>
> I think examining the rsync algorithms should convince you that finding
> common chunks can be fairly efficient.
Note that "efficient" really depends on how good a job you want to do, so you can tune it to how much CPU you can afford to waste on the problem. For example, my example had this thing where we merged five different functions into one function, and it is truly pretty efficient to find things like that _IF_ we only look at the files that changed (since the set of files that change in any one particular commit tends to be small, relative to the whole repository). There are many good algorithms for finding "common code", and with modern hardware that is basically instantaneous if you look at a few tens of files. For example, people wrote efficient things to compare _millions_ of lines of code for the whole SCO saga - you can do quite well. Some googling comes up with for example http://minnie.tuhs.org/Programs and applying those to a smallish set of files is quite efficient. What is _not_ necessarily as easy is the situation where you notice that a new set of lines appeared, but you don't see any place that matches that set of lines in the set of CHANGED files. That's actually quite common, ie let's say that you have a new filesystem or a new driver, and almost always it's based on a template or something, and you _would_ be able to see where that template came from, except it's not in that _changed_ set. And that is still doable, but now you really have to compare against the whole tree if you want to do it. Even _that_ is actually efficient if you cache the hashes - that's how the comparison tools compare two totally independent trees against each other, and it makes it practically possible to do even that expensive O(n**2) operation in reasonable time. It's certainly possible to do exactly the same thing for the "new code got added, does it bear any similarity to old code" case. Note! This is a question that is relevant and actually is in the realm of the "possible to find the answer interactively". It may fairly expensive, but the point is that this is the kind of relevant question that really does depend on the fundamental notion that "data matters more than any local changes". And when you think about the problem in that form, you find these kinds of interesting questions that you _can_ answer. Because the way git identifies data, the example "is there any other relevant code that may actually be similar to the newly added code" is actually not that hard to do in git. Remember: the way to answer that question is to have a cache of hashes of the contents. Guess what git _is_? You can now index your line-based hashes of contents against the _object_ hashes that git keeps track of, and you suddenly have an efficient way to actually look up those hashes. NOTE! All of this is outside the scope of git itself. This is all "visualization and comparison tools" built up on top of git. And I'm not at all interested in writing those tools myself, and I'm absolutely not signing up for that part. All I'm arguing for is that the git architecture is actually a very good architecture for doing these kinds of very very cool tools. Linus - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html