[New thread] Segher Boessenkool <seg...@kernel.crashing.org>:
> > And the "simple scripts" argument dismisses the fact that those scripts
> > are built on top of complex software. It just doesn't hold water IMHO.
>
> This is the Unix philosophy though!
I'm now finishing a book in which I have a lot to say about this,
inspired in part by experience with reposurgeon.

One of the major concepts I introduce in the book is "semantic
locality". This is a property of data representations and structures.
A representation has good semantic locality when the context you need
to interpret any individual part of it is reliably nearby.

A classic example of a representation with good semantic locality is a
Unix password file. All the information associated with a username is
on one line. It is accordingly easy to parse and extract individual
records.

Databases have very poor semantic locality. So do version-control
systems. You need a lot of context to understand any individual data
element, and that context can be arbitrarily far away in terms of
retrieval complexity and time.

The Unix philosophy of small, loosely-coupled tools has few more
fervent advocates than me. But I have come to understand that it
almost necessarily fails in the presence of data representations with
poor semantic locality. This constraint can be inverted and used as a
guide to good design: to enable loose coupling, design your
representations to have good semantic locality.

If the Unix password file were an SQL database, could you grep it? No.
You'd have to use an SQL-specific query method rather than a generic
utility like grep that is uncoupled from the specifics of the
database's schema.

The ideal data representation for enabling the Unix ecology of tools
is textual, self-describing, and has good semantic locality.
Historically, Unix programmers have understood the importance of
textuality and self-description. But we've lacked the concept of, and
a term for, semantic locality. Having that allows one to talk about
some things that were hard to even notice before.

Here's one: the effort required to parallelize an operation on a data
structure is inversely proportional to its semantic locality. If it
has good semantic locality, you can carve it into pieces that are easy
units of work for parallelization. If it doesn't...you can't. Best
case, you'll need locking for the shared parts. Worst case, the
referential structure of the representation is so tangled that you
can't parallelize at all.

Version-control systems rely on data structures with very poor
semantic locality. It is therefore predictable that attacking them
with small unspecialized tools and scripting is...difficult. It can be
done, sometimes, with sufficient cleverness, but the results are too
often like making a pig fly by strapping JATO units to it. That is to
say: a brief and glorious ascent followed by entirely predictable
catastrophe.

Having trouble believing me? OK, here's a challenge: rewrite GCC's
code-generation stage in awk/sed/m4. The attempt, if you actually made
it, would teach you that poor semantic locality forces complexity onto
the tools that have to deal with it.

And that, ladies and gentlemen, is why reposurgeon has to be as large
and complex as it is.
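To make the parallelization point concrete, here is a minimal sketch
(mine alone, nothing to do with reposurgeon's internals; it assumes
the standard seven-field /etc/passwd format). Because every record is
one self-contained line, the file partitions into independent units of
work with no locking anywhere:

    #!/usr/bin/env python3
    # Sketch: parallel processing of a representation with good
    # semantic locality. Each /etc/passwd line carries all the
    # context needed to interpret it, so workers share no state.
    from multiprocessing import Pool

    def parse_record(line):
        # name:password:UID:GID:GECOS:home:shell -- all on one line.
        name, _, uid, gid, gecos, home, shell = line.rstrip("\n").split(":")
        return name, shell

    if __name__ == "__main__":
        with open("/etc/passwd") as f:
            lines = f.readlines()
        with Pool() as pool:
            # A trivial parallel map; no worker needs another's data.
            for name, shell in pool.map(parse_record, lines):
                print(name, shell)

Now try writing the equivalent over a version-control store. There is
no such clean partition, because interpreting any one element means
chasing references arbitrarily far across the representation.

-- 
Eric S. Raymond <http://www.catb.org/~esr/>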