
Segher Boessenkool <seg...@kernel.crashing.org>:
> > And the "simple scripts" argument dismisses the fact that those scripts
> > are built on top of complex software.  It just doesn't hold water IMHO.
> 
> This is the Unix philosophy though!

I'm now finishing a book in which I have a lot to say about this, inspired
in part by experience with reposurgeon.

One of the major concepts I introduce in the book is "semantic
locality".  This is a property of data representations and structures.
A representation has good semantic locality when the context you need
to interpret any individual part of it is reliably nearby.

A classic example of a representation with good semantic locality is the
Unix password file.  All the information associated with a username is
on one line, so it is easy to parse and extract individual records.
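To make that concrete, here's a minimal sketch (the sample line is
hypothetical, not from any real system): because every field of a record
sits on one colon-separated line, a generic one-line-at-a-time tool like
awk can interpret it with no outside context.

```shell
# A hypothetical passwd-style record: all fields for one user on one
# line, colon-separated.
line='esr:x:1000:1000:Eric S. Raymond:/home/esr:/bin/bash'

# Pull out the home directory (field 6) with awk.  Nothing beyond this
# single line is needed to make sense of it.
echo "$line" | awk -F: '{print $6}'
```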

Databases have very poor semantic locality.  So do version-control
systems.  You need a lot of context to understand any individual data
element, and that context can be arbitrarily far away in terms of
retrieval complexity and time.

The Unix philosophy of small loosely-coupled tools has few more
fervent advocates than me. But I have come to understand that
it almost necessarily fails in the presence of data representations
with poor semantic locality.

This constraint can be inverted and used as a guide to good design:
to enable loose coupling, design your representations to have
good semantic locality.

If the Unix password file were an SQL database, could you grep it?
No. You'd have to use an SQL-specific query method rather than a
generic utility like grep that is uncoupled from the specifics of
the database's schema.
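A sketch of the contrast, using a made-up two-line passwd fragment (the
file path and SQL table name are illustrative assumptions, not real):

```shell
# Write a hypothetical passwd fragment to a temporary file.
cat > /tmp/passwd.sample <<'EOF'
root:x:0:0:root:/root:/bin/sh
esr:x:1000:1000:Eric S. Raymond:/home/esr:/bin/bash
EOF

# One generic tool, zero schema knowledge: the whole record comes back.
grep '^esr:' /tmp/passwd.sample

# The database equivalent needs a schema-aware client, something like:
#   SELECT * FROM passwd WHERE name = 'esr';
# and grep can't help you at all.
```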

The ideal data representation for enabling the Unix ecology of tools
is textual, self-describing, and has good semantic locality.

Historically, Unix programmers have understood the importance of
textuality and self-description.  But we've lacked the concept of,
and a term for, semantic locality.  Having one lets us talk about
things that were hard to even notice before.

Here's one: the effort required to parallelize an operation on
a data structure is inversely proportional to its semantic locality.

If it has good semantic locality, you can carve it into pieces that
are easy units of work for parallelization.  If it doesn't...you
can't. Best case is you'll need locking for shared parts. Worst case
is that the referential structure of the representation is so
tangled that you can't parallelize at all.
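The easy case looks like this in practice (a minimal sketch; `split -n
l/4` assumes GNU coreutils, and the counting workload is just a stand-in
for real per-chunk work):

```shell
# Hypothetical line-oriented input: 1000 self-contained records.
seq 1000 > /tmp/records.txt

# Because each line is interpretable on its own, we can carve the file
# into four chunks of whole lines (GNU coreutils split)...
split -n l/4 /tmp/records.txt /tmp/chunk.

# ...and hand the chunks to four parallel workers with xargs -P.
# No locking, no shared context; here each worker just counts lines.
ls /tmp/chunk.* | xargs -P 4 -n 1 wc -l
```

Try doing that to a representation where each record points into
arbitrary other records, and the carving step is where it all falls
apart.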


Version-control systems rely on data structures with very poor
semantic locality.  It is therefore predictable that attacking them
with small unspecialized tools and scripting is...difficult.

It can be done, sometimes, with sufficient cleverness, but the results
are too often like making a pig fly by strapping JATO units to
it. That is to say: a brief and glorious ascent followed by entirely
predictable catastrophe.

Having trouble believing me?  OK, here's a challenge: rewrite GCC's
code-generation stage in awk/sed/m4.  

The attempt, if you actually made it, would teach you that poor
semantic locality forces complexity on the tools that have to deal
with it.

And that, ladies and gentlemen, is why reposurgeon has to be as
large and complex as it is.
-- 
                <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>