In comp.lang.scheme toby <t...@telegraphics.com.au> wrote:
> In my opinion Knuth believed in the value of literate programming for
> similar reasons: To try to exploit existing cognitive training. If
> your user base is familiar with English, or mathematical notation, or
> some other lexicography, try to exploit the pre-wired associations.
> Clearly this involves some intuition.
I've dabbled in the linguistics field a little bit, and this thread made me remember a certain topic delineated in this paper:

http://complex.upf.es/~ricard/SWPRS.pdf

This paper uses a concept that I would say needs more investigation: the mapping of syntax onto metric spaces. In fact, the paper literally makes a statement about how physically close two words are when mapped onto a line of text, and its language networks build graph topologies based upon the distance of words from each other as they relate in syntactic form. So the foundations for constructing a metric to "measure" syntax are at least partly available. If you squint at the math in the above paper, you can see how it could apply to multidimensional metric spaces and to syntax in different modalities (like the comparison between reading written words and hearing spoken words).

For example, here is the same semantic idea as a regex defined three ways:

english-like:
  Match the letter "a", followed by one or more letter "p"s, followed by
  the letter "l", then optionally the letter "e".

scheme-like:
  (regex (char-class #\a) (re+ (char-class #\p))
         (char-class #\l) (re? (char-class #\e)))

perl-like:
  /ap+le?/

The perl-like one would be the one chosen by most programmers, but the various reasons why it is chosen will fluctuate. Can we do better? Can we give a predictable measurement of some syntactic quantities that we can optimize and get a predictable answer?

Most people would say the perl-like form has the least amount of "syntactic garbage" and is "short". How do we meaningfully define those two terms? One might say "syntactic garbage" is syntax which doesn't relate *at all* to the actual semantic objects, and "short" might mean elimination of redundant syntax and/or transformation of explicit syntax into implicit syntax already available in the metric space.

What do I mean by explicit versus implicit? In the scheme-like form, we explicitly denote the evaluation and class of the various semantic objects of the regex before applying the "regex" function across the evaluated arguments. Whitespace is used only to separate the morphemes of the scheme syntax itself. The embedding metric space of the syntax (meaning the line of text indexed by character position) does nothing to help or hinder the expression of the semantic objects.

In the perl-like form, we implicitly denote the evaluation of the regex by using a prototyping-based syntax, meaning the inherent qualities of the embedding metric space are put to use. Specifically, evaluation happens left to right in reading order, and semantic objects evaluate directly to themselves and take as arguments semantic objects related to them directly in the embedding metric space. For example, the ? takes as its argument the semantic object at location index[?]-1. Grouping parentheses act as a VERY simple tokenization system to group multiple objects into one syntactic datum. Given this analysis, perl-like regexes generally auto-quote themselves, an advantageous use of the embedded metric space in which they reside (aka the line of text).

The english-like form is the worst for explicit modeling because it abstracts the semantic objects into a meta-space that is then referenced by the syntax in the embedding space of the line itself. In human reasoning, that is what the quotes mean. Out of the three syntax models, only the perl one has no redundant syntax and takes up the smallest amount of the metric space into which it is embedded--clearly seen by observation.
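If it helps to see the "embedding metric space" idea concretely, here is a small Python sketch that just asks: where do the semantic objects of this regex land in each form, and how much of the line does each form take up? The substrings I picked to stand in for the semantic objects are my own simplification, not anything from the paper.

    # The same regex written three ways, treated as plain strings indexed by
    # character position (the "embedding metric space" of the syntax).
    forms = {
        "english": 'Match the letter "a", followed by one or more letter "p"s, '
                   'followed by the letter "l", then optionally the letter "e".',
        "scheme":  "(regex (char-class #\\a) (re+ (char-class #\\p)) "
                   "(char-class #\\l) (re? (char-class #\\e)))",
        "perl":    "/ap+le?/",
    }

    # Hand-picked substrings standing in for the six semantic objects
    # (literal a, one-or-more, literal p, literal l, optional, literal e).
    semantic_tokens = {
        "english": ['"a"', "one or more", '"p"', '"l"', "optionally", '"e"'],
        "scheme":  ["#\\a", "re+", "#\\p", "#\\l", "re?", "#\\e"],
        "perl":    ["a", "+", "p", "l", "?", "e"],
    }

    for name, text in forms.items():
        positions = sorted(text.index(tok) for tok in semantic_tokens[name])
        print(f"{name:8s} occupies {len(text):3d} chars; "
              f"semantic objects sit near indices {positions}")

On these crude counts, the perl form packs the same six objects into eight characters, while the other two spread them across a whole line of mostly syntactic filler.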
Consider: if we have the quantities "syntactic datums" versus "semantic objects", then one might make an equation like this:

                     semantic objects
    expressiveness = ----------------
                     syntactic datums

And of course, expressiveness rises as semantic objects begin to outweigh syntactic datums. This seems a very reasonable, although simplistic, model to me and would be a good start in my estimation. I say simplistic because the semantic objects are not taken in relation to each other on the metric space in which the syntactic datums are embedded; I'll get to this in a bit.

Now, what do we do with the above equation? Well, we define what we can, and then optimize the hell out of it. "Semantic objects" for a computer language is probably a fixed quantity: there are only so many operators and grouping constructs, and usually very few (often just one) function-description semantic objects. So we are left with a free variable, "syntactic datums", which we should make as small as possible.

If we don't take into consideration the topological mapping of the semantic objects into the syntactic datum space, then every semantic object needs at least one syntactic datum to denote it, so the number of syntactic datums at least equals the number of semantic objects. Expressiveness is at best one, which is pretty poor. It could be worse--just add redundant syntactic datums, now it is worse!

If we do take into consideration the mapping of the semantic objects onto the metric space, then maybe this becomes the new equation:

                                            semantic objects
    expressiveness = -------------------------------------------------------------------
                     distance(index[obj1], index[obj2], ..., index[objn]) * syntactic datums

The distance() function in this model measures how spread out the semantic objects are in the embedding space: say, the average distance of each object's syntactic datum from their common centroid. (Of course, there are other models of expressiveness which would need to be explored based upon this incremental idea I'm presenting.) Larger distances between semantic objects, or larger syntactic constructs, mean less expressiveness. This is interesting, because it gives a rough number that sorts the three syntactic models I provided for the regex, and the sort follows conventional wisdom. (A rough code sketch of this measure is tacked onto the end of this post.) This is a descriptivist model of syntax measurement.

Obviously, the fun question is: can we search an optimization tree to try to find an optimal set of syntactic datums to represent a known and finite set of semantic objects, taking into consideration the features of the metric space into which the syntactic datums are embedded? Maybe for another day....

Later,
-pete
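Here is the promised rough Python sketch of both versions of the measure, applied to the three regex forms from earlier (the tables are repeated so this runs on its own). The hand-picked token lists, the character-level datum count, and the mean-distance-from-centroid definition of distance() are all my own assumptions; treat the numbers as illustrative only.

    # Rough sketch of the two expressiveness measures.  "Syntactic datums" are
    # approximated as non-whitespace characters, and distance() as the mean
    # distance of the semantic objects' positions from their centroid; both
    # choices are assumptions for illustration, not definitions.
    forms = {
        "english": 'Match the letter "a", followed by one or more letter "p"s, '
                   'followed by the letter "l", then optionally the letter "e".',
        "scheme":  "(regex (char-class #\\a) (re+ (char-class #\\p)) "
                   "(char-class #\\l) (re? (char-class #\\e)))",
        "perl":    "/ap+le?/",
    }

    semantic_tokens = {
        "english": ['"a"', "one or more", '"p"', '"l"', "optionally", '"e"'],
        "scheme":  ["#\\a", "re+", "#\\p", "#\\l", "re?", "#\\e"],
        "perl":    ["a", "+", "p", "l", "?", "e"],
    }

    def spread(positions):
        # Mean distance of the semantic objects' positions from their centroid.
        centroid = sum(positions) / len(positions)
        return sum(abs(p - centroid) for p in positions) / len(positions)

    for name, text in forms.items():
        tokens = semantic_tokens[name]
        positions = [text.index(tok) for tok in tokens]
        datums = sum(1 for ch in text if not ch.isspace())    # crude datum count
        simple  = len(tokens) / datums                        # objects / datums
        spatial = len(tokens) / (spread(positions) * datums)  # distance-weighted
        print(f"{name:8s} simple={simple:.3f}  spatial={spatial:.4f}")

On these crude counts the perl form scores highest and the english-like form lowest under both measures, which matches the conventional-wisdom ordering claimed above.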