On Feb 24, 9:57 pm, Ira Baxter <idbax...@semdesigns.com> wrote:
> On Feb 23, 11:37 am, rjf <fate...@gmail.com> wrote:
>
> > On Feb 23, 9:17 am, "Dr. David Kirkby" <david.kir...@onetel.net>
> > wrote:
> > > On 02/22/11 10:57 PM, Dr. David Kirkby wrote:
> > > > On 02/22/11 03:49 PM, rjf wrote:
> > [snip]. The real difficulty is
> > > >> to implement a Mathematica language parser, since the language
> > > >> fails to fit the standard expectations for computer languages.
>
> It does?  It is a context-free language, therefore parsable by
> any parser capable of parsing context-free languages.
The traditional, efficient, and easily automated version of syntax-directed compiling separates lexical analysis and parsing into two stages. Lexical analysis, the collection of characters into tokens, can generally be done with a simple technique based on finite-state machines; read about, for example, the programs Lex or flex:
http://en.wikipedia.org/wiki/Flex_lexical_analyser

The generation of a parser that operates on a stream of tokens is also well understood, and such a parser can be written "by hand" or produced with a tool like YACC or Bison:
http://en.wikipedia.org/wiki/Yacc
A little googling will show you both working together. The separation into stages vastly shrinks the size of the parser description, arguably speeds up the processing (speed depends on lots of things), and makes programming and debugging simpler. If the grammar fits one of the favorite sub-categories of the context-free languages (LALR(1), LL(1)), the parser runs in time linear in the length of the "sentence" or program being parsed, i.e. O(n). (A small sketch of this two-stage arrangement appears below, after the quoted text.)

Now, it is possible to have a grammar that is not in one of these categories, and even one that is ambiguous, and it is still possible to write a parser for it, e.g. using something like the technology used by Semantic Designs. While this can be handy to demonstrate, the parser may take considerably more time; I think it may be more like O(n^3) for the "hard" parts, or worse if the grammar is ambiguous. Since computers are so fast, maybe it doesn't matter much anymore. When a program seems sluggish, people probably blame it on their network bandwidth or the fact that their computer is checking for updates in their mailbox, etc.

> > > I know you said that, but I've heard different from another source. See
> > > http://groups.google.com/group/comp.compilers/msg/8c4e6ccad3c40599
> > > The person there, who is the CTO of a company producing this
> > > http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html
> > > which has an option for a Mathematica parser.
>
> It does.
>
> > > He says Mathematica is not a particularly difficult language to parse,
> > > and a GLR parser is a bit over the top.
>
> It isn't, and GLR is (capable of parsing a context-free language)
> but AFAICT, isn't really needed to parse MMa.

For most of the language it is not. The only problems are the places where extra information about the context at the parse position is needed to determine how to proceed.

> > > Here you can see a Mathematica parser is listed for the DMS toolkit
> > > http://www.semanticdesigns.com/Products/FrontEnds/index.html?Home=DMS...
>
> > > So I don't know what to believe Richard. You are saying the Mathematica
> > > language can't be parsed with a conventional parser, so (you?) had to
> > > hand-write the parser for MockMMA,
>
> Our parser for MMa consists of a relatively conventional lexical
> definition for tokens, and a very straightforward grammar for the
> language itself.

Relatively conventional? Not conventional. Very straightforward? Well, how do you know it is correct? WRI does not publish a grammar. You could publish yours, if you chose to, though you might treat it as proprietary.

> > > yet someone from a commercial company selling this DMS toolkit claims
> > > the language is not particularly difficult to parse, and have a front
> > > end for their toolkit (a GLR parser) able to parse Mathematica.
>
> > Here are my suggestions:
>
> > 1. The guy is lying. He doesn't really have a Mathematica parser that
> > works.
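To make the two-stage arrangement described above concrete, here is a minimal sketch in Python (mine, purely illustrative; the toy expression grammar, the token kinds, and the Parser class are all invented for the example, and this is not the MockMMA, WRI, or DMS code):

import re

# Stage 1: a finite-state-style tokenizer.  Each match is a number, an
# identifier, or a single punctuation character, with leading blanks skipped.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z]\w*)|(.))")

def tokenize(text):
    tokens = []
    for num, name, punct in TOKEN_RE.findall(text):
        if num:
            tokens.append(("num", int(num)))
        elif name:
            tokens.append(("name", name))
        elif not punct.isspace():
            tokens.append(("punct", punct))
    tokens.append(("eof", None))
    return tokens

# Stage 2: an LL(1) recursive-descent parser over the token stream.
# Toy grammar:  expr -> term (("+"|"-") term)*   term -> num | name | "(" expr ")"
class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i]

    def advance(self):
        tok = self.toks[self.i]
        self.i += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in (("punct", "+"), ("punct", "-")):
            op = self.advance()[1]
            node = (op, node, self.term())
        return node

    def term(self):
        kind, val = self.advance()
        if kind in ("num", "name"):
            return val
        if (kind, val) == ("punct", "("):
            node = self.expr()
            assert self.advance() == ("punct", ")"), "missing )"
            return node
        raise SyntaxError("unexpected token: %r" % ((kind, val),))

print(Parser(tokenize("a + (2 - b)")).expr())   # prints ('+', 'a', ('-', 2, 'b'))

The point is only the division of labor: the tokenizer makes a finite-state-style pass over characters, the parser makes one left-to-right pass over the token stream, and for an LL(1) grammar like this the whole thing runs in time roughly linear in the length of the input.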
> Hmph.  For your example r[s[]] below, which you claim is *so* hard to
> parse,

Well, it is not *so* hard to parse; I even suggest how to do it. I also provided a parser to do it. It is just one of several glitches.

> here's the output of DMS parsing it using our Mathematica grammar:
>
> C:\DMS\Domains\Mathematica\Tools\Parser\Source>run ../domainparser +
> .... snipped long and not particularly illuminating example...
> Exiting with final status 0

If that is supposed to be the useful output of DMS, then the tool looks rather difficult to use. I suppose one could generate such an output from my parser by essentially tracing all the programs that do reductions. A parse tree dump? I think this is better:

  r[[s[]]]  -->  (Part r (s))

> Yes, it parses much bigger, much more complex examples.
> JPL has used it internally.

I would be fascinated to learn why. If JPL has a Mathematica license, then the parsing can be done without your program, and with almost no difficulty. You set up the Mathematica system to read a program and produce another one printed in FullForm. Parsing FullForm is trivial. Or you can change it to print Lisp. This is not to say it is impossible to conceive of a possible use, just that I am curious what that might be.

> Does it parse all of current 2011 MMa
> syntax?
> Probably not, we haven't used it much recently.  But I spent 4 years
> working on an 80,000-line MMa program

That's too bad.

> so I think I understand
> the basics of the language,

Maybe. It is certainly possible to write long programs and still not use very much of a language.

> and given its Lisp-like syntax,
> I don't think I'll be surprised.

Mathematica's FullForm is like Lisp. Its usual input syntax is not like Lisp at all, having prefix, infix, postfix, and bracketing operators. Internally, it is plausible to think of Mathematica as having a tree-like representation that might be encoded in Lisp, but that is no different from any other programming language whose intermediate form is a tree. So one could just as easily believe that C, Pascal, or Fortran is Lisp-like.

> Wolfram could be crazy, though.

Probably not relevant here. It is uncertain whether Wolfram even wrote any parser, or had someone else do it.

> > 2. The company has a really neat parser generating tool and a lot of
> > engineering to go with it, and Mathematica can be easily parsed with it.
>
> It is indeed the case that we have a neat parser generator and
> a lot of engineering.

Your company is certainly not unique in having access to parser generator technology.

> DMS parses much, much harder languages, such as C++
> (famously hard to parse, ask the GNU guys who cracked their skull
> on it) and COBOL (not famously so hard, but wait till
> you try to parse the data declarations using COBOL's integer nesting
> levels instead of brackets, and get the nesting right), as well
> as Fortran with nested and shared DO-continues
> (getting the loop nesting right coming directly
> out of the parser).  DMS has been applied to carrying out massive
> automated transformations on C and C++ systems.

I'm not sure about COBOL, having successfully avoided any detailed examination of it, but Fortran is famously easy to parse, once one figures out the tricks for lexical analysis. Because blanks are insignificant, DO1I=5 is an assignment statement but DO1I=5,6 is the beginning of a DO loop; the lexer has to look ahead for a top-level comma to tell them apart (a small sketch of this appears below). I don't know why "coming directly out of the parser" is important to anyone. Typical compilers munch on some intermediate form repeatedly.
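To illustrate that Fortran lexing trick, here is a rough sketch (my own, not taken from any actual compiler; the classify() function and its regular expression are invented for the example, and a real Fortran front end handles many more cases):

import re

def classify(stmt):
    # Fortran ignores blanks, so "DO 1 I = 5, 6" and "DO1I = 5" collapse
    # to "DO1I=5,6" and "DO1I=5" once the blanks are stripped.
    s = stmt.replace(" ", "").upper()
    m = re.match(r"DO(\d+)?([A-Z][A-Z0-9]*)=", s)
    if m:
        # Only a comma outside parentheses, after the "=", makes it a DO loop.
        depth = 0
        for ch in s[m.end():]:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                return "DO loop (label %s, index %s)" % (m.group(1), m.group(2))
    return "assignment"

print(classify("DO1I=5"))      # assignment (to the variable DO1I)
print(classify("DO1I=5,6"))    # DO loop (label 1, index I)

Since the blanks carry no meaning, the lookahead for a top-level comma is the only thing that distinguishes the two statements.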
There would be nothing shameful about running an additional pass over a parse output to fix "continue" statements, if necessary.

> DMS is the system I built after I decided Mathematica was a piece
> of junk as a transformation system.

What do you mean by "transformation system"?

> > 3. The company has nothing much beyond a good term project in a
> > compiler-technology course (perhaps at a graduate level) plus a bunch
> > of engineering and marketing.
>
> > My guess is 1 + 3
>
> These are very generous guesses at the facts, Mr. Fateman.
> You might have wasted a few minutes of your time and checked
> the web site.

I did. Your home page
http://semanticdesigns.com/Products/DMS/DMSToolkit.html?Home=Main
has a pull-down menu of languages you claim to process. It does not include Mathematica. However, on the left (with no links) is a listing of languages, and in that list is Mathematica. So whether Mathematica is truly supported, or is just a marketing claim -- who could tell?

> > A VERY simple example.
>
> > r[s[]]
>
> > is legal in Mathematica.
>
> Yes.
>
> > A traditional lexical analyzer, including the one apparently used by
> > Mathematica, typically looks for the longest string of characters that
> > makes a token. Hence a===b has a token === which is "SameQ" even though
> > there are tokens = and ==. So the longest one is found, in general.
>
> True.

Actually, experimentation using Mathematica shows that in this case the longest token is NOT found. Indeed, there is no token "]]".

> > anyway, how does one do lexical analysis or scanning on
> > r[s[]] ?
>
> > The correct tokenization is r, [, s, [, ], ] .  But the maximal
> > token deal returns r, [, s, [, ]] .
>
> > What does this mean? It means that the conventional separation of
> > lexical analysis and parsing must be intermixed in parsing Mathematica.
>
> No, it doesn't.  It means you've made an assumption that
> might not be true.

You are correct here. Recent experimentation shows the WRI parser uses a different trick, which is that there is NO token "]]" at all (a toy illustration of this appears below).

> The lexer we have for Mma is not
> one of the open source ones, but what it does is pretty
> traditional: it produces a stream of lexemes.

OK.

> While our lexical machinery is in fact capable of
> asking about the left context, and we in fact use
> that to parse other ugly languages, and indeed we might
> have done what you said, we in fact did not do so.
> We don't use any special parsing tricks for this.

OK, what you are saying is that your lexical analyzer is not (purely?) a finite-state machine. That's OK, but for various reasons -- some tradition, some efficiency -- it is not what people generally choose.

> Since you are the parsing expert, I'll leave it to
> you to figure out how.

It's certainly well explored. My favorite extension beyond finite-state lexing is described by Henry Baker in a paper called "META -- Pragmatic Parsing" (google for a copy). This does not do full context-free parsing, but it is quite powerful.

> > I know of no other programming language that requires this.
>
> Try any language which has syntax to change what is considered
> to be the terminator characters. This is possible in the legacy 4GL
> called Natural, but you are unlikely to be familiar with that.

Never heard of it. Sounds like a crock. Only a marketing hack would pick that name -- "Our system is programmed in Natural (r) Language."
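Returning to the "]]" point above: here is a toy illustration (mine, not WRI's lexer, MockMMA's, or DMS's; the operator table is invented and covers just enough to show the idea) of a maximal-munch tokenizer whose table simply contains no "]]" token:

OPERATORS = ["===", "==", "=", "[", "]", ","]     # note: no "]]" token at all

def lex(text):
    tokens, i = [], 0
    longest_first = sorted(OPERATORS, key=len, reverse=True)
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isalnum():                        # crude identifier/number rule
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            for op in longest_first:              # maximal munch: longest match wins
                if text.startswith(op, i):
                    tokens.append(op)
                    i += len(op)
                    break
            else:
                raise SyntaxError("unexpected character %r" % ch)
    return tokens

print(lex("a===b"))    # ['a', '===', 'b']
print(lex("r[s[]]"))   # ['r', '[', 's', '[', ']', ']']
# If "]]" were added to OPERATORS, the second call would instead return
# ['r', '[', 's', '[', ']]'], which is exactly the glitch described above.

So maximal munch still gives "===" as one token, while leaving "]]" out of the table lets r[s[]] come out as individual brackets that the parser can pair up itself.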
> I believe you can also do this in Perl; worse, it
> happens at runtime, so a statement containing
> the terminator can actually precede the terminator-setting statement
> by an arbitrary distance. Yes, Perl is a bitch to lex, let alone
> parse.
>
> > Oh, there are also other glitches in Mathematica of this sort.
>
> Yes. There's a beaut of an ambiguity that needs resolution
> between a product term and a pattern designation.

You mean the difference between a_b and a_ b? (The first is a single pattern, Pattern[a, Blank[b]]; the second is the product of the pattern a_ and the symbol b.) That doesn't seem so hard to me.

> Most of the rest of the language seems pretty straightforward
> compared to most real programming languages.

A lot of the language is unexceptional, true.

> Mathematica is otherwise not hard to parse, and you don't need
> a hand-written parser to do it.

That is like saying a chocolate-chip cookie has no chocolate in it, except for the parts that are chocolate. There are other small parts here and there that appeared to be difficult for a traditional lexer/parser division, and so I ended up writing a recursive-descent parser with hacks here and there. I do not know what kind of parser is used by WRI, and I would not be surprised if it has changed from time to time. They also parse other input forms, including "traditional" and 2-D; see ToExpression[..., TraditionalForm].

RJF

--
To post to this group, send an email to sage-devel@googlegroups.com
To unsubscribe from this group, send an email to
sage-devel+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/sage-devel
URL: http://www.sagemath.org