On Feb 24, 9:57 pm, Ira Baxter <idbax...@semdesigns.com> wrote:
> On Feb 23, 11:37 am, rjf <fate...@gmail.com> wrote:
>
> > On Feb 23, 9:17 am, "Dr. David Kirkby" <david.kir...@onetel.net>
> > wrote:
>
> > > On 02/22/11 10:57 PM, Dr. David Kirkby wrote:
>
> > > > On 02/22/11 03:49 PM, rjf wrote:
> > [snip]. The real difficulty is
> > > >> to implement a Mathematica language parser, since the language
> > > >> fails to fit the standard expectations for computer languages.
>
> It does?  It is a context-free language, therefore parsable by
> any parser capable of parsing context-free languages.

The traditional, efficient, and easily automated version of
syntax-directed compiling separates lexical analysis and parsing into
two stages.  Lexical analysis, or the collection of characters into
tokens, can generally be done with a simple technique based on
finite-state machines.  Read about, for example, the programs Lex or
flex:
http://en.wikipedia.org/wiki/Flex_lexical_analyser
The generation of a parser that operates on a stream of tokens is
well understood, and such a parser can be written "by hand" or through
the use of a tool like YACC or Bison:
http://en.wikipedia.org/wiki/Yacc

Or a little googling can show you both, working together.
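For concreteness, here is a toy version of that two-stage arrangement:
a regex-based (finite-state) lexer feeding a separate recursive-descent
parser.  The grammar and token names are invented for illustration; this
is not the output of flex or YACC.

```python
import re

# Stage 1: a finite-state (regex) lexer.  Each named group is a token
# class; whitespace is recognized and discarded.
TOKENS = [("NUM", r"\d+"), ("ID", r"[A-Za-z]\w*"),
          ("OP", r"[+*()]"), ("WS", r"\s+")]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKENS))

def lex(text):
    """Characters -> tokens, maximal munch, left to right."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())
    yield ("EOF", "")

def parse(text):
    """Stage 2: tokens -> tree, for the toy grammar
    expr := term ('+' term)*,  term := atom ('*' atom)*,
    atom := NUM | ID | '(' expr ')'."""
    toks = list(lex(text))
    pos = 0
    def peek():
        return toks[pos]
    def eat():
        nonlocal pos
        t = toks[pos]; pos += 1
        return t
    def atom():
        _kind, text_ = eat()
        if text_ == "(":
            e = expr()
            eat()                      # consume ')'
            return e
        return text_
    def term():
        node = atom()
        while peek()[1] == "*":
            eat(); node = ("Times", node, atom())
        return node
    def expr():
        node = term()
        while peek()[1] == "+":
            eat(); node = ("Plus", node, term())
        return node
    return expr()
```

Each stage is a few lines, the token stream is linear in the input, and
the parser never looks at raw characters.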

The separation of stages vastly shrinks the size of the parser
description, arguably speeds up the processing (speed depends on lots
of things), and makes programming and debugging simpler.  If the
grammar used fits one of the favorite sub-categories of context-free
languages (LALR(1), LL(1)), the parser is going to be fast relative to
the length of the "sentence" or program being parsed: linear in the
length, i.e. O(n).

Now it is possible to have a grammar that is not in one of these
categories, and even one that is ambiguous.  And it is still possible
to write a parser for it, e.g. using something like the technology
used by Semantic Designs.  While this can be handy to demonstrate, the
parser may take considerably more time: I think it may be more like
O(n^3) for the "hard" parts, or worse if the grammar is ambiguous.
Since computers are so fast, maybe it doesn't matter much anymore.
When a program seems sluggish, people probably blame it on their
network bandwidth or the fact that their computer is checking for
updates in their mailbox, etc.

>
> > > > I know you said that, but I've heard different from another source. See
> > > >http://groups.google.com/group/comp.compilers/msg/8c4e6ccad3c40599
> > > > The person there is the CTO of a company producing this
> > > >http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html
> > > > which has an option for a Mathematica parser.
>
> It does.
>
> > > > He says Mathematica is not a particularly difficult language
> > > > to parse, and a GLR parser is a bit over the top.
>
> It isn't, and GLR is (capable of parsing a context free language)
> but AFAICT, isn't really needed to parse MMa.

For most of the language it is not.  The only problems are the places
where extra information on the context of the parse position is needed
to determine how to proceed.

>
> > > Here you can see a Mathematica parser is listed for the DMS toolkit
> > >http://www.semanticdesigns.com/Products/FrontEnds/index.html?Home=DMS...
>
> > > So I don't know what to believe Richard. You are saying the Mathematica 
> > > language
> > > can't be parsed with a conventional parser, so (you?) had to hand-write 
> > > the parser for
> > > MockMMA,
>
> Our parser for MMa consists of a relatively conventional lexical
> definition
> for tokens, and a very straightforward grammar for the language
> itself.

Relatively conventional?  Not conventional.  Very straightforward?
Well, how do you know it is correct? WRI does not publish a grammar.
You could publish it, if you chose to, though you might treat that
as proprietary.

>
> > > yet someone from a commercial company selling this DMS toolkit claims
> > > the language is not particularly difficult to parse, and have a front end 
> > > for
> > > their toolkit (a GLR parser) able to parse Mathematica.
>
> > Here are my suggestions:
>
> > 1. The guy is lying. He doesn't really have a Mathematica parser that
> > works.
>
> Hmph.  For your example r[s[]] below, which you claim is *so* hard to
> parse,

Well, it is not *so* hard to parse; I even suggest how to do it.  I
also provided a parser to do it.  It is just one of several glitches.


> here's the output of DMS parsing it using our Mathematica grammar:
>
> C:\DMS\Domains\Mathematica\Tools\Parser\Source>run ../domainparser +
>
.... snipped long and not particularly illuminating example...
> Exiting with final status 0

If that is supposed to be the useful output of DMS, then the tool
looks rather difficult to use.  I suppose one could generate such an
output from my parser by essentially tracing all the programs that do
reductions.  A parse tree dump??

I think this is better:

r[[s[]]]   -->   (Part r (s))

>
> Yes, it parses much bigger, much more complex examples.
> JPL has used it internally.

I would be fascinated to learn why.  If JPL has a Mathematica license,
then the parsing can be done without your program, and with almost no
difficulty: you set up the Mathematica system to read a program and
produce another one printed in FullForm.  Parsing FullForm is trivial.
Or you can change it to print Lisp.  This is not to say it is
impossible to conceive of a possible use, just that I am curious what
that might be.
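To illustrate "Parsing FullForm is trivial": FullForm is just
head[arg1, ..., argn] with nesting, so a reader fits in a dozen lines.
This is a hedged sketch covering only symbols, integers, and brackets,
not Mathematica's full lexical syntax; note that brackets here are
single-character tokens, so the "]]" issue discussed elsewhere in this
thread never arises.

```python
import re

# One token per match: a symbol, an integer, or a single bracket/comma.
TOK = re.compile(r"[A-Za-z$][A-Za-z0-9$]*|\d+|[\[\],]")

def parse_fullform(s):
    """Read FullForm text into nested tuples, e.g.
    Part[r, s[]] -> ('Part', 'r', ('s',))."""
    toks = TOK.findall(s)
    pos = 0
    def expr():
        nonlocal pos
        head = toks[pos]; pos += 1
        if pos < len(toks) and toks[pos] == "[":
            pos += 1                      # consume '['
            args = []
            while toks[pos] != "]":
                args.append(expr())
                if toks[pos] == ",":
                    pos += 1              # consume ','
            pos += 1                      # consume ']'
            return (head,) + tuple(args)
        return head                       # a bare symbol or number
    return expr()
```

Printing Lisp instead is the same walk with parentheses in place of
brackets.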

> Does it parse all of current 2011 MMa syntax?
> Probably not, we haven't used it much recently.  But I spent 4 years
> working on an 80,000 line MMa program

That's too bad.

> so I think I understand
> the basics of the language,

Maybe.  It is certainly possible to write long programs without using
much of a language.

> and given its Lisp-like syntax,
> I don't think I'll be surprised.

Mathematica's FullForm is like Lisp.  Its usual input syntax is not
like Lisp at all, having prefix, infix, postfix and bracketing
operators.

Internally, it is plausible to think of Mathematica as having a
tree-like representation that might be encoded in Lisp, but that is no
different from any other programming language whose intermediate form
is a tree.  So one could just as easily believe that C, Pascal, or
Fortran is Lisp-like.


>   Wolfram could be crazy, though.

Probably not relevant here.  It is uncertain whether Wolfram even
wrote any parser himself, or had someone else do it.

>
> > 2. The company has a really neat parser generating tool and a lot of
> > engineering
> > to go with it and Mathematica can be easily parsed with it.
>
> It is indeed the case that we have a neat parser generator and
> a lot of engineering.

Your company is certainly not unique in having access to parser
generator technology.

>
> DMS parses much, much harder languages, such as C++
> (famously hard to parse, ask the GNU guys who cracked their skull
> on it) and COBOL (not famously so hard, but wait till
> you try to parse the data declarations using COBOLs integer nesting
> levels
> instead of brackets, and get the nesting right), as well
> as Fortran with nested and shared DO-continues
> (getting the loop nesting right coming directly
> out of the parser).  DMS has been applied to carrying out massive
> automated transformations on C and C++ systems.
>

I'm not sure about COBOL, having successfully avoided any detailed
examination of it, but Fortran is famously easy to parse, once one
figures out tricks for lexical analysis.  Since blanks are not
significant,
DO1I=5    is an assignment statement (DO1I = 5), but
DO1I=5,6  is the beginning of a do-loop (DO 1 I = 5, 6).

I don't know why "coming directly out of the parser" is important to
anyone.  Typical compilers munch on some intermediate form repeatedly.
There would be nothing shameful about running an additional pass over
a parse output to fix "continue" statements, if necessary.
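The DO trick can be sketched as a scan-ahead test: strip blanks, and
decide "DO" is a keyword only if a top-level comma follows the "=".
This is deliberately simplified; real Fortran scanners also cope with
continuation lines, character constants, and so on.

```python
def is_do_statement(stmt):
    """Heuristic from classic Fortran lexing: DO1I=5 is an assignment
    to the variable DO1I, while DO1I=5,6 begins a do-loop, because
    blanks are insignificant and only the comma distinguishes them."""
    s = stmt.replace(" ", "").upper()   # blanks carry no meaning
    if not s.startswith("DO"):
        return False
    eq = s.find("=")
    if eq == -1:
        return False
    depth = 0
    for ch in s[eq + 1:]:               # scan ahead for a top-level comma
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return True
    return False
```

The parenthesis depth check matters: DO1I=F(5,6) is still an
assignment, since its only comma is inside an argument list.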


> DMS is the system I built after I decided Mathematica was a piece
> of junk as a transformation system.

What do you mean by "transformation system"?

>
> > 3. The company has nothing much beyond a good term project in a
> > compiler-technology
> > course (perhaps at a graduate level) plus a bunch of engineering and
> > marketing.
> > My guess is 1 + 3
>
> These are very generous guesses at the facts, Mr. Fateman.
> You might have wasted a few minutes of your time and checked
> the web site.

I did. Your home page
http://semanticdesigns.com/Products/DMS/DMSToolkit.html?Home=Main
has a pull-down menu of languages you claim to process.
It does not include Mathematica.

However, on the left  (with no links), is a listing
of languages. In that list is Mathematica.  So whether
Mathematica is truly supported or not, or is just a marketing
claim -- who could tell.

>
> > A VERY simple example.
> > r[s[]]
>
> > is legal in mathematica.
>
> Yes.
>
> > A traditional lexical analyzer, including the one apparently used by
> > mathematica,
> > typically looks for the longest string of characters that makes a
> > token.   Hence
> > a===b    has a token ===   which is  "SameQ"  even though there are
> > tokens =  and ==.
> > So the longest one is found, in general.
>
> True.

Actually, experimentation using Mathematica shows that in this case,
the longest token is NOT found.  Indeed, there is no token "]]".

>
> > anyway, how does one do lexical analysis or scanning on
> > r[s[]] ?
>
> > The correct tokenization is  r, [, s, [, ], ] .   but the maximal
> > token deal returns
> > r, [, s, [, ]] .
>
> > What does this mean?  It means that the conventional separation of
> > lexical analysis
> > and parsing must be intermixed in parsing Mathematica.
>
> No, it doesn't.  It means you've made an assumption that
> might not be true.

You are correct here.  Recent experimentation shows the WRI parser
uses a different trick, which is that there is NO token "]]".
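A sketch of that trick (my reconstruction, not WRI's code): maximal
munch over a token table that contains "===", "==", and "=", but only
the single-character bracket "]".  Because "]]" is simply not in the
table, r[s[]] lexes correctly with no feedback from the parser to the
lexer.

```python
import re

PUNCT = ["===", "==", "=", "[", "]", ","]   # longest first; note: no "]]"
NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def lex_mma(s):
    """Maximal-munch lexer over a fixed punctuation table."""
    out, i = [], 0
    while i < len(s):
        if s[i].isspace():
            i += 1
            continue
        m = NAME.match(s, i)
        if m:                            # identifier token
            out.append(m.group())
            i = m.end()
            continue
        for t in PUNCT:                  # first (longest) entry wins
            if s.startswith(t, i):
                out.append(t)
                i += len(t)
                break
        else:
            raise SyntaxError("unexpected character: " + s[i])
    return out
```

So a===b still lexes to the SameQ token, while each "]" comes out on
its own and bracket matching is left entirely to the parser.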

> The lexer we have for Mma is not
> one of the open source ones, but what it does is pretty
> traditional: it produces a stream of lexemes.
OK
> While our lexical machinery is in fact capable of
> asking about the left context, and we in fact use
> that to parse other ugly languages,  and indeed we might
> have done what you said, we in fact did not do so.
> We don't use any special parsing tricks for this.

OK, what you are saying is that your lexical analyzer is not (purely?)
a finite-state machine.  That's OK, but for various reasons, some
tradition, some efficiency, it is not what people generally choose.

>
> Since you are the parsing expert, I'll leave it to
> you to figure how.

It's certainly well explored.  My favorite extension
beyond finite state lexing is described by Henry Baker
in a paper called META -- Pragmatic Parsing
(google for a copy).  This does not do full context-free
parsing, but is quite powerful.


>
> > I know of no other programming language that requires this.
>
> Try any language which has syntax to change what are considered to
> be terminator characters.  This is possible in the legacy 4GL called
> Natural, but you are unlikely to be familiar with that.

Never heard of it.  Sounds like a crock.  Only a marketing hack
would pick that -- "Our system is programmed in Natural (r) Language."

> I believe you can also do this in Perl; worse, it happens at
> runtime, so a statement containing the terminator can actually
> precede the terminator-setting statement by an arbitrary distance.
> Yes, Perl is a bitch to lex, let alone parse.
>
> > Oh, there are also other glitches in mathematica of this sort.
>
> Yes. There's a beaut of an ambiguity that needs resolution
> between a product term and a pattern designation.

You mean the difference between a_b and  a_ b ?
That doesn't seem so hard to me.
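For what it's worth, the a_b / a_ b distinction seems reproducible with
ordinary maximal-munch lexing plus implicit multiplication.  A hedged
sketch follows; the tuple constructors are invented stand-ins, though
the underlying readings (a_b as Pattern[a, Blank[b]], a_ b as
Times[Pattern[a, Blank[]], b]) are standard Mathematica.

```python
import re

# Alternatives are tried in order, longest form first, so maximal
# munch grabs "a_b" as one token but stops at the space in "a_ b".
TOKEN = re.compile(
    r"[A-Za-z][A-Za-z0-9]*_[A-Za-z][A-Za-z0-9]*"   # a_b : typed pattern
    r"|[A-Za-z][A-Za-z0-9]*_"                       # a_  : untyped pattern
    r"|[A-Za-z][A-Za-z0-9]*"                        # a   : bare symbol
    r"|\s+")

def lex_terms(s):
    return [m.group() for m in TOKEN.finditer(s) if not m.group().isspace()]

def as_tree(tok):
    """Invented constructors standing in for FullForm heads."""
    if "_" in tok:
        head, _, blank = tok.partition("_")
        return ("Pattern", head, ("Blank", blank) if blank else ("Blank",))
    return tok

def parse_product(s):
    terms = [as_tree(t) for t in lex_terms(s)]
    # Adjacent terms multiply implicitly, as in Mathematica input.
    return terms[0] if len(terms) == 1 else ("Times",) + tuple(terms)
```

The whitespace is significant, but only in the ordinary sense that it
ends a token; no parser feedback is needed.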

> Most of the rest of the language seems pretty straightforward
> compared to most real programming languages.

A lot of the language is unexceptional, true.

>
> Mathematica is otherwise not hard to parse, and you don't need
> a hand-written parser to do it.

That is like saying a chocolate-chip cookie has no chocolate in it,
except for the parts that are chocolate.

There are other small parts here and there that appeared to be
difficult for a traditional lexer/parser division, and so I ended up
writing a recursive descent parser, with hacks here and there.
I do not know what kind of parser is used by WRI, and I would not be
surprised if it has changed from time to time.  They also parse other
forms, including "traditional" and 2-D; see
ToExpression[...,TraditionalForm]

RJF

-- 
To post to this group, send an email to sage-devel@googlegroups.com
To unsubscribe from this group, send an email to 
sage-devel+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/sage-devel
URL: http://www.sagemath.org
