Thanks for a rather detailed explanation of some of what we have been discussing, Chris. The overall outline is roughly what I assumed was there, but some of the details were, to put it politely, fuzzy.
I see resemblances to how a web page is loaded and operated. Very different, of course, but at some level not so much. A typical web page is read in as HTML with various keyword regions expected, such as <BODY> ... </BODY> or <DIV ...> ... </DIV>, with things often cleanly nested inside others. The browser makes nodes galore in some kind of tree format, with an assortment of objects whose attributes or methods represent aspects of what it sees. The resulting treelike structure goes by names like the DOM. To a certain approximation, this tree starts out a certain way but is regularly being manipulated (or perhaps a copy is), and it is regularly examined to decide how to display the page on the screen at the moment, based on the current tree contents and another set of rules in Cascading Style Sheets.

But bits and pieces of JavaScript are also embedded or imported that can read aspects of the tree (and more), modify its contents, and arrange for all kinds of asynchronous events, with bits of code invoked when you click a button or hover, when an image finishes loading, or every 100 milliseconds. It can insert new objects into the DOM too. And of course there can be interactions with restricted local storage as well as with servers and code running there. It is quite a mess, but in some ways I see analogies.

Your program reads a stream of data, looks for tokens, and eventually turns things into a tree of sorts that represents relationships, to a point. Additional structures eventually appear at run time that let you store collections of references to variables, such as environments or namespaces, and the program derived from the trees makes changes as it goes; in a language like Python it can even possibly change the running program in some ways. (I sketch a tiny Python version of this below.)

These are not at all the same thing, but they share a certain set of ideas and methods and can be very powerful as things interact. In the web case, a CSS selector can match regions with some class or ID, or that are the third element of a bullet list, and more (JavaScript libraries like jQuery offer similarly powerful searching), and changes follow. A CSS rule that previously ignored some region as not having a particular class might start including it after a JavaScript segment, woken by an event listener for, say, a mouse hovering over an area, changes that part of the DOM (a node, say) to be in that class. Suddenly the area on your screen changes background or whatever the CSS now dictates. We have multiple systems written in an assortment of "languages" that complement each other. Some running programs, especially ones that use asynchronous methods like threads or callbacks on events, such as a GUI, can effectively do similar things.

The errors in the web situation have analogies too, as in what happens if a region of HTML is not well-formed or uses a keyword that is not recognized. This becomes even more interesting in XML, where anything can be a keyword and you often need other kinds of files (often themselves in a markup language) to define what the XML can look like and what restrictions it carries, such as whether a <BOOK> can have multiple authors but only one optional publication date, and so on. It can be fascinating and highly technical.

So I am up for the challenge of studying anything from early compilers for languages of my youth to more recent approaches, including some like what you show. I have time to kill and this might be more fun than other things, for a while.
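Here is that tiny sketch, for concreteness. It is only my own illustration (the variable names and the toy source string are mine, nothing standard): the source text becomes a tree, the tree becomes runnable code, and the running code's environment can be inspected and changed afterward, a little like a script poking at the DOM.

import ast

source = "x = 1\nprint(x)"

tree = ast.parse(source)        # text -> syntax tree, my rough analogue of the DOM
print(ast.dump(tree))           # the tree is ordinary objects you can walk and inspect

namespace = {}                  # a plain dict standing in for a module's environment
exec(compile(tree, "<demo>", "exec"), namespace)   # tree -> bytecode -> run; prints 1

namespace["x"] = 99             # the environment can be mutated after the fact
exec("print(x)", namespace)     # prints 99 - the "page" now renders differently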
There was a guy around a few years ago who suggested he would create a system where you could write a series of configuration files for ANY language, and his system would then compile or run programs in each and every such language. Was that on this forum? Whatever happened to him? Although what he promised seemed a bit too much, I can see from your comments below how, in some ways, a limited amount of that might be done for some subset of languages that can be parsed and manipulated as described.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail....@python.org> On Behalf Of Chris Angelico
Sent: Monday, October 10, 2022 11:55 PM
To: python-list@python.org
Subject: Re: What to use for finding as many syntax errors as possible.

On Tue, 11 Oct 2022 at 14:26, <avi.e.gr...@gmail.com> wrote:
>
> I stand corrected Chris, and others, as I pay the sin tax.
>
> Yes, there are many kinds of errors that logically fall into different
> categories or phases of evaluation of a program and some can be
> determined by a more static analysis almost on a line by line (or
> "statement" or "expression", ...) basis and others need to sort of
> simulate some things and look back and forth to detect possible
> incompatibilities and yet others can only be detected at run time and
> likely way more categories depending on the language.
>
> But when I run the Python interpreter on code, aren't many such phases
> done interleaved and at once as various segments of code are parsed
> and examined and perhaps compiled into block code and eventually executed?

Hmm, depends what you mean. Broadly speaking, here's how it goes:

0) Early pre-parse steps that don't really matter to most programs, like checking character set. We'll ignore these.
1) Tokenize the text of the program into a sequence of potentially-meaningful units.
2) Parse those tokens into some sort of meaningful "sentence".
3) Compile the syntax tree into actual code.
4) Run that code.

Example:

>>> code = """def f():
...     print("Hello, world", 1>=2)
...     print(Ellipsis, ...)
...     return True
... """
>>>

In step 1, all that happens is that a stream of characters (or bytes, depending on your point of view) gets broken up into units.

>>> import tokenize
>>> for t in tokenize.tokenize(iter(code.encode().split(b"\n")).__next__):
...     print(tokenize.tok_name[t.exact_type], t.string)

It's pretty spammy, but you can see how the compiler sees the text. Note that, at this stage, there's no real difference between the NAME "def" and the NAME "print" - there are no language keywords yet. Basically, all you're doing is figuring out punctuation and stuff.

Step 2 is what we'd normally consider "parsing". (It may well happen concurrently and interleaved with tokenizing, and I'm giving a simplified and conceptualized pipeline here, but this is broadly what Python does.) This compares the stream of tokens to the grammar of a Python program and attempts to figure out what it means. At this point, the linear stream turns into a recursive syntax tree, but it's still very abstract.
>>> import ast
>>> ast.dump(ast.parse(code))
"Module(body=[FunctionDef(name='f', args=arguments(posonlyargs=[], args=[], kwonlyargs=[], kw_defaults=[], defaults=[]), body=[Expr(value=Call(func=Name(id='print', ctx=Load()), args=[Constant(value='Hello, world'), Compare(left=Constant(value=1), ops=[GtE()], comparators=[Constant(value=2)])], keywords=[])), Expr(value=Call(func=Name(id='print', ctx=Load()), args=[Name(id='Ellipsis', ctx=Load()), Constant(value=Ellipsis)], keywords=[])), Return(value=Constant(value=True))], decorator_list=[])], type_ignores=[])"

(Side point: I would rather like to be able to pprint.pprint(ast.parse(code)) but that isn't a thing, at least not currently.)

This is where the vast majority of SyntaxErrors come from. Your code is a sequence of tokens, but those tokens don't mean anything. It doesn't make sense to say "print(def f[return)]" even though that'd tokenize just fine. The trouble with the notion of "keeping going after finding an error" is that, when you find an error, there are almost always multiple possible ways that this COULD have been interpreted differently. It's as likely to give nonsense results as actually useful ones.

(Note that, in contrast to the tokenization stage, this version distinguishes between the different types of word. The "def" has resulted in a FunctionDef node, the "print" is a Name lookup, and both "..." and "True" have now become Constant nodes - previously, "..." was a special Ellipsis token, but "True" was just a NAME.)

Step 3: the abstract syntax tree gets compiled into actual runnable code. This is where that small handful of other SyntaxErrors come from. With these errors, you absolutely _could_ carry on and report multiple; but it's not very likely that there'll actually *be* more than one of them in a file. Here's some perfectly valid AST parsing:

>>> ast.dump(ast.parse("from __future__ import the_past"))
"Module(body=[ImportFrom(module='__future__', names=[alias(name='the_past')], level=0)], type_ignores=[])"
>>> ast.dump(ast.parse("from __future__ import braces"))
"Module(body=[ImportFrom(module='__future__', names=[alias(name='braces')], level=0)], type_ignores=[])"
>>> ast.dump(ast.parse("def f():\n\tdef g():\n\t\tnonlocal x\n"))
"Module(body=[FunctionDef(name='f', args=arguments(posonlyargs=[], args=[], kwonlyargs=[], kw_defaults=[], defaults=[]), body=[FunctionDef(name='g', args=arguments(posonlyargs=[], args=[], kwonlyargs=[], kw_defaults=[], defaults=[]), body=[Nonlocal(names=['x'])], decorator_list=[])], decorator_list=[])], type_ignores=[])"

If you were to try to actually compile those to bytecode, they would fail:

>>> compile(ast.parse("from __future__ import braces"), "-", "exec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "-", line 1
SyntaxError: not a chance

And finally, step 4 is actually running the compiled bytecode. Any errors that happen at THIS stage are going to be run-time errors, not syntax errors (a SyntaxError raised at run time would be from compiling other code).

> So is the OP asking for something other than a Python Interpreter that
> normally halts after some kind of error? Tools like a linter may
> indeed fit that mold.

Yes, but linters are still going to go through the same process laid out above. So if you have a huge pile of code that misuses "await" in non-async functions, sure! Maybe a linter could half-compile the code, then probe it repeatedly until it gets past everything. That's not exactly a common case, though.
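To make that "probe it repeatedly" idea concrete, here is a rough sketch of my own (the function name and the blank-out-the-line recovery rule are just illustrative assumptions, not how any real linter works): compile without running, report the SyntaxError, crudely erase the offending line, and try again.

def report_syntax_errors(source, max_errors=10):
    # Compile only - never run - and try to report more than one
    # SyntaxError by blanking the offending line and retrying.
    lines = source.split("\n")
    for _ in range(max_errors):
        try:
            compile("\n".join(lines), "<src>", "exec")
            return                               # everything left compiles cleanly
        except SyntaxError as err:
            print(f"line {err.lineno}: {err.msg}")
            if not err.lineno or err.lineno > len(lines) or not lines[err.lineno - 1].strip():
                return                           # no sensible way to make progress
            lines[err.lineno - 1] = ""           # the naive "recovery" guess

report_syntax_errors("x = )\ny = 1\nz = ]")      # reports line 1, then line 3

With two deliberate errors it reports both, but the second report is only as trustworthy as the guess that produced it - which is exactly the caveat that follows.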
More likely, you'll have parsing errors, and the only way to "move past" a parsing error is to guess at what token should be added or removed to make it "kinda work". Alternatively, you'll get some kind of messy heuristics to try to restart parsing part way down, but that's pretty imperfect too.

> This may limit some of the objections of when an error makes it hard
> for the parser to find some recovery point to continue from as no code
> is being run and no harmful side effects happen by continuing just an analysis.

It's pretty straightforward to ensure that no code is run - just compile it without running it. It's still possible to attack the compiler itself, but that's far less concerning than running arbitrary code. Attacks on the compiler are usually deliberate; code you don't want to run yet might be a perfectly reasonable call to os.unlink()...

> Time to go read some books about modern ways to evaluate a language
> based on more mathematical rules including more precisely what is syntax versus ...
>
> Suggestions?

I'd recommend looking at Python's compile() function, the ast and tokenize modules, and everything that they point to.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list