Re: What to use for finding as many syntax errors as possible.

Chris Angelico Mon, 10 Oct 2022 15:51:02 -0700

On Tue, 11 Oct 2022 at 09:18, Cameron Simpson <[email protected]> wrote:
>
> On 11Oct2022 08:02, Chris Angelico <[email protected]> wrote:
> >There's a huge difference between non-fatal errors and syntactic
> >errors. The OP wants the parser to magically skip over a fundamental
> >syntactic error and still parse everything else correctly. That's
> >never going to work perfectly, and the OP is surprised at this.
>
> The OP is not surprised by this, and explicitly expressed awareness that
> resuming a parse had potential for "misparsing" further code.
>
> I remain of the opinion that one could resume a parse at the next
> unindented line and get reasonable results a lot of the time.


The next line at the same indentation level as the line with the
error, or the next flush-left line? Either way, there's a weird and
arbitrary gap before you start parsing again, and you still have no
indication of what could make sense. Consider:

if condition # no colon
    code
else:
    code

To actually "restart" parsing, you have to make a guess of some sort.
Maybe you can figure out what the user meant to do, and parse
accordingly; but if that's the case, keep going immediately, don't
wait for an unindented line. If you wait for a blank line followed by
an unindented line, that might help with a notion of "next logical
unit of code", but it's very much dependent on the coding style, and
if you have a codebase that's so full of syntax errors that you
actually want to see more than one, you probably don't have a codebase
with pristine and beautiful code layout.
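
To make the guesswork concrete, here's a rough sketch of the "resume
at the next flush-left line" strategy, run against the example above.
The helper name find_syntax_errors and the resume rule are my own
assumptions for illustration, not anything CPython offers; it just
calls the built-in compile() repeatedly on whatever is left.

```python
SRC = (
    "if condition # no colon\n"
    "    code\n"
    "else:\n"
    "    code\n"
)

def find_syntax_errors(source):
    """Naive recovery: after each error, skip to the next flush-left line."""
    lines = source.splitlines(keepends=True)
    errors = []
    start = 0  # 0-based index of the line where the current attempt begins
    while start < len(lines):
        chunk = "".join(lines[start:])
        try:
            compile(chunk, "<sketch>", "exec")
        except SyntaxError as exc:
            # exc.lineno is 1-based within the chunk; map it back to the file
            err_line = start + (exc.lineno or 1)
            errors.append((err_line, exc.msg))
            # err_line (1-based) doubles as the 0-based index of the next line
            nxt = err_line
            while nxt < len(lines) and (
                not lines[nxt].strip() or lines[nxt][0] in " \t"
            ):
                nxt += 1
            start = nxt
        else:
            break  # the remainder compiled cleanly
    return errors

for lineno, msg in find_syntax_errors(SRC):
    print(lineno, msg)
```

This flags line 1 (the missing colon) and then line 3 as well. The
second complaint is about an "else:" that is only wrong because the
"if" it belonged to got thrown away, which is exactly the sort of
noise I'm talking about.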

> In fact, I expect that one could resume tokenising at almost any line
> which didn't seem to be inside a string and often get reasonable
> results.

"Seem to be"? On what basis?

> I grew up with C and Pascal compilers which would _happily_ produce many
> complaints, usually accurate, and all manner of syntactic errors. They
> didn't stop at the first syntax error.

Yes, because they work with a much simpler grammar. But even then,
most syntactic errors (again, this is not to be confused with semantic
errors - if you say "char *x = 1.234;" then there's no parsing
ambiguity but it's not going to compile) cause a fair degree of
nonsense afterwards.

The waters are a bit muddied by some things being called "syntax
errors" when they're actually nothing at all to do with the parser.
For instance:

>>> def f():
...     await q
...
  File "<stdin>", line 2
SyntaxError: 'await' outside async function

This is not what I'm talking about; there's no parsing ambiguity here,
and therefore no difficulty whatsoever in carrying on with the
parsing. You could ast.parse() this code without an error. But
resuming after a parsing error is fundamentally difficult, impossible
without guesswork.
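
A quick way to see the difference, assuming a reasonably recent
CPython: the parser accepts this code and hands back a tree, and it's
only the later compile step that objects.

```python
import ast

SRC = "def f():\n    await q\n"

# Parsing alone succeeds and produces an Await node
tree = ast.parse(SRC)
has_await = any(isinstance(node, ast.Await) for node in ast.walk(tree))

# Compiling the same source is what raises the "SyntaxError"
try:
    compile(SRC, "<sketch>", "exec")
    compile_error = None
except SyntaxError as exc:
    compile_error = exc.msg

print(has_await, compile_error)
```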

> All you need in principle is a parser which goes "report syntax error
> here, continue assuming <some state>". For Python that might mean
> "pretend a missing final colon" or "close open brackets" etc, depending
> on the context. If you make conservative implied corrections you can get
> a reasonable continued parse, enough to find further syntax errors.

And, more likely, you'll generate a lot of nonsense. Take something like this:

items = [
    item[1],
    item2],
    item[3],
]

As a human, you can easily see what the problem is. Try teaching a
parser how to handle this. Most likely, you'll generate a spurious
error - maybe the indentation, maybe the intended end of the list -
but there's really only one error here. Reporting multiple errors
isn't actually going to be at all helpful.
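
You can poke at this with nothing more than compile(). In the sketch
below (first_error is just a throwaway helper name), the first three
lines on their own are a perfectly legal statement, a one-element
tuple containing a list, so whatever the parser complains about, it
cannot be the line where the typo actually is.

```python
SRC = (
    "items = [\n"
    "    item[1],\n"
    "    item2],\n"   # the real typo: missing '[' after "item"
    "    item[3],\n"
    "]\n"
)

def first_error(source):
    """Return the SyntaxError that compile() raises, or None if it compiles."""
    try:
        compile(source, "<sketch>", "exec")
    except SyntaxError as exc:
        return exc
    return None

# Lines 1-3 alone compile: items = [item[1], item2],
prefix = "".join(SRC.splitlines(keepends=True)[:3])
legal_prefix = first_error(prefix) is None

# The whole thing fails, but only at some point after the typo
err = first_error(SRC)
print(legal_prefix, type(err).__name__, err.lineno, err.msg)
```

So even the one error you do get is reported downstream of the
mistake. Anything a recovering parser reports after that is built on
the same wrong guess.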

> I remember the Pascal compiler in particular had a really good "you
> missed a semicolon _back there_" mode which was almost always correct, a
> nice boon when correcting mistakes.
>

Ahh yes. Design a language with strict syntactic requirements, and
it's not too hard to find where the programmer has omitted them. Thing
is... Python just doesn't HAVE those semicolons. Let's say that a
variant Python required you to put a U+251C ├ at the start of every
statement, and U+2524 ┤ at the end of the statement. A whole lot of
classes of error would be extremely easy to notice and correct, and
thus you could resume parsing; but that isn't benefiting the
programmer any. When you don't have that kind of information
duplication, it's a lot harder to figure out how to cheat the fix and
go back to parsing.

ChrisA