[Python-Dev] A much better tokenize.untokenize function

2019-11-03 Thread Edward K. Ream
I am creating this post as a courtesy to anyone interested in python's tokenize 
module.

**tl;dr:** Various posts, linked below, discuss a much better replacement for 
untokenize.  Do with it as you will.

This code is very unlikely to be buggy.  *Please* let me know if you find 
problems with it.

**About the new untokenize**

This post: https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/VPqtB9lTEAAJ
announces a replacement for the untokenize function in tokenize.py: 
https://github.com/python/cpython/blob/3.8/Lib/tokenize.py

To summarize this post:

I have "discovered" a spectacular replacement for Untokenizer.untokenize in 
python's tokenize library module:

- The wretched, buggy, and impossible-to-fix add_whitespace method is gone.
- The new code has no significant 'if' statements, and knows almost nothing 
about tokens!

As I see it, the only possible failure modes might involve the zero-length line 
0.  See the above post for a full discussion.
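
To give the flavor of the new approach without chasing links: the essential 
idea is to copy both the tokens and the gaps between them verbatim from the 
original source, using each token's start and end positions, so no whitespace 
is ever guessed. Here is a minimal sketch of that idea (a sketch only, not 
the actual Leo code; it assumes 5-tuple tokens plus the original 'contents' 
string):

    import io
    import tokenize

    def untokenize_sketch(tokens, contents):
        """Reassemble the original source from 5-tuple tokens plus 'contents'."""
        # Map 1-based (row, col) positions to absolute offsets in contents.
        offsets = [0]
        for line in contents.splitlines(keepends=True):
            offsets.append(offsets[-1] + len(line))
        def offset(row, col):
            return offsets[row - 1] + col
        results, prev = [], 0
        for tok in tokens:
            start, end = offset(*tok.start), offset(*tok.end)
            results.append(contents[prev:start])  # inter-token gap, copied verbatim
            results.append(contents[start:end])   # the token itself, copied verbatim
            prev = end
        results.append(contents[prev:])
        return ''.join(results)

    source = "print('hi')  # comment\n"
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    assert untokenize_sketch(tokens, source) == source

Because everything is copied from 'contents', the round trip is exact by 
construction; the open questions all concern tokens whose positions no 
longer match the source.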

**Testing**

This post: https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/5X8IDzpgEAAJ 
discusses testing issues.
Imo, the new code should easily pass all existing unit tests.

The new code also passes a new unit test for Python issue 38663 
(https://bugs.python.org/issue38663), a test the existing code fails, 
even in "compatibility mode" (2-tuples).
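
For concreteness, the shape of such a round-trip test is roughly as follows. 
This is a sketch of the kind of test meant, not the actual test from the 
issue; the exact input that triggers issue 38663 is not reproduced here:

    import io
    import tokenize
    import unittest

    class RoundTripTest(unittest.TestCase):

        def roundtrip(self, source):
            tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
            # Full mode (5-tuples): the output should match exactly.
            self.assertEqual(tokenize.untokenize(tokens), source)
            # Compatibility mode (2-tuples): only the (type, string)
            # sequence is guaranteed to survive, not the exact text.
            pairs = [(t.type, t.string) for t in tokens]
            result = tokenize.untokenize(pairs)
            result_pairs = [
                (t.type, t.string)
                for t in tokenize.generate_tokens(io.StringIO(result).readline)]
            self.assertEqual(result_pairs, pairs)

        def test_simple(self):
            self.roundtrip("x = 1 + 2\n")

    if __name__ == '__main__':
        unittest.main()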

Imo, the way is now clear for proper unit testing of python's Untokenize class.

In particular, it is, imo, time to remove compatibility mode.  This hack has 
masked serious issues with untokenize:
https://bugs.python.org/issue?%40columns=id%2Cactivity%2Ctitle%2Ccreator%2Cassignee%2Cstatus%2Ctype&%40sort=-activity&%40filter=status&%40action=searchid&ignore=file%3Acontent&%40search_text=untokenize&submit=search&status=-1%2C1%2C2%2C3

**Summary**

The new untokenize is the way it is written in The Book.

I have done the heavy lifting on issue 38663. Python devs are free to do with 
it as they like.

Your choice will not affect me or Leo in any way. The new code will soon become 
the foundation of Leo's token-oriented commands.

Edward

P.S. I would imagine that tokenize.untokenize is pretty much off most devs' 
radar :-)

This Engineering Notebook post: 
https://groups.google.com/d/msg/leo-editor/aivhFnXW85Q/b2a8GHvEDwAJ
discusses (in way too much detail :-) why untokenize is important to me.

To summarize that post:

Imo, python devs are biased in favor of parse trees in programs involving text 
manipulations.  I assert that the "real" black and fstringify tools would be 
significantly simpler, clearer and faster if they used python's tokenize module 
instead of python's ast module. Leo's own "beautify" and "fstringify" commands 
prove my assertion to my own satisfaction.

This opinion will be controversial, so I want to make the strongest possible 
case. I need to prove that handling tokens can be done simply and correctly in 
all cases. This is a big ask, because python's tokens are complicated.  See the 
Lexical Analysis section of the Python Language Reference.
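
A tiny illustration of that complexity: INDENT and DEDENT are synthetic 
tokens, and a DEDENT carries an empty string with start == end, so any 
whitespace-reconstruction logic must treat them specially. This snippet 
(mine, for illustration) makes the point visible:

    import io
    import tokenize

    source = "def f():\n    return 1\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
    # The dedent prints something like: DEDENT '' (3, 0) (3, 0)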

The new untokenize furnishes the required proof, and does so elegantly.

EKR


[Python-Dev] Re: A much better tokenize.untokenize function

2019-11-03 Thread Terry Reedy

On 11/3/2019 11:12 AM, Edward K. Ream wrote:

> I am creating this post as a courtesy to anyone interested in python's 
> tokenize module.


As one of the 46 contributors to this module, and as one who fixed 
several untokenize bugs a few years ago, I am interested.


> **tl;dr:** Various posts, linked below, discuss a much better 
> replacement for untokenize.

If that were true, I would be interested.  But as explained below, I 
don't believe it.  Even if I did, https://bugs.python.org/issue38663 
gives no evidence that you have signed the PSF contributor agreement. 
In any case, it has no PR.  We only use code that is actually 
contributed on the issue or in a PR under that agreement.


To continue, the first two lines of tokenize.untokenize() are:

    ut = Untokenizer()
    out = ut.untokenize(iterable)

Your leoBeautify.Untokenize class appears to be completely unsuited as a 
replacement for tokenize.Untokenizer, as the APIs of the class and its 
untokenize method are incompatible with the above.


1. tokenize.Untokenizer takes no argument. leoBeautify.Untokenize() 
requires a 'contents' argument, a (unicode) string, that is otherwise 
undocumented.  At first glance, it appears that 'contents' needs to be 
something like the desired output.  (I could read the code where you 
call Untokenizer to improve my guess, but not now.)  Since our existing 
tests do not pass 'contents', they should all fail.


2. tokenize.Untokenizer.untokenize(iterable) requires an iterable that 
returns "sequences with at least two elements, the token type and the 
token string."

https://docs.python.org/3/library/tokenize.html#tokenize.untokenize
One can generate Python code from a sequence of such pairs with a 
guarantee that the resulting code will be tokenized by the CPython 
tokenizer into the same sequence.


The doc continues "Any additional sequence elements are ignored."
The intent is that a tool can tokenize a file, modify the file (and 
thereby possibly invalidate the begin, end, and line elements of the 
original token stream) and generate a modified file.
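
A minimal illustration of that workflow, with invented names (after the 
rename, the recorded positions are no longer valid, so the tool falls back 
to 2-tuples and lets compatibility mode respace the code):

    import io
    import tokenize

    source = "old_name = old_name + 1\n"
    # Renaming invalidates the recorded positions, so pass 2-tuples.
    pairs = [(t.type, "new_name" if t.string == "old_name" else t.string)
             for t in tokenize.generate_tokens(io.StringIO(source).readline)]
    print(tokenize.untokenize(pairs))  # e.g. 'new_name =new_name +1 '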


[Note that the end index (4th element), when present, is not ignored but 
is used to improve whitespace insertion.  I believe that this should be 
documented.  What if the end index is no longer valid?  Should we also 
use the start index?]
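
A quick experiment shows the difference the positions make (expected output 
in the comments; exact compatibility-mode spacing may vary):

    import io
    import tokenize

    source = "x  =  1\n"   # note the doubled spaces
    full = list(tokenize.generate_tokens(io.StringIO(source).readline))
    pairs = [(t.type, t.string) for t in full]

    print(repr(tokenize.untokenize(full)))   # 'x  =  1\n' -- spacing preserved
    print(repr(tokenize.untokenize(pairs)))  # respaced, e.g. 'x =1 \n'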


leoBeautify.Untokenize.untokenize() requires an iterable of 5-tuples. 
It makes use of both the start and end elements, as well as the 
mysterious required 'contents' string.


> I have "discovered" a spectacular replacement for 
> Untokenizer.untokenize in python's tokenize library module:

To pass 'code == untokenize(tokenize(code))' (ignoring API details), 
there is an even more spectacular replacement: rebuild the code from the 
'line' elements.  But while the above is an essential test, it is a toy 
example with respect to applications.  The challenge is to create a 
correct and valid file from less information, possibly with only token 
type and string.  (The latter is 'compatibility mode'.)
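
For what it is worth, that toy replacement is only a few lines. This is a 
sketch, assuming TokenInfo.line holds the physical line(s) a token spans:

    import io
    import tokenize

    def untokenize_from_lines(tokens):
        """Rebuild source from the 'line' elements alone."""
        out, seen = [], set()
        for tok in tokens:
            if tok.line and tok.start[0] not in seen:
                out.append(tok.line)
                # Mark every physical row this token's line covers.
                seen.update(range(tok.start[0], tok.end[0] + 1))
        return ''.join(out)

    source = "a = 1\nb = 2\n"
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    assert untokenize_from_lines(tokens) == source

It passes the round-trip test trivially, which is exactly why that test 
alone proves little.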

> In particular, it is, imo, time to remove compatibility mode.

And break all usage that requires it?  Before doing much more with 
tokenize, I would want to understand more how it is actually used.


> Imo, python devs are biased in favor of parse trees in programs 
> involving text manipulations.  [snip]

So why have 46 of us contributed to this one module?  This sort of 
polemic is a net negative here.  We are multiple individuals with 
differing opinions.


--
Terry Jan Reedy


[Python-Dev] Re: A much better tokenize.untokenize function

2019-11-03 Thread Edward K. Ream
On Sun, Nov 3, 2019 at 4:14 PM Terry Reedy wrote:

> > **tl;dr:** Various posts, linked below, discuss a much better
> > replacement for untokenize.

> If that were true, I would be interested.  But as explained below, I
> don't believe it.

I do not believe that the tone of my post was in any way objectionable.
Yes, there was implied criticism of the code.  I see no reason for you or
anyone else to take that criticism personally.

The post script asserted only that token-based code might be a better
choice (for some projects) than ast-based code. This is in no way a
criticism of tokenize.py, or of any of its authors.
Clearly, the new code would have to be repackaged if it were to be made
part of tokenize.py. That's all I would like to say now regarding your
comments. Perhaps other devs will pick up the ball.

Edward