New submission from flying sheep: hi, i have an idea on how to make an internal change to html.parser.HTMLParser, which would expose a token generator interface.
after that, we would be able to do e.g. list(HTMLParser().tokenize(data)) or even parser = HTMLParser() for chunk in pipe_in_html(): yield from parser.tokenize(chunk) --- the changes affect excluively HTMLParser’s methods and would unfortunately require a behavior change to most (internal) parse_* methods. the changes go as follows: 1. the tokenize(data=None, end=False) method is added. it contains mainly goahead’s body with an prepended snippet to append passed data to raw_data, and all handle_* calls changed to "yield token, data". 2. all parse_* methods which returned an int and called one handle_* method are changed to return an (int, token) tuple (so that tokenize can yield the tokens) 3. goahead is changed to a skeleton implementation based on traversing the list created by tokenize, experiencing no changed behavior. all changes would only affect the behavior of the parse_* methods, and the addition of the tokenize method: the tokens are discarded if goahead, feed, or close are called. (this can of course be changed if advisable) --- since this is my first contribution, i’m unsure if i shall already add the patch, unknowing if the changes to the internal parse_* methods are acceptable at all. what do you say? PS: the tokens are named like the handle_* methods, and the current goahead implementation basically calls getattr(self, 'handle_' + token)(data) for each (token, data) tuple. This can be changed to a token: method dict or a classic “switch” elif stack. ---------- messages: 184096 nosy: flying sheep priority: normal severity: normal status: open title: Generator-based HTMLParser _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue17410> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com