New submission from flying sheep:

hi, i have an idea on how to make an internal change to html.parser.HTMLParser, 
which would expose a token generator interface.

after that, we would be able to do e.g. list(HTMLParser().tokenize(data)) or 
even

def tokens():
    parser = HTMLParser()
    for chunk in pipe_in_html():
        yield from parser.tokenize(chunk)

---

the changes affect exclusively HTMLParser’s methods and would unfortunately 
require a behavior change to most (internal) parse_* methods. the changes go as 
follows:

1. the tokenize(data=None, end=False) method is added. it contains mainly 
goahead’s body with a prepended snippet to append passed data to rawdata, and 
all handle_* calls changed to "yield token, data".

2. all parse_* methods which returned an int and called one handle_* method are 
changed to return an (int, token) tuple (so that tokenize can yield the tokens)

3. goahead is changed to a skeleton implementation that iterates over the 
tokens produced by tokenize, so its behavior is unchanged.

the only behavior changes are to the parse_* methods, plus the addition of the 
tokenize method. the tokens are discarded if goahead, feed, or close are 
called. (this can of course be changed if advisable)
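
to make the intended interface concrete, here is a rough sketch. it is not the 
patch: instead of changing goahead or the parse_* methods, it approximates 
tokenize on top of the existing public API by buffering the handle_* callbacks. 
the class name TokenizingParser and the (token, data) shapes are assumptions 
for illustration only.

```python
from html.parser import HTMLParser


class TokenizingParser(HTMLParser):
    """Hypothetical stand-in that emulates the proposed tokenize() method."""

    def __init__(self):
        super().__init__()
        self._tokens = []  # buffer filled by the handle_* callbacks

    # Tokens are named after the handle_* method that would receive them.
    def handle_starttag(self, tag, attrs):
        self._tokens.append(("starttag", (tag, attrs)))

    def handle_endtag(self, tag):
        self._tokens.append(("endtag", tag))

    def handle_data(self, data):
        self._tokens.append(("data", data))

    def handle_comment(self, data):
        self._tokens.append(("comment", data))

    def tokenize(self, data=None, end=False):
        # Generator: nothing is fed to the parser until it is iterated.
        if data is not None:
            self.feed(data)
        if end:
            self.close()
        # Hand out the buffered tokens and start a fresh buffer.
        tokens, self._tokens = self._tokens, []
        yield from tokens
```

with this, list(TokenizingParser().tokenize(data, end=True)) gives the tokens 
for a complete document, and the chunked loop shown earlier works the same way. 
the real change would yield directly from the parsing loop rather than 
buffering.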

---

since this is my first contribution, i’m unsure whether i should attach the 
patch already, not knowing whether the changes to the internal parse_* methods 
are acceptable at all. what do you say?

PS: the tokens are named like the handle_* methods, and the current goahead 
implementation basically calls getattr(self, 'handle_' + token)(data) for each 
(token, data) tuple. This can be changed to a token: method dict or a classic 
“switch” elif stack.
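
a small sketch of the two dispatch styles mentioned in the PS, for comparison. 
the Recorder class and both dispatch_* functions are hypothetical names, and 
the sketch assumes each token carries a tuple of positional arguments (since 
handle_starttag takes both tag and attrs), which may differ from the patch.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Records handle_* calls so both dispatch styles can be compared."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        self.calls.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.calls.append(("endtag", tag))

    def handle_data(self, data):
        self.calls.append(("data", data))


def dispatch_getattr(parser, tokens):
    # Look the handler up by name, as described in the PS.
    for token, args in tokens:
        getattr(parser, "handle_" + token)(*args)


def dispatch_dict(parser, tokens):
    # Alternative: an explicit token -> bound method mapping.
    handlers = {
        "starttag": parser.handle_starttag,
        "endtag": parser.handle_endtag,
        "data": parser.handle_data,
    }
    for token, args in tokens:
        handlers[token](*args)
```

the getattr variant picks up new handle_* methods automatically, while the 
dict variant fails with a KeyError on an unknown token name instead of an 
AttributeError.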

----------
messages: 184096
nosy: flying sheep
priority: normal
severity: normal
status: open
title: Generator-based HTMLParser

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue17410>
_______________________________________