--- Begin Message ---
Package: planet-venus
Version: 0~git9de2109-4
Severity: important
Tags: patch
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
Dear Maintainer,
after updating python-html5lib to 0.999999999-1, planet-venus fails
with:
ERROR:planet.runner:TypeError: __init__() got an unexpected keyword argument 'encoding'
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/planet/spider.py", line 484, in spiderPlanet
    writeCache(uri, feed_info, data)
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/planet/spider.py", line 293, in writeCache
    reconstitute.source(xdoc.documentElement,data.feed,data.bozo,format)
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/planet/reconstitute.py", line 240, in source
    content(xsource, 'subtitle', source.get('subtitle_detail',None), bozo)
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/planet/reconstitute.py", line 170, in content
    html = parser.parse(xdiv % detail.value, encoding="utf-8")
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)
Traceback (most recent call last):
  File "/usr/bin/planet", line 143, in <module>
    doc = splice.splice()
  File "/usr/lib/python2.7/dist-packages/planet/splice.py", line 84, in splice
    reconstitute.source(xdoc.documentElement, data.feed, None, None)
  File "/usr/lib/python2.7/dist-packages/planet/reconstitute.py", line 240, in source
    content(xsource, 'subtitle', source.get('subtitle_detail',None), bozo)
  File "/usr/lib/python2.7/dist-packages/planet/reconstitute.py", line 170, in content
    html = parser.parse(xdiv % detail.value, encoding="utf-8")
  File "/usr/lib/python2.7/dist-packages/html5lib/html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/html5lib/html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/usr/lib/python2.7/dist-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/usr/lib/python2.7/dist-packages/html5lib/_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'
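Fixing this (the keyword is now override_encoding) exposes a second
error: planet/scrub.py still passes html5lib's sanitizer as a tokenizer,
which the new html5lib no longer supports. See [1] and [2].
The attached patch makes planet-venus work again. In
planet/reconstitute.py it passes override_encoding instead of encoding;
in planet/scrub.py it drops the sanitizer tokenizer and the encoding
argument and uses HTMLSerializer(sanitize=True) instead. It should
probably be incorporated into
debian/patches/html5lib-no_XHTMLSerializer.patch.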
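For illustration only, here is a minimal sketch of how the keyword rename
could be tolerated across both html5lib versions. It is not part of the
attached patch. The helper name parse_utf8 is hypothetical, and the sketch
assumes that older releases accept encoding= while newer ones accept
override_encoding= (the keyword the attached patch uses). The two stub
functions stand in for HTMLParser.parse so the example runs without
html5lib installed.

```python
def parse_utf8(parse, data):
    """Call an html5lib-style parse callable with a UTF-8 encoding hint.

    `parse` is expected to behave like HTMLParser.parse. This helper is
    hypothetical and not part of planet-venus or html5lib.
    """
    try:
        # Keyword used by the attached patch for the new html5lib.
        return parse(data, override_encoding="utf-8")
    except TypeError:
        # Assumed fallback for older releases that only know `encoding`.
        return parse(data, encoding="utf-8")


# Stand-ins for the two API generations (assumptions, not real html5lib).
def old_parse(data, encoding=None):
    return ("old", data, encoding)


def new_parse(data, override_encoding=None):
    return ("new", data, override_encoding)


print(parse_utf8(old_parse, "<p>x</p>"))
print(parse_utf8(new_parse, "<p>x</p>"))
```

Catching TypeError is a blunt check: it would also hide an unrelated
TypeError raised inside the parser, so pinning the html5lib version in the
package dependencies is probably the cleaner choice for Debian.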
Cheers,
sur5r
[1] https://github.com/html5lib/html5lib-python/issues/277
[2] https://github.com/html5lib/html5lib-python/issues/72
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEEe/X2rDZDH11A3BN6TPKyGPVNrj0FAlg65/MACgkQTPKyGPVN
rj1l6BAAqQyCb4TzzZ5ueiBhp5OTY7U5z+8SP4rquuD+4bMaSq6sZuDkwH/mk71E
+rXt5/EsUezRoIjvmRpOlP/1ANDNnidhoxz7OttHBiRWZQUZ/QG6HlSF4t3BOOUY
J87zTwMJJC0aM2CRod5K30EUX2eDnmbrEyMJ5DqL2aSl+V8I7tH+9ttTK7myeW25
C0y8S2D3GWCn3pjMh3PsKk6zEkX+3niERpXfXNHytlrYuBEJI4hG9xi6g7sHN9ds
dhaiopTbUonEQhHkpzKwmPc08IcMvwO/xTCecrtsiTGs1wRi5I7uxmRwySljVzDS
AuIm3cEz/Qy8SzDkDc7eWYrk7LxYE2vcJ4PZlNy75sSWoDsq0LYbmcHQq7vtrHhd
dlctzLSEx9v0MUtNcjz6iCCdFBnVdJS3VTLjCqmlt4p1c0LgbeZeuokmIhIb3s/Q
kClegb1wcuqcw3PKxMjZdUWEg7/gh84aDf/d2kb2+r+B54XXhysQM9eXpTPm24Hx
ushQZ99At/mxFEbY1UmlvUmMjfNdEV402riDUlKUGR7f+10dWvxY2cRRSZc+fXGj
cmAeT8xZa8aAZ2ou9Qmq/8/ixK9ez+A0VFgKBV69wqPzQx2fG3Omy3AY+/encjGp
cjF0QqpbRc5fswiNI9e7Y5b2E2R1kiSo6qduSB323ejYf0tQHAI=
=Lnir
-----END PGP SIGNATURE-----
--- a/planet/scrub.py 2016-02-17 00:00:00.000000000 +0100
+++ b/planet/scrub.py 2016-11-27 13:47:47.000000000 +0100
@@ -139,12 +139,12 @@
node['type']='text/html'
if not doc:
- from html5lib import html5parser, treebuilders, sanitizer
- p=html5parser.HTMLParser(tree=treebuilders.getTreeBuilder('dom'), tokenizer=sanitizer.HTMLSanitizer)
- doc = p.parseFragment(node['value'], encoding='utf-8')
+ from html5lib import html5parser, treebuilders
+ p=html5parser.HTMLParser(tree=treebuilders.getTreeBuilder('dom'))
+ doc = p.parseFragment(node['value'])
from html5lib import treewalkers, serializer
walker = treewalkers.getTreeWalker('dom')(doc)
- xhtml = serializer.HTMLSerializer(inject_meta_charset = False)
+ xhtml = serializer.HTMLSerializer(inject_meta_charset = False, sanitize=True)
tree = xhtml.serialize(walker, encoding='utf-8')
node['value'] = ''.join([str(token) for token in tree])
--- a/planet/reconstitute.py 2016-02-17 00:00:00.000000000 +0100
+++ b/planet/reconstitute.py 2016-11-27 13:47:50.000000000 +0100
@@ -167,7 +167,7 @@
if detail.type.find('xhtml')<0 or bozo:
parser = html5parser.HTMLParser(tree=treebuilders.getTreeBuilder('dom'))
- html = parser.parse(xdiv % detail.value, encoding="utf-8")
+ html = parser.parse(xdiv % detail.value, override_encoding="utf-8")
for body in html.documentElement.childNodes:
if body.nodeType != Node.ELEMENT_NODE: continue
if body.nodeName != 'body': continue
--- End Message ---