On Tue, 2007-07-03 at 23:57 +0200, Bas Zoetekouw wrote: 
> I was running webcheck on the debian website when it crashed:
> 
>  > webcheck:   http://www.nl.debian.org/intl/french/typographie
>  > webcheck:   ftp://ftp.icm.edu.pl/pub/Linux/distributions/debian-non-US/
>  > Traceback (most recent call last):
>  >   File "/usr/bin/webcheck", line 249, in ?
>  >     main()
>  >   File "/usr/bin/webcheck", line 211, in main
>  >     site = serialize.deserialize(fp)
>  >   File "/usr/share/webcheck/serialize.py", line 329, in deserialize
>  >     _deserialize_link(link, key, value)
>  >   File "/usr/share/webcheck/serialize.py", line 284, in _deserialize_link
>  >     link.add_linkproblem(_readstring(value, False))
>  >   File "/usr/share/webcheck/serialize.py", line 167, in _readstring
>  >     return str(_unescape(txt))
>  > UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in 
> position 205: ordinal not in range(128)
> 
> The last page that it shows (ftp://ftp.icm.edu.pl/...) doesn't seem to
> be the cause of this though, as that one parses fine when I run
> webcheck on it directly.
> 
> I've put the webcheck.dat file at
> http://zoetekouw.net/Zooi/webcheck.dat.bz2 as the BTS won't accept it
> as an attachment.  Note that it takes quite a while (>30 minutes or
> so) of running webcheck in continuation mode with this webcheck.dat
> before the crash occurs.

The stack trace above is not from a crash while crawling the site but
from a problem reading the saved state (the second run with
--continue). I can reproduce the problem with reading webcheck.dat,
but so far not the crash while crawling (that test is still running).

The problem with webcheck.dat was that unicode data was written to the
file while plain ASCII was expected when reading it back. I've made a
fix for that (patch attached if you want to test it) and I will upload
a new release in the coming days.
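The failure mode can be illustrated with a standalone sketch (Python 3
spelling; in the Python 2 code the implicit ASCII coercion happened
inside str() on a unicode object, and the sample value below is made
up):

```python
# Hypothetical value as it might be read back from webcheck.dat; it
# contains U+FFFD, the replacement character seen in the traceback.
problem = 'dead link: \ufffd'

# The old _readstring(value, False) path effectively forced the text
# through the ASCII codec, which fails on any non-ASCII character:
try:
    problem.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = str(exc)

# The patched code keeps the unescaped text as-is, so non-ASCII
# characters survive the round trip.
kept = problem
```

This is why the patch drops the useunicode=False branch instead of
trying to sanitize the input: the coercion itself was the bug.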

Your report however points out another problem: reading webcheck.dat
shouldn't take 30 minutes. I've done some quick tests but I haven't
been able to pinpoint this one yet; something seems to be going wrong
with the buffering.
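One way to test the buffering suspicion is a micro-benchmark
(hypothetical, not webcheck code; the file contents are made up) that
times line iteration with a tiny read buffer versus the default block
buffer:

```python
import os
import tempfile
import time

# Build a throwaway file shaped roughly like webcheck.dat key=value
# lines (contents are made up for the benchmark).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'w') as f:
    for i in range(50000):
        f.write('linkproblem=example problem %d\n' % i)

def time_read(buffering):
    """Iterate the file line by line with the given buffer size."""
    start = time.time()
    count = 0
    with open(path, 'r', buffering=buffering) as f:
        for line in f:
            count += 1
    return count, time.time() - start

lines_small, t_small = time_read(16)      # 16-byte buffer: many reads
lines_default, t_default = time_read(-1)  # default block buffering
os.remove(path)
```

If reading the saved state is buffer-bound, the tiny-buffer run should
be dramatically slower on a file of realistic size, which would point
at how the deserializer reads the file rather than at what it parses.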

-- 
-- arthur - [EMAIL PROTECTED] - http://people.debian.org/~adejong --
Index: crawler.py
===================================================================
--- crawler.py	(revision 335)
+++ crawler.py	(working copy)
@@ -501,8 +501,7 @@
         # lowercase anchor
         anchor = anchor.lower()
         # convert the url to a link object if we were called with a url
-        if type(parent) is str:
-            parent = self.site.get_link(self.__checkurl(parent))
+        parent = self.__tolink(parent)
         # add anchor
         if anchor in self.reqanchors:
             if parent not in self.reqanchors[anchor]:
Index: serialize.py
===================================================================
--- serialize.py	(revision 335)
+++ serialize.py	(working copy)
@@ -156,15 +156,12 @@
         return None
     return int(txt)
 
-def _readstring(txt, useunicode=True):
+def _readstring(txt):
     """Transform the string read from a key/value pair
     to a string that can be used."""
     if txt == 'None':
         return None
-    if useunicode:
-        return _unescape(txt)
-    else:
-        return str(_unescape(txt))
+    return _unescape(txt)
 
 def _readdate(txt):
     """Interpret the string as a date value."""
@@ -175,7 +172,7 @@
     return None
 
 def _readlist(txt):
-    """nterpret the string as a list of strings."""
+    """Interpret the string as a list of strings."""
     return [ _readstring(x.strip())
              for x in _commapattern.findall(txt) ]
 
@@ -240,7 +237,7 @@
     """The data in the key value pair is fed into the site."""
     debugio.debug("%s=%s" % (key, value))
     if key == 'internal_url':
-        site.add_internal(_readstring(value, False))
+        site.add_internal(_readstring(value))
     elif key == 'internal_re':
         site.add_internal_re(_readstring(value))
     elif key == 'external_re':
@@ -254,14 +251,14 @@
     """The data in the kay value pair is fed into the link."""
     link._ischanged = True
     if key == 'child':
-        link.add_child(_readstring(value, False))
+        link.add_child(_readstring(value))
     elif key == 'embed':
-        link.add_embed(_readstring(value, False))
+        link.add_embed(_readstring(value))
     elif key == 'anchor':
-        link.add_anchor(_readstring(value, False))
+        link.add_anchor(_readstring(value))
     elif key == 'reqanchor':
         (url, anchor) = _readlist(value)
-        link.add_reqanchor(str(url), str(anchor))
+        link.add_reqanchor(url, anchor)
     elif key == 'isfetched':
         link.isfetched = _readbool(value)
     elif key == 'ispage':
@@ -271,19 +268,19 @@
     elif key == 'size':
         link.size = _readint(value)
     elif key == 'mimetype':
-        link.mimetype = _readstring(value, False)
+        link.mimetype = str(_readstring(value))
     elif key == 'encoding':
-        link.encoding = _readstring(value, False)
+        link.encoding = str(_readstring(value))
     elif key == 'title':
         link.title = _readstring(value)
     elif key == 'author':
         link.author = _readstring(value)
     elif key == 'status':
-        link.status = _readstring(value, False)
+        link.status = _readstring(value)
     elif key =='linkproblem':
-        link.add_linkproblem(_readstring(value, False))
+        link.add_linkproblem(_readstring(value))
     elif key =='pageproblem':
-        link.add_pageproblem(_readstring(value, False))
+        link.add_pageproblem(_readstring(value))
     elif key == 'redirectdepth':
         link.redirectdepth = _readint(value)
     else:
