Hi everyone, I am using Python's re module to extract some data from HTML. The following code never returns, and I was wondering if someone can explain to me why. Is this a problem with my regexp? (I tried really hard to find one.)
The string contains three records (list items in an HTML page). Notice that NONE of them matches the regexp: these records do not contain the "title" attribute which the regexp expects inside '<span class="date">'. The weird thing is that removing any one of the three records makes findall() return an empty list immediately, while if I pass all three records to findall() it never returns. Why does this happen? This is using Python 2.6.

Thanks so much for any help
-james

import re

s="""<li class="post" key="4994199a0b80136cb3174e9e875c545e"> <h4 class="desc"><a href="http://www.sluggy.com/" rel="nofollow">Sluggy Freelance</a> </h4> <div class="commands"> <a save href="/post?url=http%3A%2F %2Fwww.sluggy.com%2F&title=Sluggy %20Freelance&copyuser=crowebert&copytags=imported%2BRSS %2BComics%2Bhumor%2Bdaily%2Bwebcomics&jump=no&partner=del" class="copy" rel="nofollow">save this</a></div> <div class="meta">to <a class="tag" href="/crowebert/imported">imported</a> <a class="tag" href="/crowebert/RSS">RSS</a> <a class="tag" href="/crowebert/ Comics">Comics</a> <a class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/crowebert/daily">daily</a> <a class="tag" href="/ crowebert/webcomics">webcomics</a> ... 
<a class="pop" href="/url/ ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background- color: rgb(100%, 66%, 66%);">saved by 983 other people</a> <span class="date">1945-07-18</span> </div> </li> <li class="post" key="65d66f4197fc7eba5c214fe85ed77725"> <h4 class="desc"><a href="http://www.snackbar-games.com/ gbacovers.php" rel="nofollow">Snackbar-Games.com :: GBA DS Cover Project</a> </h4> <div class="commands"> <a save href="/post?url=http%3A%2F %2Fwww.snackbar-games.com%2Fgbacovers.php&title=Snackbar-Games.com %20%3A%3A%20GBA%20DS%20Cover %20Project&copyuser=crowebert&copytags=imported%2BBookmarkMenu %2BGameStuff%2Bart%2BGBA%2Bgames %2Bnintendo&jump=no&partner=del" class="copy" rel="nofollow">save this</a></div> <div class="meta">to <a class="tag" href="/crowebert/imported">imported</a> <a class="tag" href="/ crowebert/BookmarkMenu">BookmarkMenu</a> <a class="tag" href="/ crowebert/GameStuff">GameStuff</a> <a class="tag" href="/crowebert/ art">art</a> <a class="tag" href="/crowebert/GBA">GBA</a> <a class="tag" href="/crowebert/games">games</a> <a class="tag" href="/ crowebert/nintendo">nintendo</a> ... 
<a class="pop" href="/url/ a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background- color: rgb(100%, 84%, 84%);">saved by 26 other people</a> <span class="date">1948-12-31</span> </div> </li> <li class="post" key="690ace1f465ae419dee8145ad3871024"> <h4 class="desc"><a href="http://www.megatokyo.com/" rel="nofollow">MegaTokyo</a> </h4> <div class="commands"> <a save href="/post?url=http%3A%2F %2Fwww.megatokyo.com %2F&title=MegaTokyo&copyuser=crowebert&copytags=imported %2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2Bhumor %2Bwebcomics&jump=no&partner=del" class="copy" rel="nofollow">save this</a></div> <div class="meta">to <a class="tag" href="/crowebert/imported">imported</a> <a class="tag" href="/ crowebert/BookmarkBar">BookmarkBar</a> <a class="tag" href="/crowebert/ WeekendComics">WeekendComics</a> <a class="tag" href="/crowebert/ comics">comics</a> <a class="tag" href="/crowebert/manga">manga</a> <a class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/ crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/ 94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background- color: rgb(100%, 60%, 60%);">saved by 2784 other people</a> <span class="date">1946-01-28</span> </div> </li>"""

regexp = re.compile(
    "<li class=\"post\".*?"
    "<h4 class=\"desc\"><a href=\"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*"
    "(?:<p class=\"notes\">(.*?)</p>)?.*?"
    "<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?> )+))*.*?"
    "<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</li>",
    re.DOTALL)

re.findall(regexp, s)
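
P.S. In case it helps, here is a small sanity check of the claim that the records lack the "title" attribute. It is only a sketch: the record below is an abbreviated, made-up stand-in for one list item (not the full data), and the names date_with_title and date_plain are just for this illustration. It checks the tail of the pattern on its own and says nothing about why the full findall() hangs.

```python
import re

# Abbreviated, hypothetical stand-in for one record (not the full data).
record = (
    '<li class="post" key="k"> <h4 class="desc">'
    '<a href="http://www.sluggy.com/" rel="nofollow">Sluggy Freelance</a> </h4> '
    '<div class="meta">to <a class="tag" href="/crowebert/RSS">RSS</a> ... '
    '<span class="date">1945-07-18</span> </div> </li>'
)

# The tail of the pattern insists on a title attribute in the date span ...
date_with_title = re.compile(r'<span class="date" title="(.*?)">')
# ... while the records only carry the bare form.
date_plain = re.compile(r'<span class="date">(.*?)</span>')

has_title = date_with_title.search(record) is not None
plain_dates = date_plain.findall(record)

print(has_title)    # False: no date span with a title attribute
print(plain_dates)  # ['1945-07-18']
```

So the date part of the pattern can never succeed on this kind of record, which is consistent with findall() returning an empty list whenever it returns at all.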