Simon Cropper <simoncrop...@fossworkflowguides.com> wrote:

> Certainly doable but considering the shear commonality of this task I
> don't understand why a simple script does not already exist - hence my
> original request for assistance.

I think you may have underestimated the complexity of the task in general. To do it for a remote website you need to specify what you consider to be a unique page. Here are some questions:

Is case significant for URLs? (Technically it always is, but IIS sites tend to ignore it and to contain links with random permutations of case.)

Are there any query parameters that make two pages distinct? Are there any parameters that you should ignore? Is the order of parameters significant? I recently came across a site that not only had multiple links to identical pages with the query parameters in a different order, but also used a non-standard % to separate parameters instead of &. It's not so easy getting crawlers to handle that mess.

Even after ignoring query parameters, is there a finite number of pages on the site? For example, Apache has a spelling correction module that can effectively allow any number of spurious subfolders. I've seen a site where "/folder1/index.html" had a link to "folder2/index.html" and "/folder2/index.html" linked to "folder1/index.html". Apache helpfully accepted /folder2/folder1/ as equivalent to /folder1/ and therefore, by extension, also accepted /folder2/folder1/folder2/folder1/... Zope is also good at creating infinite folder structures.

If you want to spider a remote site then there are plenty of off-the-shelf spidering packages, e.g. httrack. They have a lot of configuration options to try to handle the above gotchas.

Your case is probably a lot simpler, but those are just a few reasons why it isn't actually a trivial task. Building a list by scanning a bunch of folders with html files is comparatively easy, which is why that is almost always the preferred solution if possible.

-- 
Duncan Booth http://kupuguy.blogspot.com
-- 
http://mail.python.org/mailman/listinfo/python-list
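
To make the "what is a unique page?" questions above a bit more concrete, here is a rough sketch of how a crawler might normalise URLs before deciding whether it has seen a page already. The function names (canonical_url, looks_like_path_loop), the ignore_case and ignore_params options, and the example.com URLs are all made up for illustration; they are not taken from httrack or any other package. It only covers case, ignorable parameters and parameter order. It does not try to handle the non-standard % separator, and the loop check is a crude heuristic that will also flag a legitimate path such as /a/a/.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def canonical_url(url, ignore_case=False, ignore_params=()):
    """Return a normalised form of url for 'have I seen this page?' checks.

    ignore_case and ignore_params are hypothetical policy knobs: whether
    the path is compared case-insensitively, and which query parameters
    do not make two pages distinct.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if ignore_case:
        path = path.lower()
    # Drop the parameters we were told to ignore, then sort the rest so
    # that parameter order no longer matters.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignore_params
    ]
    pairs.sort()
    # Scheme and host are case-insensitive; the fragment is dropped.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(pairs), "")
    )


def looks_like_path_loop(url):
    """Heuristic: True if some run of path segments repeats back to back,
    as in /folder2/folder1/folder2/folder1/."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    n = len(segments)
    for size in range(1, n // 2 + 1):
        for start in range(n - 2 * size + 1):
            first = segments[start:start + size]
            second = segments[start + size:start + 2 * size]
            if first == second:
                return True
    return False


seen = set()
for link in (
    "http://example.com/Page.html?b=2&a=1&sessionid=xyz",
    "http://EXAMPLE.com/page.html?a=1&b=2",
):
    seen.add(canonical_url(link, ignore_case=True, ignore_params={"sessionid"}))
print(seen)
```

With those settings both links collapse to a single entry. Which parameters to ignore, and whether case matters, still has to be decided per site, which is really the point of the questions above.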