Dan Muey wrote:
All I hear are crickets on the list. Anybody there today?

Maybe it's because the northeastern quarter of the US was blacked out all day... muuuhhhaaaa, we have power!



Here's a quick thing I was wondering, since I'm no HTTP expert.



You, wondering something? Never happens. ;-)


I'd like to have a simple spider that will look at a URL's directory and simply give me a list of files in that directory.


Well, it depends first of all on whether that directory will show you its contents freely. Most directories these days prevent this, spoil sports.


I.e.

my $files = ????? http://www.monkey.com/bannana/

And have $files be an array reference or something, so I could then:


LWP will let you get the contents of whatever is at the end of that address. You can then parse what comes back, which is (usually) just HTML, with something like HTML::TokeParser, pulling out all of the "a href"s. That leaves you with a list of links to fetch, which you can do just about anything with.
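For what it's worth, a minimal sketch of that approach might look something like the following (the URL is the made-up one from your question, and real directory listings vary, so the hrefs you get back may include parent-directory and sort-order links you would want to filter out):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;

# Made-up URL from the question above.
my $url = 'http://www.monkey.com/bannana/';

my $ua       = LWP::UserAgent->new;
my $response = $ua->get($url);
die "Couldn't fetch $url: ", $response->status_line, "\n"
    unless $response->is_success;

# Walk the returned HTML and pull the href out of every <a> tag.
my $html   = $response->content;
my $parser = HTML::TokeParser->new( \$html );

my @files;
while ( my $token = $parser->get_tag('a') ) {
    my $href = $token->[1]{href};
    push @files, $href if defined $href;
}

# $files as an array reference, as asked for below.
my $files = \@files;
for ( @{$files} ) { print "-$_-\n"; }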


for(@{$files}) { print "-$_-\n"; }

Or something like that.

Is that even possible with HTTP and, say, LWP? Or am I all wet even thinking of trying to do that?


That is precisely how spiders work, though in general they follow regular linked pages rather than looking for directory listings. To make it worthwhile you would want to set up a non-blocking, forked spider so that you can spawn a number of processes to fetch pages, since each will take a different amount of time, and then queue them as they come back for further processing; for that I would suggest POE. Naturally you will want to store the results in a database, so DBI would be handy too. And while you are at it, you should probably have a duplicate checker so you don't fetch and parse the same link more than once...
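A stripped-down, single-process sketch of that loop (leaving out POE, forking, and DBI for brevity; just a queue plus a %seen hash as the duplicate checker, starting from the made-up URL above) might look like:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;
use URI;

# Hypothetical starting point; swap in whatever you are spidering.
my @queue = ('http://www.monkey.com/bannana/');
my %seen;    # duplicate checker so no link is fetched twice

my $ua = LWP::UserAgent->new( agent => 'toy-spider/0.1' );

# NOTE: no depth or same-host limit here -- a real spider needs one.
while ( my $url = shift @queue ) {
    next if $seen{$url}++;

    my $response = $ua->get($url);
    next unless $response->is_success;

    # This is where you would hand the page off for further
    # processing -- e.g. store it via DBI.
    print "fetched $url\n";

    # Queue up every link found on the page, made absolute.
    my $html   = $response->content;
    my $parser = HTML::TokeParser->new( \$html );
    while ( my $token = $parser->get_tag('a') ) {
        my $href = $token->[1]{href} or next;
        push @queue, URI->new_abs( $href, $url )->as_string;
    }
}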


Have at it! Watch out, Google!

http://danconia.org

