Hi,

Albeit *not* as a daemon, we've successfully developed a Crawler in PHP
within our company. It can run for hours without a leak; if I remember
correctly, its peak memory consumption is below 64MB. However, we're
crawling only a small number of URLs, just around 10,000.

As Brian mentioned: free your database resources, unset unused
variables. We've had one major rewrite which, besides re-architecting
the whole thing for plugins/modularity, involved auditing every step to
make sure resources are properly freed. Usually a PHP developer doesn't
have to pay much attention to this because of the widely used
process-fork model (but I guess I don't need to tell you that :).
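
To make the "free and unset" advice a bit more concrete, here is a
minimal sketch of a long-running loop, not our actual Crawler code. The
names processUrl() and crawl() and the example.com URLs are made up for
illustration, the "fetch" is faked with str_repeat(), and the type
declarations assume PHP 7 or later. It drops large buffers right after
use, runs the cycle collector every so often, and reports peak memory.

```php
<?php
// Hypothetical stand-in for fetching and parsing one URL; a real
// crawler would perform an HTTP request here.
function processUrl(string $url): int
{
    $body = str_repeat('x', 100000); // pretend this is the fetched page
    $length = strlen($body);
    unset($body);                    // drop the large buffer right away
    return $length;
}

// Hypothetical crawl loop: process each URL, release what is no longer
// needed, and return some simple statistics.
function crawl(array $urls): array
{
    $processed = 0;
    foreach ($urls as $i => $url) {
        $length = processUrl($url);
        $processed++;
        // With a database you would also free the result here, e.g.
        // $stmt->closeCursor() (PDO) or $result->free() (mysqli).
        unset($length);
        if ($i % 100 === 0) {
            gc_collect_cycles();     // collect reference cycles periodically
        }
    }
    return [
        'processed'  => $processed,
        'peak_bytes' => memory_get_peak_usage(true),
    ];
}

$urls = [];
for ($i = 0; $i < 1000; $i++) {
    $urls[] = "http://example.com/page/$i";
}

$stats = crawl($urls);
printf(
    "processed=%d peak=%.1fMB\n",
    $stats['processed'],
    $stats['peak_bytes'] / 1048576
);
```

Logging memory_get_usage() per iteration in a loop like this is a cheap
way to spot a step that keeps growing, which is roughly what such an
audit comes down to.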

But you'll often get beaten by PHP itself:
it has quite a few leaks, and finding/tracking them down costs time and
sometimes requires skill at the C level of PHP to properly
understand/diagnose things. And if you were (unfortunately) successful in
identifying a PHP problem, you have to report a bug, preferably with a
patch/workaround attached.

For example, we've had to fight http://bugs.php.net/bug.php?id=43450 .
Tracking this PHP problem was quite time consuming, involving multiple
developers, etc. Luckily we could work around this, but it was pretty
annoying.

We actually planned to release this as open source, donate it to Zend,
whatever. Legally it's done within the company, just no one had the time
for the publishing process, going over things, etc. :/

As a sidenote: We've hit the current limit of our Crawler implementation
in PHP itself: we can't do parallel fetching/processing of URLs in an
efficient manner. You can get things running quickly in PHP, but doing
things with style and a serious architecture hits its limits. We've gone
to Java for such cases, which made sense for us anyway as we had to move
away from Zend_Search_Lucene: it had performance problems with our index,
whereas Lucene/Solr was still mostly bored. Will be interesting to see
if http://code.google.com/p/marjory/ can handle this. Oops, off-topic.



HTH,
- Markus

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
