Graham Toal wrote:

> In fact with a decent string search algorithm (using a trie of
> strings) there should be very little extra overhead in adding more
> strings to be searched in parallel.
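For concreteness, a trie-based multi-pattern scan along the lines Graham
describes might look like the sketch below. It is purely illustrative (not
ClamAV code; the names trie_add and trie_scan are made up here), and a real
scanner would use Aho-Corasick failure links rather than restarting the trie
walk at every input position. Error handling and cleanup are omitted for
brevity.

    /* Purely illustrative sketch: a byte-trie holding several patterns,
     * scanned against a buffer.  Not ClamAV code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct node {
        struct node *next[256];  /* one child per possible byte value */
        int terminal;            /* nonzero if a pattern ends here */
    };

    static struct node *node_new(void)
    {
        return calloc(1, sizeof(struct node));  /* zeroed children */
    }

    static void trie_add(struct node *root, const char *pat)
    {
        struct node *n = root;
        for (; *pat; pat++) {
            unsigned char c = (unsigned char)*pat;
            if (!n->next[c])
                n->next[c] = node_new();
            n = n->next[c];
        }
        n->terminal = 1;
    }

    /* Return 1 if any of the added patterns occurs in buf[0..len). */
    static int trie_scan(const struct node *root, const char *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            const struct node *n = root;
            for (size_t j = i; j < len; j++) {
                n = n->next[(unsigned char)buf[j]];
                if (!n)
                    break;        /* no pattern continues with this byte */
                if (n->terminal)
                    return 1;     /* some pattern ends here: match */
            }
        }
        return 0;
    }

    int main(void)
    {
        struct node *root = node_new();
        trie_add(root, "paypal.example");
        trie_add(root, "bank.example");

        const char *msg = "Click http://paypal.example/login now!";
        printf("match: %d\n", trie_scan(root, msg, strlen(msg)));
        return 0;
    }

Adding another pattern is just another trie_add() call; the work done at
each input position is bounded by the longest pattern, not by the number of
patterns, which is why extra strings add very little overhead.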
PhishingScanURLs does not use string matching. It uses regexes, and regex
matching can be very expensive: with backreferences it is NP-hard (though I
don't think Clam uses backreferences, which are the worst culprits), and
even without them a backtracking matcher can blow up on pathological input.
It also involves calls to cli_html_normalise, which looks both scary and
expensive. cli_html_normalise is almost 1100 lines long and is filled with
fixed-length buffer declarations. While that does not necessarily mean it's
a security risk, it still sends shivers up my spine. Nobody should be
writing 1100-line functions! See libclamav/phishcheck.c and
libclamav/htmlnorm.c for the code in question.

> You're right in your assessment above. It should be simple and
> lightweight. That doesn't rule out scanning for URLs in the body
> text, it just means you have to do so efficiently, and IMHO using
> regexps is not efficient and seldom justified.

Exactly. :-) So the Clam people should not be using regexes (see the end of
this message for a sketch of what a non-regex URL scan might look like).
(Our customers, in fact, always run ClamAV in conjunction with an anti-spam
scanner, so it's no benefit to them to have Clam try to do anti-spam.)

Regards,

David.
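For reference, here is roughly what a non-regex URL scan might look like: a
minimal, purely illustrative sketch (not ClamAV code; extract_urls and
url_char are hypothetical names) that hunts for "http://" and "https://"
with strstr() and then reads characters that can legally appear in a URL.

    /* Purely illustrative sketch: extracting URLs from body text with a
     * plain substring scan instead of a regex.  Not ClamAV code. */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    /* Characters we allow inside a URL (a rough approximation). */
    static int url_char(int c)
    {
        return isalnum(c) || strchr("-._~:/?#[]@!$&'()*+,;=%", c) != NULL;
    }

    static void extract_urls(const char *text)
    {
        const char *p = text;
        while ((p = strstr(p, "http")) != NULL) {
            const char *start = p;
            if (strncmp(p, "http://", 7) == 0)
                p += 7;
            else if (strncmp(p, "https://", 8) == 0)
                p += 8;
            else {
                p += 4;            /* "http" without "://": keep looking */
                continue;
            }
            while (*p && url_char((unsigned char)*p))
                p++;               /* consume the rest of the URL */
            printf("found URL: %.*s\n", (int)(p - start), start);
        }
    }

    int main(void)
    {
        extract_urls("Visit http://phish.example/a?b=1 or "
                     "https://bank.example today.");
        return 0;
    }

Something like this makes a single pass over the body text with no
backtracking, which is the kind of lightweight scan being argued for above.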