On Sat, 15 Jul 2017 13:13:31 -0500 (CDT)
David B Funk wrote:

> > On Sat, 15 Jul 2017, Antony Stone wrote:

> One observation; that list has over 10,000 entries which means that
> you're going to be adding thousands of additional rules to SA on an
> automated basis.
> 
> Some time in the past other people had worked up automated mechanisms
> to add large numbers of rules derived from example spam messages (Hi
> Chris;) and there were performance issues (significant increase in SA
> load time, memory usage, etc).

I'm not an expert on Perl internals, so I may be wide of the mark,
but I would have thought that the most efficient way to do this
with uri rule(s) is to generate a single regex recursively, so that
scanning is O(log(n)) in the number of entries rather than O(n).

You start by stripping the http://, then make a list of all the
first characters, and for each character you recurse. You end up
with something like

^http://(a(...)|b(...)...|z(...))

where each of the (...) groups contains a similar list of alternations
to the one at the top level.

You can take this a bit further and detect when all the strings in
the current list start with a common sub-string; you can then generate
the equivalent of a Patricia trie in regex form.
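
Something like the following (a rough, untested sketch; the helper
names are mine, and it needs Perl 5.10+ for //=) builds the trie and
walks it to emit the nested alternations. Single-child chains fall
out as plain concatenation, so the common-prefix collapsing comes
for free:

  use strict;
  use warnings;

  # Build a trie from the (scheme-stripped) strings, then walk it
  # to emit a single regex whose nesting mirrors the trie.
  sub trie_to_regex {
      my @strings = @_;
      my %trie;
      for my $s (@strings) {
          my $node = \%trie;
          $node = $node->{$_} //= {} for split //, $s;
          $node->{''} = 1;    # mark that a complete string ends here
      }
      return _emit(\%trie);
  }

  sub _emit {
      my ($node) = @_;
      my $end  = exists $node->{''};   # a string terminates at this node
      my @keys = grep { $_ ne '' } sort keys %$node;
      return '' unless @keys;          # leaf: nothing left to match
      my @alts = map { quotemeta($_) . _emit($node->{$_}) } @keys;
      my $re   = join '|', @alts;
      # Wrap when there are real alternatives, or when the whole
      # subtree is optional because a shorter string ends here.
      $re  = "(?:$re)" if @alts > 1 or $end;
      $re .= '?'       if $end;
      return $re;
  }

CPAN's Regexp::Assemble does essentially this, rather more robustly,
if you'd prefer not to roll your own.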


> Be aware, you may run into that situation. Using a URI-dnsbl avoids
> that risk.
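
(For reference, wiring a local URI DNSBL into SA looks roughly like
this; the zone and rule names below are made up:

  urirhssub  URIBL_LOCAL  uribl.example.lan.  A  127.0.0.2
  body       URIBL_LOCAL  eval:check_uridnsbl('URIBL_LOCAL')
  describe   URIBL_LOCAL  Contains a URL whose domain is on the local URIBL
  score      URIBL_LOCAL  5.0

but note that a DNSBL checks the domain, not the whole URL.)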

The list contains full URLs, though, and I presume there's a reason
for that. For example:

http://invoiceholderqq.com/85.exe
http://invoiceholderqq.com/87.exe
http://invoiceholderqq.com/93.exe
http://inzt.net/08yhrf3
http://inzt.net/0ftce4
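
For what it's worth, feeding those five entries (scheme stripped)
through the sketch above gives:

  my @urls = qw(
      invoiceholderqq.com/85.exe
      invoiceholderqq.com/87.exe
      invoiceholderqq.com/93.exe
      inzt.net/08yhrf3
      inzt.net/0ftce4
  );
  my $body = trie_to_regex(@urls);
  # $body is now:
  # in(?:voiceholderqq\.com\/(?:8(?:5\.exe|7\.exe)|93\.exe)|zt\.net\/0(?:8yhrf3|ftce4))

which you could drop into a uri rule (the name and score are made up):

  uri      LOCAL_URL_LIST  /^http:\/\/in(?:voiceholderqq\.com\/(?:8(?:5\.exe|7\.exe)|93\.exe)|zt\.net\/0(?:8yhrf3|ftce4))/
  describe LOCAL_URL_LIST  URL appears on the local feed
  score    LOCAL_URL_LIST  5.0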
