Nice stuff - have you got any benchmarks to prove it's all worthwhile? Cheers,
Phil
---------------------------------------------
Phil Randal
Network Engineer
Herefordshire Council
Hereford, UK

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Scott A Crosby
> Sent: 20 January 2004 05:50
> To: [EMAIL PROTECTED]; Chris Santerre
> Subject: [SAtalk] Matching a list of strings quickly.
>
>
> A few weeks ago I described a technique to automatically convert a
> list of strings into a factored regexp for faster matching.
>
> You know, from
>
>   foobat
>   foobang
>   fooziit
>
> to
>
>   foo(bat|bang|ziit)
>
> Well, I've got a prototype complete and available here:
>
>   http://www.cs.rice.edu/~scrosby/datamining/src/prefixStringFactor/
>
> Binary is for linux x86. I'll put source up eventually.
>
> Pass it a bunch of ordinary strings on successive lines as input, and
> each line of output is a separate rule. You don't want to use escaped
> strings or prefixes and suffixes like the test file shown below, but
> it's what I had. If you're matching URLs, I suggest folding the URL
> list to lowercase first, and using case-insensitive matching.
>
> It's fully automatic and fairly sophisticated, though it will look silly
> on small files. I don't implement right-factoring or greedy left
> factoring yet.
>
> For instance:
>
>   /zrowlandtzq\.com/i
>   /zsoftech\.net/i
>   /zsupper\.com/i
>   /zui6av\.net/i
>   /zunoz\.com/i
>   /zuon6\.net/i
>   /zvg3gc\.org/i
>   /zwdsj\.org/i
>   /zworg\.com/i
>   /zzitq5\.net/i
>
>
> TO
>
>   /ze(roads\.com/i|dnet\.net/i|sty\.ws/i|belkhan\.com/i|nitzenit\.com/i|n1ado\.com/i|nmail2003\.com/i)
>   /za(irmail\.com/i|ushon\.com/i|xouts\.com/i|meq\.org/i|karish\.com/i|qxsw\.biz/i)
>   /zo(ontzq\.com/i|rromail\.com/i|anmail\.com/i|mnieb\.com/i|ne-net\.net/i|ningfor-best\.com/i)
>   /zi(04\.com/i|m-crozer\.net/i|p-media\.com/i|yuantzq\.com/i|bxr\.com/i)
>   /z(worg\.com/i|wdsj\.org/i|hupong\.com/i|hangxiaoping\.com/i|hangnian\.com/i|vg3gc\.org/i|unoz\.com/i|uon6\.net/i|ui6av\.net/i|supper\.com/i|softech\.net/i|dl\.net/i|7wmcsp\.com/i)
>   /z(rowlandtzq\.com/i|re9iq\.net/i|ckzh\.net/i|qlp\.com/i|q89\.org/i|bestoffer\.com/i|ppi\.org/i|3i26up\.org/i|n8px\.com/i|nolt\.net/i|ncvma\.org/i|2p\.net/i|mqp\.net/i|m01\.net/i|kpc\.net/i|khatritzq\.com/i|zitq5\.net/i|jzm\.net/i|jwju\.org/i|jfe\.com/i)
>   /yu(f7b89\.com/i|ictme1s2g5jph\.org/i|78hg\.com/i|aln38\.org/i|noz\.biz/i)
>   /ye(6tj\.com/i|llowtang\.net/i|ah\.net/i|arendsaver\.com/i|smail\.com/i|smail\.net/i|ez\.org/i)
>   /youn(gfaster\.biz/i|gforever22\.com/i|gandhorny\.us/i|gandthin\.biz/i|gpinkpussies\.com/i|gerfasternow\.biz/i)
>   /yourf(avoritepresent\.com/i|avoritestuff\.com/i|reelunch\.com/i|reepresent\.com/i|reevitamins\.com/i)
>   /yourd(omain\.biz/i|omain\.com/i|vdrentalstore\.com/i|ebt\.com/i)
>   /yourb(ig\.com/i|igfun\.com/i|izinformation\.com/i|randsdirect\.net/i|argainbuddy\.com/i|estsavings\.com/i)
>   /yourm(ailsource\.com/i|arketnews\.com/i|edicinecabinet\.biz/i|eds\.biz/i|edstore\.us/i)
>
>
> -------------------------------------------------------
> The SF.Net email is sponsored by EclipseCon 2004
> Premiere Conference on Open Tools Development and Integration
> See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
> http://www.eclipsecon.org/osdn
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
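
For anyone who wants to experiment with the idea before the source is up, here is a minimal Python sketch of left (prefix) factoring using a character trie. It is not the prefixStringFactor tool and makes no claim to match its algorithm or its output format; the names build_trie, trie_to_regex and factor are made up for this illustration. It takes plain, unescaped strings, escapes them itself, and factors at every shared prefix, so it nests more deeply than the foo(bat|bang|ziit) example quoted above.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker (hypothetical convention)
    return root


def trie_to_regex(node):
    """Turn a trie node into a regex fragment with shared prefixes factored."""
    ends_here = "" in node
    # One alternative per child character, in sorted order for stable output.
    branches = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(key for key in node if key)
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a word also ends at this node, the remaining suffix is optional.
    return group + "?" if ends_here else group


def factor(words):
    """Return one factored regex source string matching exactly the given words."""
    return trie_to_regex(build_trie(words))


if __name__ == "__main__":
    print(factor(["foobat", "foobang", "fooziit"]))
    domains = ["zworg.com", "zwdsj.org", "zunoz.com"]
    # Fold to lowercase and match case-insensitively, as suggested for URLs.
    pattern = re.compile(factor(d.lower() for d in domains), re.IGNORECASE)
    print(bool(pattern.fullmatch("ZWORG.COM")))
```

To answer the benchmark question on your own data, one approach is to time the compiled factored pattern against a plain "a|b|c" alternation of the same strings with Python's timeit module. Results will depend heavily on the regex engine and the list, so a Python timing says little about how Perl would behave inside SpamAssassin.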