Question - your "from" doesn't match your "to" in the final example - right?

Good idea - as a general-purpose tool, such a concept could help condense
patterns very efficiently. If you shared your source, some other people might
have improvements to make though (not me - I'm a regex wimp ;-)

Can appreciate the ease of effort though.

Very cool.

m/

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Scott
> A Crosby
> Sent: Monday, January 19, 2004 9:50 PM
> To: [EMAIL PROTECTED]; Chris Santerre
> Subject: [SAtalk] Matching a list of strings quickly.
>
>
> A few weeks ago I described a technique to automatically convert a
> list of strings into a factored regexp for faster matching.
>
> You know, from
>
>   foobat
>   foobang
>   fooziit
>
> to
>
>   foo(bat|bang|ziit)
>
> Well, I've got a prototype complete and available here:
>
>    http://www.cs.rice.edu/~scrosby/datamining/src/prefixStringFactor/
>
> Binary is for linux x86. I'll put source up eventually.
>
> Pass it a bunch of ordinary strings on successive lines as input, and
> each line of output is a separate rule. You don't want to use escaped
> strings or prefixes and suffixes like the test file shown below, but
> it's what I had. If you're matching URLs, I suggest folding the URL
> list to lowercase first, and using case-insensitive matching.
>
> It's fully automatic and fairly sophisticated, though it will look silly
> on small files. I don't implement right-factoring or greedy left
> factoring yet.
>
> For instance:
>
> /zrowlandtzq\.com/i
> /zsoftech\.net/i
> /zsupper\.com/i
> /zui6av\.net/i
> /zunoz\.com/i
> /zuon6\.net/i
> /zvg3gc\.org/i
> /zwdsj\.org/i
> /zworg\.com/i
> /zzitq5\.net/i
>
>
> TO
>
> /ze(roads\.com/i|dnet\.net/i|sty\.ws/i|belkhan\.com/i|nitzenit\.com/i|n1ado\.com/i|nmail2003\.com/i)
> /za(irmail\.com/i|ushon\.com/i|xouts\.com/i|meq\.org/i|karish\.com/i|qxsw\.biz/i)
> /zo(ontzq\.com/i|rromail\.com/i|anmail\.com/i|mnieb\.com/i|ne-net\.net/i|ningfor-best\.com/i)
> /zi(04\.com/i|m-crozer\.net/i|p-media\.com/i|yuantzq\.com/i|bxr\.com/i)
> /z(worg\.com/i|wdsj\.org/i|hupong\.com/i|hangxiaoping\.com/i|hangnian\.com/i|vg3gc\.org/i|unoz\.com/i|uon6\.net/i|ui6av\.net/i|supper\.com/i|softech\.net/i|dl\.net/i|7wmcsp\.com/i)
> /z(rowlandtzq\.com/i|re9iq\.net/i|ckzh\.net/i|qlp\.com/i|q89\.org/i|bestoffer\.com/i|ppi\.org/i|3i26up\.org/i|n8px\.com/i|nolt\.net/i|ncvma\.org/i|2p\.net/i|mqp\.net/i|m01\.net/i|kpc\.net/i|khatritzq\.com/i|zitq5\.net/i|jzm\.net/i|jwju\.org/i|jfe\.com/i)
> /yu(f7b89\.com/i|ictme1s2g5jph\.org/i|78hg\.com/i|aln38\.org/i|noz\.biz/i)
> /ye(6tj\.com/i|llowtang\.net/i|ah\.net/i|arendsaver\.com/i|smail\.com/i|smail\.net/i|ez\.org/i)
> /youn(gfaster\.biz/i|gforever22\.com/i|gandhorny\.us/i|gandthin\.biz/i|gpinkpussies\.com/i|gerfasternow\.biz/i)
> /yourf(avoritepresent\.com/i|avoritestuff\.com/i|reelunch\.com/i|reepresent\.com/i|reevitamins\.com/i)
> /yourd(omain\.biz/i|omain\.com/i|vdrentalstore\.com/i|ebt\.com/i)
> /yourb(ig\.com/i|igfun\.com/i|izinformation\.com/i|randsdirect\.net/i|argainbuddy\.com/i|estsavings\.com/i)
> /yourm(ailsource\.com/i|arketnews\.com/i|edicinecabinet\.biz/i|eds\.biz/i|edstore\.us/i)
>
>
> -------------------------------------------------------
> The SF.Net email is sponsored by EclipseCon 2004
> Premiere Conference on Open Tools Development and Integration
> See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
> http://www.eclipsecon.org/osdn
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
>
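
For anyone who wants to experiment before the source is posted, here is a
minimal sketch of the left (prefix) factoring idea from the quoted message,
written in Python. It is not Scott's tool and makes no claim about how his
prototype works; it simply builds a character trie and prints it back out as
a regexp. The names build_trie, trie_to_regex, factor and END are made up for
illustration. Unlike the foo(bat|bang|ziit) example above, this version keeps
factoring at every branch point and uses non-capturing groups.

```python
import re

END = ""  # hypothetical marker key: some input string ends at this trie node


def build_trie(strings):
    """Insert each string into a nested-dict trie, one character per level."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def trie_to_regex(node):
    """Return a regex fragment matching exactly the suffixes stored under node."""
    branches = []
    optional = False
    for ch in sorted(node):
        if ch == END:
            optional = True  # a string ends here, so any longer tail is optional
        else:
            branches.append(re.escape(ch) + trie_to_regex(node[ch]))
    if not branches:
        return ""
    if len(branches) == 1 and not optional:
        return branches[0]  # no branching: no group needed
    body = "(?:" + "|".join(branches) + ")"
    return body + "?" if optional else body


def factor(strings):
    """Turn a list of literal strings into one prefix-factored regex."""
    return trie_to_regex(build_trie(strings))


if __name__ == "__main__":
    print(factor(["foobat", "foobang", "fooziit"]))
```

Feed it plain, unescaped strings (re.escape takes care of the dots). Following
the suggestion in the quoted message, you could lowercase the list first and
compile the result with re.IGNORECASE.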



