On 8 Dec 2004 15:39:15 -0800, Lonnie Princehouse <[EMAIL PROTECTED]> wrote:
> Regular expressions.
>
> It takes a while to craft the expressions, but this will be more
> elegant, more extensible, and considerably faster to compute (matching
> compiled re's is fast).
I think that this problem is probably a little harder than that. As the
OP noted, each ISP uses a different notation. A better solution may be a
statistical approach, possibly a custom Bayesian filter that could
"learn" a little about common patterns. The basic idea is as follows:

-- Break the URL into pieces, splitting not only on the dots, but also
on hyphens and underscores in the name.

-- Classify each piece, using REs to identify common patterns: frequent
strings (com, gov, net, org); normal words (sequences of letters);
normal numbers; combinations of numbers & letters. Common substrings can
also be identified (such as "isp" in the middle of one of the strings).

-- Check these pieces against the Bayesian filter, pretty much as it's
done for spam.

I think this approach is promising. It relies on the fact that real
servers usually do not have numbers in their names; exact
identification, whether by a literal match or by a regular expression,
is very difficult. I'm willing to try it, but first, more data is
needed.

-- 
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: [EMAIL PROTECTED]
mail: [EMAIL PROTECTED]
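To make the idea concrete, here is a minimal sketch in Python of the
three steps described above: split the hostname, map each token to a
coarse pattern class with REs, and score with a tiny naive-Bayes
classifier. The pattern names, labels, and training examples are all
made up for illustration; a real filter would need a proper corpus of
labelled hostnames.

```python
import math
import re
from collections import defaultdict

# Split hostnames on dots, hyphens, and underscores, as suggested above.
TOKEN_SPLIT = re.compile(r'[.\-_]')

# Coarse token classes; first match wins. "common" and "isp" are just
# illustrative choices, not a definitive list.
PATTERNS = [
    ('common', re.compile(r'^(com|net|org|gov|edu|isp)$')),
    ('number', re.compile(r'^\d+$')),
    ('word',   re.compile(r'^[a-z]+$')),
    ('mixed',  re.compile(r'^[a-z0-9]+$')),
]

def features(hostname):
    """Map each token of a hostname to a pattern-class feature."""
    feats = []
    for token in TOKEN_SPLIT.split(hostname.lower()):
        for name, pat in PATTERNS:
            if pat.match(token):
                feats.append(name)
                break
        else:
            feats.append('other')
    return feats

class NaiveBayes:
    """Very small naive-Bayes filter over token classes."""

    def __init__(self):
        self.counts = {'dynamic': defaultdict(int),
                       'static': defaultdict(int)}
        self.totals = {'dynamic': 0, 'static': 0}

    def train(self, hostname, label):
        for f in features(hostname):
            self.counts[label][f] += 1
            self.totals[label] += 1

    def classify(self, hostname):
        scores = {}
        for label in self.counts:
            score = 0.0
            for f in features(hostname):
                # Add-one smoothing over the five feature classes.
                p = (self.counts[label][f] + 1) / (self.totals[label] + 5)
                score += math.log(p)
            scores[label] = score
        return max(scores, key=scores.get)

# Toy usage with fabricated example hostnames:
nb = NaiveBayes()
nb.train('dsl-201-128-10-5.telco.example', 'dynamic')
nb.train('www.python.org', 'static')
print(nb.classify('cable-10-0-0-1.isp.example'))  # 'dynamic'
print(nb.classify('mail.python.org'))             # 'static'
```

The point of the feature mapping is exactly the one made above: the
filter never sees the raw ISP-specific strings, only the pattern
classes, so number-heavy names lean "dynamic" without a per-ISP regex.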