On 21/05/2007 11:46 PM, brad wrote:
> I am developing a list of 3 character strings like this:
>
> and
> bra
> cam
> dom
> emi
> mar
> smi
> ...
>
> The goal of the list is to have enough strings to identify files that
> may contain the names of people. Missing a name in a file is unacceptable.

The constraint that you have been given (no false negatives) is utterly
unrealistic. Given that constraint, forget the 3-letter substring
approach. There are many two-letter names. I have seen a genuine
instance of a one-letter surname ("O"). In jurisdictions which don't
disallow it, people can change their name to a string of digits. These
days you can't even rely on names starting with a capital letter
("i think paris hilton is <adjective> do u 2").

>
> For example, the string 'mar' would get marc, mark, mary, maria... 'smi'
> would get smith, smiley, smit, etc. False positives are OK (getting
> common words instead of people's names is OK).
>
> I may end up with a thousand or so of these 3 character strings.

If you get a large file of names and take every possible 3-letter
substring that you find, you would expect to get well over a thousand.

> Is that
> too much for an re.compile to handle?

Suck it and see. I'd guess that re.compile("mar|smi|jon|bro|wil....) is
*NOT* the way to go.

> Also, is this a bad way to
> approach this problem?

Yes. At the very least I'd suggest that you need to break up your file
into "words" and then consider whether each word is part of a "name".
Much depends on context, if you want to cut down on false positives --
"we went 2 paris n staid at the hilton", "the bill from the smith was
too high".

> Any ideas for improvement are welcome!

1. Get the PHB to come up with a more realistic constraint.
2. http://en.wikipedia.org/wiki/Named_entity_recognition

HTH,
John
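
P.S. A rough sketch of two of the points above, offered as an illustration rather than a recommended solution. The first function collects every 3-letter substring from a list of names, which shows both how fast the list grows and that names shorter than three letters contribute nothing. The second breaks text into "words" and looks each one up in a set, instead of compiling one huge alternation. KNOWN_NAMES, WORD_RE, trigrams and candidate_names are made-up names for this example, and the tiny name set stands in for a real file of names. Exact lookup still misses misspellings and any name not in the set, so it does not meet the "no false negatives" constraint either.

```python
import re

# Hypothetical sample data; a real run would load a large file of names.
KNOWN_NAMES = {"marc", "mark", "mary", "maria", "smith", "smiley", "smit", "o", "li"}

# A "word" here is a run of letters/digits, optionally joined by ' or -
# (an assumption made for this sketch; real text needs more care).
WORD_RE = re.compile(r"[^\W_]+(?:['-][^\W_]+)*")


def trigrams(names):
    """Return every distinct 3-letter substring found in the given names."""
    found = set()
    for name in names:
        lowered = name.lower()
        # Names shorter than 3 characters give an empty range here,
        # so they never produce a substring at all.
        for i in range(len(lowered) - 2):
            found.add(lowered[i:i + 3])
    return found


def candidate_names(text, known_names=KNOWN_NAMES):
    """Split text into words and return those that match a known name."""
    return [word for word in WORD_RE.findall(text)
            if word.lower() in known_names]


if __name__ == "__main__":
    print(sorted(trigrams(["mark", "mary", "li"])))
    print(candidate_names("the bill from the smith was too high"))
```

Note that the second call reports "smith" even though the sentence is about a blacksmith; that is the context problem mentioned above, and a set lookup does nothing to solve it.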