Steven D'Aprano wrote:
> I have a large number of strings (originally file names) which tend to
> fall into two groups. Some are human-meaningful, but not necessarily
> dictionary words, e.g.:
>
> baby lions at play
> saturday_morning12
> Fukushima
> ImpossibleFork
>
> (note that some use underscores, others spaces, and some CamelCase)
> while others are completely meaningless (or mostly so):
>
> xy39mGWbosjY
> 9sjz7s8198ghwt
> rz4sdko-28dbRW00u
>
> Let's call the second group "random" and the first "non-random", without
> getting bogged down in arguments about whether they are really random
> or not. I wish to process the strings and automatically determine
> whether each string is random or not. I need to split the strings into
> three groups:
>
> - those that I'm confident are random
> - those that I'm unsure about
> - those that I'm confident are non-random
>
> Ideally, I'll get some sort of numeric score so I can tweak where the
> boundaries fall.
>
> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>
> Note that false positives (detecting a meaningful non-random string as
> random) are worse for me than false negatives (miscategorising a random
> string as non-random).
>
> Does anyone have any suggestions for how to do this? Preferably
> something already existing. I have some thoughts and/or questions:
>
> - I think nltk has a "language detection" function; would that be
>   suitable?
>
> - If not nltk, are there other suitable language detection libraries?
>
> - Is this the sort of problem that neural networks are good at solving?
>   Anyone know a really good tutorial for neural networks in Python?
>
> - How about Bayesian filters, e.g. SpamBayes?
A dead simple approach -- look at the character pairs in real words and
calculate the ratio pairs-also-found-in-real-words / num-pairs:

$ cat score.py
import sys

WORDLIST = "/usr/share/dict/words"
SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()

def extract_pairs(text):
    # Yield every overlapping two-character slice of the text.
    for i in range(len(text) - 1):
        yield text[i:i+2]

def load_pairs():
    # Collect the set of all pairs that occur in the system word list.
    pairs = set()
    with open(WORDLIST) as f:
        for line in f:
            pairs.update(extract_pairs(line.strip()))
    return pairs

def get_score(text, popular_pairs):
    # Fraction of the text's pairs that also appear in real words.
    m = i = 0
    for i, p in enumerate(extract_pairs(text), 1):
        if p in popular_pairs:
            m += 1
    return m / i if i else 0.0  # guard against strings shorter than 2 chars

def main():
    popular_pairs = load_pairs()
    for text in sys.argv[1:] or SAMPLE:
        score = get_score(text, popular_pairs)
        print("%4.2f %s" % (score, text))

if __name__ == "__main__":
    main()

$ python3 score.py
0.65 baby lions at play
0.76 saturday_morning12
1.00 Fukushima
0.92 ImpossibleFork
0.36 xy39mGWbosjY
0.31 9sjz7s8198ghwt
0.31 rz4sdko-28dbRW00u

However:

$ python3 -c 'import random, sys; a = list(sys.argv[1]); random.shuffle(a); print("".join(a))' 'baby lions at play'
bnsip atl ayba loy
$ python3 score.py 'bnsip atl ayba loy'
0.65 bnsip atl ayba loy

--
https://mail.python.org/mailman/listinfo/python-list
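The shuffle example shows the weakness of a bare set-membership ratio: it treats a very common pair and a barely-attested pair the same. One possible refinement (my own sketch, not from the thread; `train_bigrams` and `avg_logprob` are names I made up) is to train smoothed character-bigram log-probabilities on the word list and score a string by its average per-bigram log-probability, so rare or unseen pairs actively pull the score down:

```python
import math
from collections import Counter

def train_bigrams(words):
    """Build a smoothed character-bigram log-probability function
    from an iterable of training words (e.g. /usr/share/dict/words)."""
    counts = Counter()
    alphabet = set()
    for w in words:
        w = w.lower()
        alphabet.update(w)
        for i in range(len(w) - 1):
            counts[w[i:i+2]] += 1
    total = sum(counts.values())
    # Add-one smoothing over all possible bigrams of the observed alphabet,
    # so unseen pairs get a small but nonzero probability.
    vocab = max(len(alphabet) ** 2, 1)
    def logprob(pair):
        return math.log((counts[pair] + 1) / (total + vocab))
    return logprob

def avg_logprob(text, logprob):
    """Average per-bigram log-probability; higher means more word-like."""
    pairs = [text[i:i+2].lower() for i in range(len(text) - 1)]
    if not pairs:
        return float("-inf")
    return sum(logprob(p) for p in pairs) / len(pairs)
```

Trained on a real word list, bigrams like "q9" or "zj" score far below common ones like "in" or "ng", which should typically separate keyboard mash from English-like text more sharply than the plain ratio; the resulting number is also a natural fit for the "tweakable thresholds" Steven asked for.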