I have a large number of strings (originally file names) which tend to fall into two groups. Some are human-meaningful, but not necessarily dictionary words, e.g.:
    baby lions at play
    saturday_morning12
    Fukushima
    ImpossibleFork

(note that some use underscores, others spaces, and some CamelCase), while others are completely meaningless (or mostly so):

    xy39mGWbosjY
    9sjz7s8198ghwt
    rz4sdko-28dbRW00u

Let's call the second group "random" and the first "non-random", without getting bogged down in arguments about whether they are really random or not.

I wish to process the strings and automatically determine whether each string is random or not. I need to split the strings into three groups:

- those that I'm confident are random
- those that I'm unsure about
- those that I'm confident are non-random

Ideally, I'll get some sort of numeric score so I can tweak where the boundaries fall (see the rough sketch at the end of this message). Strings are *mostly* ASCII but may include a few non-ASCII characters.

Note that false positives (detecting a meaningful non-random string as random) are worse for me than false negatives (miscategorising a random string as non-random).

Does anyone have any suggestions for how to do this? Preferably something that already exists. I have some thoughts and/or questions:

- I think nltk has a "language detection" function; would that be suitable?
- If not nltk, are there other suitable language detection libraries?
- Is this the sort of problem that neural networks are good at solving? Anyone know a really good tutorial for neural networks in Python?
- How about Bayesian filters, e.g. SpamBayes?

-- 
Steven
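
P.S. To make the "numeric score plus two thresholds" idea concrete, here is a rough sketch of the sort of interface I have in mind. It scores each string by the average log-frequency of its character bigrams, using a tiny hand-picked list of names I already know are meaningful as the "training" data. The training list, the crude smoothing, the threshold values and all the names are just placeholders for illustration; a real solution would presumably train on a proper wordlist or corpus and might look quite different.

import math
from collections import Counter

# Tiny stand-in training set -- a real version would use a proper wordlist
# or a large corpus of known-good file names.
KNOWN_GOOD = [
    "baby lions at play",
    "saturday_morning12",
    "Fukushima",
    "ImpossibleFork",
]

def bigrams(s):
    # Character pairs, case-folded: "Fork" -> ('f','o'), ('o','r'), ('r','k')
    s = s.lower()
    return list(zip(s, s[1:]))

# Count character bigrams over the known-good names.
COUNTS = Counter(bg for name in KNOWN_GOOD for bg in bigrams(name))
TOTAL = sum(COUNTS.values())

def score(name):
    # Average log-frequency of the string's character bigrams; higher
    # (closer to zero) means the string looks more like the training names.
    # Crude add-one smoothing keeps unseen bigrams from giving -infinity.
    pairs = bigrams(name)
    if not pairs:
        return float("-inf")
    return sum(math.log((COUNTS[bg] + 1) / (TOTAL + 1)) for bg in pairs) / len(pairs)

# The two knobs I want to be able to tweak. These values are eyeballed for
# this toy training set; with real training data they would sit elsewhere.
RANDOM_BELOW = -3.9
NONRANDOM_ABOVE = -3.5

def classify(name):
    s = score(name)
    if s < RANDOM_BELOW:
        return "random"
    if s > NONRANDOM_ABOVE:
        return "non-random"
    return "unsure"

for name in ["baby lions at play", "xy39mGWbosjY"]:
    print(name, round(score(name), 2), classify(name))

The point is only the shape of the thing: a score function I can inspect, plus two thresholds I can move to trade false positives against false negatives.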