Am 21.12.15 um 09:24 schrieb Peter Otten:
Steven D'Aprano wrote:
I have a large number of strings (originally file names) which tend to
fall into two groups. Some are human-meaningful, but not necessarily
dictionary words e.g.:
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
(note that some use underscores, others spaces, and some CamelCase) while
others are completely meaningless (or mostly so):
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
Let's call the second group "random" and the first "non-random", without
getting bogged down into arguments about whether they are really random or
not. I wish to process the strings and automatically determine whether
each string is random or not. I need to split the strings into three
groups:
- those that I'm confident are random
- those that I'm unsure about
- those that I'm confident are non-random
Ideally, I'll get some sort of numeric score so I can tweak where the
boundaries fall.
Strings are *mostly* ASCII but may include a few non-ASCII characters.
Note that false positives (detecting a meaningful non-random string as
random) is worse for me than false negatives (miscategorising a random
string as non-random).
Does anyone have any suggestions for how to do this? Preferably something
already existing. I have some thoughts and/or questions:
- I think nltk has a "language detection" function, would that be
suitable?
- If not nltk, are there are suitable language detection libraries?
- Is this the sort of problem that neural networks are good at solving?
Anyone know a really good tutorial for neural networks in Python?
- How about Bayesian filters, e.g. SpamBayes?
A dead simple approach -- look at the pairs in real words and calculate the
ratio
pairs-also-found-in-real-words/num-pairs
Sounds reasonable. Building on this approach, two simple improvements:
- calculate the log-likelihood instead, which also makes use of the
frequency of the digraphs in the training set
- Use trigraphs instead of digraphs
- preprocess the string (lowercase), but more sophisticated
preprocessing could be an option (i.e. converting under_scores and
CamelCase to spaces)
The main reason for the low score of the baby lions is the space
character, I think - the word list does not contain that much spaces.
Maybe one should feed in some long wikipedia article to calculate the
digraph/trigraph probabilities
=====================================
Apfelkiste:Tests chris$ cat score_my.py
from __future__ import division
from collections import Counter, defaultdict
from math import log
import sys
WORDLIST = "/usr/share/dict/words"
SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()
def extract_pairs(text):
for i in range(len(text)-1):
yield text.lower()[i:i+2]
# or len(text)-2 and i:i+3
def load_pairs():
pairs = Counter()
with open(WORDLIST) as f:
for line in f:
pairs.update(extract_pairs(line.strip()))
# normalize to sum
total_count = sum([pairs[x] for x in pairs])
N = total_count+len(pairs)
dist = defaultdict(lambda:1/N, ((x, (pairs[x]+1)/N) for x in pairs))
return dist
def get_score(text, dist):
ll = 0
for i, x in enumerate(extract_pairs(text), 1):
ll += log(dist[x])
return ll / i
def main():
pair_dist = load_pairs()
for text in sys.argv[1:] or SAMPLE:
score = get_score(text, pair_dist)
print("%.3g %s" % (score, text))
if __name__ == "__main__":
main()
Apfelkiste:Tests chris$ python score_my.py
-8.74 baby lions at play
-7.63 saturday_morning12
-6.38 Fukushima
-5.72 ImpossibleFork
-10.6 xy39mGWbosjY
-12.9 9sjz7s8198ghwt
-12.1 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-9.43 bnsip atl ayba loy
Apfelkiste:Tests chris$
and using trigraphs:
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-12.5 bnsip atl ayba loy
Apfelkiste:Tests chris$ python score_my.py
-11.5 baby lions at play
-9.88 saturday_morning12
-9.85 Fukushima
-7.68 ImpossibleFork
-13.4 xy39mGWbosjY
-14.2 9sjz7s8198ghwt
-14.2 rz4sdko-28dbRW00u
==============================
--
https://mail.python.org/mailman/listinfo/python-list