On 21.12.15 at 11:36, Steven D'Aprano wrote:
On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:
Apfelkiste:Tests chris$ python score_my.py
-8.74 baby lions at play
-7.63 saturday_morning12
-6.38 Fukushima
-5.72 ImpossibleFork
-10.6 xy39mGWbosjY
-12.9 9sjz7s8198ghwt
-12.1 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-9.43 bnsip atl ayba loy
Thanks Christian and Peter for the suggestion, I'll certainly investigate
this further.
But the scoring doesn't seem very good. "baby lions at play" is 100% English
words, and ought to have a radically different score from (say)
xy39mGWbosjY, which is extremely un-English. (How many English words
do you know of with W, X, two Ys, and J?) And yet they are only two units
apart. "baby lions..." has a score almost as negative as the authentic
gibberish, while Fukushima (a Japanese word) has a much less negative
score.
The culprit is the spaces, which do not occur in the training wordlist (I
mentioned that above, maybe not prominently enough):
/usr/share/dict/words contains one word per line. The underscore _ is
probably what drags "saturday_morning12" down, just as the spaces drag
"baby lions at play" down. Using trigraphs:
Apfelkiste:Tests chris$ python score_my.py
-11.5 baby lions at play
-9.88 saturday_morning12
-9.85 Fukushima
-7.68 ImpossibleFork
-13.4 xy39mGWbosjY
-14.2 9sjz7s8198ghwt
-14.2 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'babylionsatplay'
-8.74 babylionsatplay
Apfelkiste:Tests chris$ python score_my.py 'saturdaymorning12'
-8.93 saturdaymorning12
Apfelkiste:Tests chris$
So for the spaces, either use proper training material (some long
corpus from Wikipedia or similar) with punctuation removed, so that the
model picks up the correct probabilities at word boundaries, or
preprocess the input by removing the spaces.
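
To illustrate the second option, here is a minimal sketch of a character
trigraph scorer with a separator-stripping step. It is not the actual
score_my.py; the names train, score and strip_separators, the "^"/"$"
padding, the add-one smoothing and the alphabet_size of 64 are all
assumptions made for the example.

```python
import math
import re
from collections import Counter


def train(words, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts = Counter(), Counter()
    for word in words:
        # "^" and "$" mark word start and end (an assumed convention).
        padded = "^" * (n - 1) + word.lower() + "$"
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def score(text, model, n=3, alphabet_size=64):
    """Mean log-probability per n-gram, with add-one smoothing.

    alphabet_size is an assumed smoothing constant, not a measured value.
    """
    ngrams, contexts = model
    padded = "^" * (n - 1) + text.lower() + "$"
    total = 0.0
    count = 0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + alphabet_size)
        total += math.log(p)
        count += 1
    return total / count


def strip_separators(text):
    """Remove whitespace and underscores, which a one-word-per-line
    wordlist never contains."""
    return re.sub(r"[\s_]+", "", text)


if __name__ == "__main__":
    # Tiny stand-in wordlist; a real run would read a full dictionary.
    model = train(["baby", "lions", "at", "play", "saturday", "morning"])
    for s in ["baby lions at play", "saturday_morning12", "xy39mGWbosjY"]:
        print("%.2f  %.2f  %s" % (score(s, model),
                                  score(strip_separators(s), model), s))
```

The first column is the raw score, the second the score after stripping.
With a real wordlist you would pass the lines of /usr/share/dict/words to
train() instead of the six stand-in words.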
Christian
--
https://mail.python.org/mailman/listinfo/python-list