Hello, (scroll to PYTHON TEST for a test)
I've taken another look at the Bayesian filter (it wasn't my "task" :-) but it's my pleasure). Ok, to start: Reverend tokenizes the training texts and works only at token level, not sub-token level. So we should not expect it to detect "c0mputer" as "computer" (quite a common mistake yesterday, I think). (I was working on a high-level mathematical description, but I will postpone it -until I have checked a few more things- or just leave it for the Pythoner who does it at the next Meetup ;-) )

PYTHON TEST

car...@pinux:~/bayes$ ls training/
bash  c++  python

Each of these directories contains between 18 and 29 files that I copied randomly from different places on my hard disk. Then I have:

car...@pinux:~/bayes$ ls guessing/
demanar.py  keymap.sh  medium.py  qdacco.cpp
car...@pinux:~/bayes$

some other files that I've copied there... The Bayesian filter never sees the name of the file. Training on just this set, look at the results:

----- Start test
./guessing/qdacco.cpp
[('c++', 0.6590693537529797), ('python', 0.59287521198182513), ('bash', 0.28091954259046653)]
./guessing/demanar.py
[('python', 0.58882188718297557), ('c++', 0.57869106382644175), ('bash', 0.36380374534210203)]
./guessing/keymap.sh
[('bash', 0.54270073170250122), ('c++', 0.47142124856042872), ('python', 0.36321294599284148)]
./guessing/main.py
[('python', 0.65909707358336711), ('c++', 0.52731742496139433), ('bash', 0.3261511618248264)]

I consider that quite good. bayes.py is 30 lines long -it could be less- and it works pretty well, even with only parts of a program (don't tell me to check for #include, #!/bin/bash or #!/usr/bin/python: not needed at all, it works with snippets of code, etc.). Yes, there is one case where it guesses Python with c++ not far behind.
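In case anyone wants to play with the idea without installing Reverend: below is a minimal pure-Python sketch of what a naive Bayes guesser like bayes.py does (this is NOT my bayes.py and not Reverend's code; the class and method names are made up, only the train/guess shape mimics Reverend's interface).

```python
# A toy naive Bayes text classifier -- an illustrative sketch only,
# not Reverend and not the actual bayes.py from the post.
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Split on non-word characters. A code-aware tokenizer would also
    # treat "=", "(", ")" and friends as separators (see below).
    return re.findall(r"\w+", text.lower())

class TinyBayes:
    def __init__(self):
        self.counts = defaultdict(Counter)  # pool name -> token counts
        self.totals = Counter()             # pool name -> total tokens

    def train(self, pool, text):
        tokens = tokenize(text)
        self.counts[pool].update(tokens)
        self.totals[pool] += len(tokens)

    def guess(self, text):
        tokens = tokenize(text)
        vocab = len({t for c in self.counts.values() for t in c})
        scores = {}
        for pool in self.counts:
            # Sum log-probabilities with add-one smoothing so an
            # unseen token does not zero out the whole pool.
            logp = 0.0
            for t in tokens:
                p = (self.counts[pool][t] + 1) / (self.totals[pool] + vocab)
                logp += math.log(p)
            scores[pool] = logp
        # Best guess first, like Reverend's guess() output.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

guesser = TinyBayes()
guesser.train("python", "def main(): print 'hello' import sys")
guesser.train("bash", "#!/bin/bash echo hello for f in *; do done")
print(guesser.guess("import os\ndef run(): pass")[0][0])  # expect "python"
```

With real training directories you would just walk the files and call train(directory_name, file_contents) for each one.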
I probably need a bigger data set, but even then, if it guesses "quite well" then it's "quite good" :) (I'm thinking, for example, of a service like pastebin that would guess the language of the code you are copy-pasting there, and if you correct the guess, it could train itself on the new code.)

My training sets are very noisy, and I should subclass Reverend and improve the tokenizer to treat "=", "(", ")" and other characters as separators, since right now a line like:

linia=random.randint(1,float(total_paraules))

is a single token... The literals should probably be removed as well.

I'm taking a look at the statistics part. Here is a good starting point: http://en.wikipedia.org/wiki/Naive_Bayes_classifier

Cheers,

-- 
Carles Pina i Estany
	http://pinux.info
_______________________________________________
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk