Hi, I am trying to use SpamAssassin's tokenization results with other machine learning methods, such as SVM. The output of "sa-learn --dump" gives token frequencies aggregated over all ham or spam messages, not on a per-message basis. The token counts I want are in the following format:

Tokens   msg0  msg1  ...  msgM
token1   10    6     ...  0
......
tokenN   20    1     ...  2

If per-message data is not available in the current design, is there any way to use SpamAssassin to do the tokenization only, so that I can then use my own statistical model for the classification?

Thanks,
Qian
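
P.S. To make the format I am after concrete, here is a minimal Python sketch that builds such a token-by-message count table. Note the assumptions: the tokenize() function is only a naive placeholder (lowercased alphanumeric runs), not SpamAssassin's Bayes tokenizer, and the names tokenize and token_message_matrix are my own, purely for illustration. Ideally the placeholder would be replaced by whatever SpamAssassin uses internally.

```python
import re
from collections import Counter


def tokenize(text):
    # Placeholder tokenizer (assumption): lowercased alphanumeric runs.
    # This is NOT SpamAssassin's tokenizer; it only stands in for it.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_message_matrix(messages):
    """Return (header, rows); each row is [token, count_in_msg0, ..., count_in_msgM]."""
    # One Counter per message keeps the counts on a per-message basis.
    per_message = [Counter(tokenize(m)) for m in messages]
    # The vocabulary is the sorted union of tokens seen in any message.
    vocabulary = sorted(set().union(*per_message))
    header = ["Tokens"] + [f"msg{i}" for i in range(len(messages))]
    # A Counter returns 0 for tokens that do not occur in a message.
    rows = [[tok] + [counts[tok] for counts in per_message] for tok in vocabulary]
    return header, rows


if __name__ == "__main__":
    messages = ["cheap pills cheap", "meeting at noon", "cheap meeting"]
    header, rows = token_message_matrix(messages)
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(cell) for cell in row))
```

With the three sample messages above, the row for "cheap" has counts 2, 0 and 1. Each column could then be fed to an SVM or another classifier as one feature vector.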