[SAtalk] Bayes learning on double-byte Email?

Genchev, Sergei Thu, 19 Jun 2003 15:57:23 -0700

 Hi everyone,

 Our company gets a lot of legitimate and not-so-legitimate E-mail in
Chinese. Our Taiwan office has quite a bit more spam slip through than
our US and European offices do.
 Having read a lot of warnings about using a UTF-8 locale, I am running SA
2.55 with LANG=en_US on RH8.
 Does it make any sense to feed Chinese (mostly HTML) E-mail as spam/ham to
Bayes? Would Bayes learn Chinese words as meaningless single-byte "words"?
Does it matter? Should I try a UTF-8 locale instead? Any experiences would be
greatly appreciated, especially from mail admins whose offices handle a lot
of double-byte mail.
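
 To make the concern concrete, here is a rough Python sketch of what a
locale-unaware tokenizer might see. It is an illustration only and NOT
SpamAssassin's actual Bayes tokenizer: the helper name byte_tokens, the sample
text, and the assumption that tokens are split on ASCII whitespace alone are
all hypothetical.

```python
import re

def byte_tokens(raw: bytes) -> list:
    """Split raw message bytes on ASCII whitespace only.

    Hypothetical stand-in for a tokenizer running under a single-byte
    locale; it is not SpamAssassin code.
    """
    # latin-1 maps every byte to exactly one character, so this mimics
    # viewing double-byte text one byte at a time.
    text = raw.decode("latin-1")
    return [t for t in re.split(r"[ \t\r\n]+", text) if t]

spaced = "免費 贈品"    # "free gift", with an ASCII space between the words
unspaced = "免費贈品"   # the same words as they would normally be written

tokens_big5 = byte_tokens(spaced.encode("big5"))
tokens_utf8 = byte_tokens(spaced.encode("utf-8"))
tokens_run = byte_tokens(unspaced.encode("big5"))

# Each token is an opaque run of bytes, not a sequence of characters.
for name, toks in (("big5", tokens_big5), ("utf-8", tokens_utf8), ("run", tokens_run)):
    print(name, [len(t) for t in toks])
```

 Under these assumptions the byte runs stay intact, so they could still act
as consistent tokens, but the same word yields different tokens in Big5 and
UTF-8, and unspaced Chinese text collapses into one long token.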


 A related (I think) Bayes question: if the E-mail body is HTML, does sa-learn
use "body" or "rawbody" when scoring words?

 Thank you very much,

Sergei Genchev





_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
