Craig R Hughes writes:

> Of course, maybe the original author could be convinced to offer an
> Artistic license on his work, then the problem would magically go
> away.

I'll ask.

> I imagine this would probably happen more frequently in email than
> in "normal" text, since emails tend to use abbreviations, weird
> characters, shorthand, slang, etc. more than most document formats.

Perhaps somewhat, but the list of possible matches is usually a very
small subset of the total list of languages and almost always includes
the correct language. TextCat looks at common characters and letter
combinations, so it doesn't need a whole lot of text. And if there's
not enough text, no matches are produced and you just skip the rule.

  $ echo "btw, how do you feel about lunch?" | text_cat
  breton or manx or welsh or scots or czech-iso8859_2 or english or middle_frisian

> 1. What is the overhead of the language-analyzer? How fast does it
> run over a typical message?

It's high, so the rule might want to be optional (I would use it since
I'm running as a single user), but I haven't done any optimization. I
think the overhead could be reduced. For one thing, the language files
have to be parsed and analyzed on each run; that would be better done
at install-time. SA could also be a lot more efficient about how rules
get run. For example, my new test adds a 3rd instance of
get_decoded_stripped_body_text_array(). That can't be good.

20 messages (this is on a laptop, folks) with rule:

  real    0m51.299s
  user    0m39.770s
  sys     0m1.450s

20 messages without rule:

  real    0m37.418s
  user    0m25.890s
  sys     0m1.430s

> 2. What is the footprint in disk/memory consumption? Does it have
> to load a dictionary per language in order to be able to ID those
> languages? That could be a heavy load to add for many SA users.

308k of disk usage (du -sk) for the language files. As is, it adds
about 1MB to the memory usage of the Perl process.

> 3. The tests you've done would be way more interesting with a more
> international set of sample messages, and with ok_locales != 'en' and
> ok_languages != "english". Any European volunteers to try this out on
> their mailboxes?
> I'm guessing there's more multilingual mail and
> probably less difference between languages there. I'm betting those
> 17 messages are very much not English
> (korean/chinese/russian/spanish?) and are more easily distinguished
> than say French from Italian.

Yes, but TextCat errs on the side of producing extra possible matches.
Since we're using it to eliminate messages with a positive rule, there
is no real issue. Now, a rule that said "only let exact matches
through" would be a problem.

My 17 messages are a small sample, but you can see how it works:

  chinese-big5 chinese-big5 chinese-big5 chinese-gb2312 chinese-big5 japanese-euc_jp chinese-gb2312 japanese-euc_jp chinese-big5 korean french indonesian turkish german slovenian-iso8859_2 scots danish korean korean korean korean korean korean chinese-big5 russian-koi8_r russian-windows1251 spanish spanish catalan

Dan
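
For anyone who wants the rule logic spelled out, here is a minimal
sketch of the decision described above. It is in Python rather than
SA's Perl and is not the actual test code. The function names
parse_candidates and rule_fires are hypothetical, and the sketch
assumes that "no matches" shows up as empty output and that candidates
are separated by " or " as in the text_cat example. The rule is skipped
when TextCat has no candidates and fires only when none of the
candidates is in ok_languages.

```python
def parse_candidates(textcat_output):
    """Split a line like 'breton or manx or english' into language names.

    Assumption: candidates are joined by ' or ', as in the text_cat
    example output, and an empty string means no match was produced.
    """
    return [name.strip() for name in textcat_output.split(" or ") if name.strip()]


def rule_fires(textcat_output, ok_languages):
    """Return True only if candidates exist and none is an accepted language."""
    candidates = parse_candidates(textcat_output)
    if not candidates:
        # Not enough text to guess from: skip the rule, add no score.
        return False
    # Extra candidates are harmless: one acceptable match is enough to pass.
    return not any(lang in ok_languages for lang in candidates)


if __name__ == "__main__":
    ok = {"english"}
    lunch = ("breton or manx or welsh or scots or czech-iso8859_2 "
             "or english or middle_frisian")
    print(rule_fires(lunch, ok))      # english is among the candidates
    print(rule_fires("korean", ok))   # no acceptable candidate
    print(rule_fires("", ok))         # no candidates at all
```

Because the check only asks whether any candidate is acceptable, the
extra guesses TextCat tends to produce make the rule more lenient, not
more likely to misfire on legitimate mail.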