Craig R Hughes writes:

> Of course, maybe the original author could be convinced to offer an
> Artistic license on his work, then the problem would magically go
> away.

I'll ask.

> I imagine this would probably happen more frequently in email than
> in "normal" text, since emails tend to use abbreviations, weird
> characters, shorthand, slang, etc. more than most document formats.

Perhaps somewhat, but the list of possible matches is usually a very
small subset out of the total list of languages and almost always
includes the correct language.  TextCat looks at common characters and
letter combinations, so it doesn't need a whole lot of text.  And if
there's not enough, no matches are produced and you just skip the rule.

  $ echo "btw, how do you feel about lunch?" | text_cat
  breton or manx or welsh or scots or czech-iso8859_2 or english or middle_frisian
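
The "common characters and letter combinations" idea can be sketched as a ranked character n-gram comparison. This is a minimal illustration only, assuming a rank-order ("out-of-place") distance. It is not TextCat's actual code, and the names `ngram_profile`, `out_of_place` and `candidates`, and the `top` and `threshold` defaults, are all hypothetical. It also omits the "not enough text, no matches" case described above.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams (1..max_n), most common first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by count (descending), then alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def candidates(text, lang_profiles, threshold=1.05):
    """Return every language whose distance is within `threshold` of the best score."""
    doc = ngram_profile(text)
    scores = {lang: out_of_place(doc, prof) for lang, prof in lang_profiles.items()}
    best = min(scores.values())
    return sorted(lang for lang, score in scores.items() if score <= best * threshold)
```

Returning every language close to the best score, rather than a single winner, is what yields a short "a or b or c" candidate list like the one above.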

> 1. What is the overhead of the language-analyzer?  How fast does it
> run over a typical message?

It's high, so the rule might want to be optional (I would use it since
I'm running as a single user), but I haven't done any optimization yet
and I think the overhead could be reduced.  For one thing, the language
files have to be parsed and analyzed on every run; that would be better
done once at install time.

SA could be a lot more efficient about how rules get run.  For example, my
new test adds a 3rd instance of get_decoded_stripped_body_text_array().
That can't be good.

20 messages (this is on a laptop, folks) with rule:

real    0m51.299s
user    0m39.770s
sys     0m1.450s

20 messages without rule:

real    0m37.418s
user    0m25.890s
sys     0m1.430s
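
For reference, the two timings above work out to about 0.69 seconds of extra wall-clock time per message, roughly 37% more than without the rule. A small sketch of that arithmetic, using only the figures quoted above (the variable and function names are just for illustration):

```python
MESSAGES = 20

# Timings in seconds, copied from the two runs above.
with_rule = {"real": 51.299, "user": 39.770, "sys": 1.450}
without_rule = {"real": 37.418, "user": 25.890, "sys": 1.430}


def per_message_overhead(with_times, without_times, messages):
    """Extra seconds per message attributable to the rule, per timing category."""
    return {
        key: (with_times[key] - without_times[key]) / messages
        for key in with_times
    }


overhead = per_message_overhead(with_rule, without_rule, MESSAGES)
relative = (with_rule["real"] - without_rule["real"]) / without_rule["real"]

for key, seconds in overhead.items():
    print(f"{key}: {seconds:.3f} s per message")
print(f"relative increase in real time: {relative:.1%}")
```

Almost all of the difference is user time, which fits the cost being CPU work in the Perl process rather than I/O.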

> 2. What is the footprint in disk/memory consumption?  Does it have
> to load a dictionary per language in order to be able to ID those
> languages?  That could be a heavy load to add for many SA users.

308k of disk usage (du -sk) for the language files.  As is, it adds
about 1MB to the memory usage of the Perl process.

> 3. The tests you've done would be way more interesting with a more
> international set of sample messages, and with ok_locales != 'en' and
> ok_languages != "english".  Any European volunteers to try this out on
> their mailboxes?  I'm guessing there's more multilingual mail and
> probably less difference between languages there.  I'm betting those
> 17 messages are very much not English
> (korean/chinese/russian/spanish?) and are more easily distinguished
> than say French from Italian.

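Yes, but TextCat errs on the side of producing extra possible matches.
Since we're using it only to flag messages whose candidate list contains
no acceptable language (a positive-scoring rule), extra candidates make
the rule more lenient, not less accurate, so there is no real issue.
Now, a rule that let only exact matches through would be a problem.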

My 17 messages are a small sample, but you can see how it works:

  chinese-big5
  chinese-big5
  chinese-big5
  chinese-gb2312 chinese-big5 japanese-euc_jp
  chinese-gb2312 japanese-euc_jp chinese-big5 korean
  french
  indonesian turkish german slovenian-iso8859_2 scots danish
  korean
  korean
  korean
  korean
  korean
  korean chinese-big5
  russian-koi8_r
  russian-windows1251
  spanish
  spanish catalan
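
To make the "extra matches are harmless" point concrete, here is a rough sketch of the rule logic applied to the 17 candidate lists above. It is an illustration under assumptions, not the actual SpamAssassin test: `is_unwanted` is a hypothetical helper, and language names are compared as whole strings.

```python
# Candidate lists for the 17 sample messages, copied from the output above.
CANDIDATES = [
    ["chinese-big5"],
    ["chinese-big5"],
    ["chinese-big5"],
    ["chinese-gb2312", "chinese-big5", "japanese-euc_jp"],
    ["chinese-gb2312", "japanese-euc_jp", "chinese-big5", "korean"],
    ["french"],
    ["indonesian", "turkish", "german", "slovenian-iso8859_2", "scots", "danish"],
    ["korean"],
    ["korean"],
    ["korean"],
    ["korean"],
    ["korean"],
    ["korean", "chinese-big5"],
    ["russian-koi8_r"],
    ["russian-windows1251"],
    ["spanish"],
    ["spanish", "catalan"],
]


def is_unwanted(candidates, ok_languages):
    """True only if there are candidates and none of them is an accepted language."""
    if not candidates:
        return False  # not enough text to guess: skip the rule
    return not any(lang in ok_languages for lang in candidates)


def count_flagged(candidate_lists, ok_languages):
    """Number of messages the rule would fire on."""
    return sum(is_unwanted(c, ok_languages) for c in candidate_lists)


print(count_flagged(CANDIDATES, {"english"}))
print(count_flagged(CANDIDATES, {"english", "spanish"}))
```

With only "english" accepted, the rule fires on all 17 messages; accepting "spanish" as well drops that to 15. A longer candidate list can only make a message less likely to be flagged, never more.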

Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
