Thanks again for the explanation. Looking forward to a future release in which soft hyphens (and perhaps additional control characters?) are essentially ignored.
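To make "essentially ignored" concrete, here is a minimal sketch of the difference between turning a soft hyphen into a hard hyphen and dropping it before the text is split into words. It is written in Python purely for illustration and is not ASSP's code; the function names (decode_qp_latin1, tokens) and the simple letter-run tokenizer are assumptions made up for this example.

```python
import html
import quopri
import re

SOFT_HYPHEN = "\u00ad"  # what =AD (quoted-printable, iso-8859-1) and &shy; (HTML) decode to


def decode_qp_latin1(raw: bytes) -> str:
    """Decode a quoted-printable body part that is declared as iso-8859-1."""
    return quopri.decodestring(raw).decode("iso-8859-1")


def tokens(text: str, drop_soft_hyphen: bool) -> list:
    """Split text into runs of letters (a stand-in for a real tokenizer).

    drop_soft_hyphen=True  removes U+00AD, so the word stays in one piece.
    drop_soft_hyphen=False turns U+00AD into "-", which then splits the word.
    """
    text = html.unescape(text)  # resolves &shy; to U+00AD
    replacement = "" if drop_soft_hyphen else "-"
    text = text.replace(SOFT_HYPHEN, replacement)
    return re.findall(r"[^\W\d_]+", text)
```

With the plain-text example from the original post, tokens(decode_qp_latin1(b"sp=ADammy p=ADhr=ADase"), drop_soft_hyphen=False) yields the fragments sp, ammy, p, hr, ase, while drop_soft_hyphen=True yields spammy and phrase.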
On Wed, Sep 7, 2022 at 9:14 AM Thomas Eckardt <thomas.ecka...@thockar.com> wrote:

> If Unicode normalization (NFKC) doesn't fulfill your requirement, you may
> enable 'DoTransliterate', accepting some performance penalty.
>
> "Unicode Technical Standard #39" (http://www.unicode.org/reports/tr39/)
> will give you some more information, and
> https://www.unicode.org/Public/security/revision-05/intentional.txt shows
> a nice table for Cyrillic and Greek.
> If someone expects an ASCII mail, those translations may help somewhat.
> But in all other cases (100% Cyrillic/Greek/...), such a character
> replacement is counterproductive (for example, not all Cyrillic letters
> have a valid Latin replacement).
>
> > potentially treat look-alike characters as the latin character for
> > bayesian purposes
>
> The HMM and Bayesian engines use heuristic mechanisms. Trying to treat
> single characters as Latin (or anything else) is not worth the effort.
> Over a short period of time, both engines will also have learned the
> obscured words (word combinations).
>
> Thomas
>
>
> From: "K Post" <nntp.p...@gmail.com>
> To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date: 06.09.2022 21:31
> Subject: Re: [Assp-test] soft hyphen fooling Bayesian analysis
> ------------------------------
>
> Eager to see what you come up with in terms of ignoring the soft hyphen.
>
> Your <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>> regex is clear, and I
> understand using that for scoring purposes, but I'm looking for a way to
> potentially treat look-alike characters as the Latin character for
> Bayesian purposes and/or to catch commonly obscured words (like
> GeekSquad). Is it okay if I reply further in my August 1 post here to
> keep that in the same thread?
>
> On Tue, Sep 6, 2022 at 2:06 PM Thomas Eckardt
> <thomas.ecka...@thockar.com> wrote:
>
> >HTML::strip
>
> HTML parsing to get text parts has nothing to do with HTML de(en)coding.
>
> >iso-8859-1
>
> ASSP processes all content as UTF-8.
>
> >&shy;
>
> ASSP is aware of this: it replaces soft hyphens with hard hyphens, and
> multiple consecutive hard hyphens with a single one.
> However, the option to remove the soft hyphens instead sounds somewhat
> better. Tests are still running.
>
> >My thinking is that if it doesn't display.....
>
> ASSP doesn't know whether something is displayed or not (and will never
> know it).
>
> >I suspect that other characters will be abused in the same way
>
> Soft hyphens, as well as several BIG5, numeric and other Unicode
> characters, are already specially handled by ASSP. Other CTL chars are
> ignored by ASSP. Everything is converted to UTF-8, Unicode normalized
> (including grapheme clusters), stemmed and simplified.
>
> >This kind of obfuscation goes hand in hand with my previous questions
> >about considering some non-Latin characters that look like Latin
> >characters as those Latin alphabet characters.
>
> With some Unicode knowledge, some help from the analyzer and some regex
> knowledge, such things are easy to find.
> For example: <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>>
> finds sequences where Cyrillic letters (a p b ....) are used in words,
> a trick commonly used by spammers.
>
> Thomas
>
>
> From: "K Post" <nntp.p...@gmail.com>
> To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date: 06.09.2022 16:16
> Subject: [Assp-test] soft hyphen fooling Bayesian analysis
> ------------------------------
>
> Is there a way to improve the way that ASSP parses certain special,
> non-printing characters? I'm having trouble with spam emails slipping
> through that have their body heavily obfuscated with "soft hyphens".
> They all seem to have multipart bodies, first an iso-8859-1 text part
> with =AD interspersed in words and then an HTML part with &shy; all
> over the place. These are the "soft hyphen," a hyphen that only prints
> if it is needed to break the word to the next line. It's clever. The
> user doesn't see the character, but ASSP thinks it's a word boundary.
>
> The first part
>     Content-Type: text/plain; charset="iso-8859-1"
>     Content-Transfer-Encoding: quoted-printable
> will be plain text and have spammy words with =AD inserted in the
> middle of them. For example, "This is a sentence with spammy phrase."
> could be written something like
>     This is a sentence with sp=ADammy p=ADhr=ADase.
>
> The next MIME part is the HTML, which does the same thing, but uses
> &shy; (HTML for soft hyphen) mid-word. So, something like:
>     <p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>
>
> The whole body of the message is filled with these soft hyphens anywhere
> that there are spammy words/phrases, and in many cases there are soft
> hyphens every couple of letters across the entire body. When I do an
> analysis, it appears that the soft hyphen tricks ASSP into thinking that
> each part of the word is a separate word, so for sp&shy;ammy
> p&shy;hr&shy;ase, it thinks the words are
>     sp ammy p hr ase
>
> I am using HTML::strip. Would TreeBuilder work better? I'm concerned
> about performance there.
>
> Is there a way (and is it a good idea) to somehow instruct ASSP to treat
> certain HTML special characters as ones to ignore, and others as word
> separators? My thinking is that if it doesn't display, then it should be
> ignored when doing Bayesian / HMM evaluation.
>
> https://cs.stanford.edu/people/miles/iso8859.html has a bunch of
> Control Characters and Special Characters that don't print - or, in the
> case of the soft hyphen, only print when the containing word is at the
> end of a line. I suspect that other characters will be abused in the
> same way.
>
> This kind of obfuscation goes hand in hand with my previous questions
> about considering some non-Latin characters that look like Latin
> characters as those Latin alphabet characters.
>
> Thanks
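For reference, the look-alike question discussed in this thread can be sketched as follows. This is not Thomas's <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>> pattern and not ASSP code: Python's standard re module has no \p{Cyrillic}, so this narrower illustration uses Unicode character names to flag words that mix Latin and Cyrillic letters. The helper names (script_of, mixed_script_words) are hypothetical.

```python
import re
import unicodedata


def script_of(ch: str) -> str:
    """Rough script guess from the Unicode character name (an approximation)."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return "Other"


def mixed_script_words(text: str) -> list:
    """Return the words that contain both Latin and Cyrillic letters."""
    hits = []
    for word in re.findall(r"[^\W\d_]+", text):
        scripts = {script_of(ch) for ch in word}
        if {"Latin", "Cyrillic"} <= scripts:
            hits.append(word)
    return hits


# "GeekSquad" written with two Cyrillic letters (U+0435) in place of Latin "e".
obfuscated = "G\u0435\u0435kSquad"

# NFKC leaves the Cyrillic look-alike untouched, so normalization alone
# does not map it to the Latin letter.
still_cyrillic = unicodedata.normalize("NFKC", "\u0435") == "\u0435"
```

Here mixed_script_words("Call " + obfuscated + " today") returns only the obfuscated word, while text written entirely in one script returns an empty list. A check like this only flags mixed-script words for scoring; it does not translate them to Latin.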
_______________________________________________ Assp-test mailing list Assp-test@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/assp-test