On 3 Aug 2016, at 6:07, Ruga wrote:
Hello,
We received a new type of spam, twice, and we are not willing to give
them a third chance.
The body includes a long html paragraph (<p>...</p>) of headlines from
the news.
The following works at the command line:
perl -p0e 's/(<p>(?:(?!<\/p>).){999,}<\/p>)/-->$1<--/msig' example.eml
perl -n0e '/(<p>(?:(?!<\/p>).){999,}<\/p>)/msig and print "--->$1<---"' example.eml
The following SA rule, however, does not work at all:
rawbody __B_PLL /<p>(?:(?!<\/p>).){999,}<\/p>/msi
That rule will not hit an unclosed <p> tag.
It may also take a very long time to check some very long messages with
pathological (but not uncommon or spam-only!) HTML, because the {999,}
quantifier is open-ended.
However, I happen to have a message that should match your rule. It
matches in perl directly, with the matched string being a 163301-byte
HTML mess which also happens to be one line when decoded from
quoted-printable, but it does not match as a SA 'rawbody' pattern. It
DOES match if the rule is switched from 'rawbody' to 'full'. It is not
clear to me why that change results in a match. I also constructed a
message where the '</p>' came only 650 characters after the '<p>',
reduced the minimum length from 999 to 300, and that one did match as a
rawbody rule.
It seems that there's something breaking in SA when a 'rawbody' match is
too long. I suspect a logical problem in how SA "chunks" a message for
body and rawbody tests, but I haven't tracked down the details... Looks
like a SA bug to me.
For performance, the ability to match unclosed paragraphs, and working
around that bug, a better solution is:
rawbody __B_PLL /<p>(?:(?!<\/p>).){999}/msi
That will match the first 999 characters of a long HTML paragraph, no
matter how long it is and whether or not it is ever closed. It also gets
around whatever SA bug is blocking the very long matches.
tflags __B_PLL multiple maxhits=1
That is pointless. Why set the "multiple" flag if you're going to set
"maxhits=1"? Without "multiple", a rule already counts at most one hit
per message.
meta B_PLL __B_PLL
describe B_PLL Body: Paragraph Length Limit
score B_PLL 1.0
I assume this is a placeholder for future combination with other rules,
since the 'meta' as it stands is pointless and simply having a paragraph
longer than 999 characters isn't inherently or heuristically spammy.