I am not familiar with MySQL regular expressions. The regex I provided uses Perl syntax. It doesn't use lookarounds.

I can think of two things offhand that might be an issue.

1) Maybe it doesn't like the non-capturing groups.  (?: .... )

Try it without the "?:" on the two groups:
\b((FedEx|Shipment|702193383246|Notification)\b.*?){3}

2) Maybe it doesn't like the non-greedy repetition.   *?

Try this:
\b((FedEx|Shipment|702193383246|Notification)\b.*){3}
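
For a plain match/no-match test, the greedy and non-greedy forms should select the same rows: the quantifier style only changes how much text the match consumes, not whether a match exists. A minimal sketch of that point, assuming Python's `re` module as a stand-in for a Perl-style engine (it says nothing about what MySQL itself accepts); `both_match` and the sample subjects are made up for illustration:

```python
import re

WORDS = "FedEx|Shipment|702193383246|Notification"

# Original form: non-capturing groups, non-greedy repetition.
lazy = re.compile(r"\b(?:(?:" + WORDS + r")\b.*?){3}")
# Fallback form: capturing groups, greedy repetition.
greedy = re.compile(r"\b((" + WORDS + r")\b.*){3}")


def both_match(subject):
    """Return (lazy result, greedy result) as booleans for one subject."""
    return bool(lazy.search(subject)), bool(greedy.search(subject))


if __name__ == "__main__":
    for subject in (
        "FedEx Shipment 722566383641 Notification",
        "A package for you jim",
    ):
        print(subject, both_match(subject))
```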

If neither of those works, try asking on a MySQL mailing list, where people should be more familiar with the syntax.

Keep in mind that this single regex will not do everything you want. It is perfectly capable of matching the string "FedEx FedEx FedEx". You need additional processing to make sure you are matching on 3 unique strings and not just duplicates of one or two.
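
The duplicate problem is easy to reproduce. The sketch below, again in Python rather than MySQL, shows the pattern accepting "FedEx FedEx FedEx" and one possible follow-up check that counts distinct words. The names `THREE_HITS`, `ANY_WORD`, `unique_hits` and `has_three_distinct` are hypothetical helpers, not part of any plugin API:

```python
import re

WORDS = ["FedEx", "Shipment", "702193383246", "Notification"]
ALTERNATION = "|".join(re.escape(w) for w in WORDS)

# The regex from this thread: any three hits, duplicates included.
THREE_HITS = re.compile(r"\b(?:(?:" + ALTERNATION + r")\b.*?){3}")
# Matches one listed word at a time, so every hit can be collected.
ANY_WORD = re.compile(r"\b(?:" + ALTERNATION + r")\b")


def unique_hits(subject):
    """Return the set of distinct listed words found in the subject."""
    return set(ANY_WORD.findall(subject))


def has_three_distinct(subject):
    """True only if at least three different listed words appear."""
    return len(unique_hits(subject)) >= 3


if __name__ == "__main__":
    dup = "FedEx FedEx FedEx"
    print(bool(THREE_HITS.search(dup)), has_three_distinct(dup))
```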

You may also want to make the regex case-insensitive. In Perl, this is done with the "i" modifier ( /regex/i ); I have no idea how to do it with MySQL.

Bowie

On 9/28/2016 12:28 PM, Nicola Piazzi wrote:

This is what I need, Bowie.

The query must be:

select from_address, from_domain, to_address, subject from maillog where subject REGEXP '\b(?:(?:FedEx|Shipment|702193383246|Notification)\b.*?){3}';

But unfortunately MySQL gives an error:

ERROR 1139 (42000): Got error 'repetition-operator operand invalid' from regexp

MySQL regular expressions don't have lookarounds.

Nicola Piazzi
CED - Sistemi
COMET s.p.a.
Via Michelino, 105 - 40127 Bologna – Italia
Tel.  +39 051.6079.293
Cell. +39 328.21.73.470
Web: www.gruppocomet.it <http://www.gruppocomet.it/>

*From:* Bowie Bailey [mailto:bowie_bai...@buc.com]
*Sent:* Wednesday, 28 September 2016 17:46
*To:* users@spamassassin.apache.org
*Subject:* Re: R: R: R: regular expression needed

I don't know of a way to do that with a simple regex. But since you are writing a plugin, you could do it by parsing the output of a regex search.

1) Create a regex which will match on any combination of 3 of the words. This will let you pull all of the possible matches from previous emails.
Something like this: /\b(?:(?:word1|word2|word3|word4)\b.*?){3}/

2) For each of the lines found by the previous regex, run another regex that captures all matched words. /\b(word1|word2|word3|word4)\b/g (note the global modifier to catch all matches)

3) Take a look at the results for each line and see if the regex matched at least 3 unique words.

I'm quite sure that this is not the most efficient method, but it should work.
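
The three steps above can be sketched roughly as follows. This is only an illustration in Python, not plugin code: `build_patterns` and `lines_with_three_unique` are hypothetical names, the input is assumed to be a plain list of subject lines already pulled from the database, and case-insensitive matching is an added assumption:

```python
import re


def build_patterns(words, minimum=3):
    """Build the step-1 prefilter and the step-2 capture regex."""
    alternation = "|".join(re.escape(w) for w in words)
    prefilter = re.compile(
        r"\b(?:(?:" + alternation + r")\b.*?){" + str(minimum) + "}",
        re.IGNORECASE,  # assumption: case should not matter
    )
    capture = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return prefilter, capture


def lines_with_three_unique(words, lines, minimum=3):
    """Keep lines containing at least `minimum` different listed words."""
    prefilter, capture = build_patterns(words, minimum)
    kept = []
    for line in lines:
        # Step 1: cheap filter, which still lets duplicates through.
        if not prefilter.search(line):
            continue
        # Step 2: collect every matched word (like Perl's /g modifier).
        found = {word.lower() for word in capture.findall(line)}
        # Step 3: require enough distinct words.
        if len(found) >= minimum:
            kept.append(line)
    return kept


if __name__ == "__main__":
    words = ["FedEx", "Shipment", "702193383647", "Notification"]
    lines = [
        "FedEx Shipment 722566383641 Notification",
        "FedEx FedEx FedEx",
        "A package for you jim",
    ]
    print(lines_with_three_unique(words, lines))
```

Lower-casing the hits before counting keeps "FedEx" and "fedex" from being counted as two different words.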

Bowie

On 9/28/2016 11:20 AM, Nicola Piazzi wrote:

    Obviously, I meant to write a plugin that searches the DB.

    But I need the regex syntax to match at least 3 of the 4 given
    words.

    Nicola Piazzi

    *From:* Bowie Bailey [mailto:bowie_bai...@buc.com]
    *Sent:* Wednesday, 28 September 2016 17:17
    *To:* Nicola Piazzi <nicola.pia...@gruppocomet.it>; Spamassassin List
    <users@spamassassin.apache.org>
    *Subject:* Re: R: R: regular expression needed

    Please keep list emails on the list.

    I don't think you could do a simple regex match for what you
    want.  As I said previously, this would require a plugin both to
    build the custom regex(s) (or DB query) and to search for the
    previous emails.  You would want to keep the prior email
    information in a database of some sort since doing a search of a
    large text file for every incoming email would probably be too slow.

    Bowie

    On 9/28/2016 10:05 AM, Nicola Piazzi wrote:

        Flow:

        I receive an email with the subject “Federal Express Important
        invoice number 20”.

        The plugin runs a regex search against the maillog database
        over the mail from the last 10 days, and this search matches
        one or more lines.

        So we know that similar mails were received in the past.

        It is normal to receive similar text, but not so normal to
        receive the same subject from different addresses directed to
        different internal users.

        Nicola Piazzi

        *From:* Bowie Bailey [mailto:bowie_bai...@buc.com]
        *Sent:* Wednesday, 28 September 2016 16:01
        *To:* users@spamassassin.apache.org
        *Subject:* Re: R: regular expression needed

        I'm still not clear on exactly what you are trying to do, but
        in order to test anything against previous messages, you will
        need a custom SA plugin and some sort of database to store the
        information about previous emails.  That is beyond my area of
        expertise.

        If you just need a regex to match something, I'd be happy to
        help, but I would need a more explicit description of what you
        are trying to match.

        Bowie

        On 9/28/2016 9:29 AM, Nicola Piazzi wrote:

            Bowie, yours is a manual approach; it works, but it is not
            automated.

            The automation would be a plugin that checks for similar
            words in older messages (for example, 3 of 4 words match).

            Then the plugin checks whether the sender domain and the
            recipient are different.

            *From:* Bowie Bailey [mailto:bowie_bai...@buc.com]
            *Sent:* Wednesday, 28 September 2016 15:26
            *To:* users@spamassassin.apache.org
            *Subject:* Re: regular expression needed

            On 9/28/2016 9:02 AM, Nicola Piazzi wrote:




                We usually receive spam with subjects like these
                examples, in chronological order:

                Subject                                   From               To
                FedEx Shipment 702193383647 Notification  j...@company1.com  s...@mycompany.it
                FedEx Shipment 722566383641 Notification  a...@other.com     a...@mycompany.it
                FedEx Shipment 734563383644 Notification  i...@company1.com  lo...@mycompany.it
                A package for you jim                     b...@cocacola.com  j...@mycompany.it
                A package for you sue                     j...@buster.com    s...@mycompany.it

                These come from viruses that infect different PCs
                around the world, which then send the same spam.

                I want to write a plugin that tests each email and
                gives a penalty to these mails.

                Detection routine:

                A mail arrives.

                The subject is: FedEx Shipment 702193383647 Notification

                I search the maillog table with a regex that MATCHES
                FedEx Shipment 702193383647 Notification and ALSO
                matches FedEx Shipment 722566383641 Notification AND
                FedEx Shipment 734563383644 Notification.

                If it matches, I verify that the FROM DOMAIN IS
                DIFFERENT, and then I verify that the TO ADDRESS IS
                DIFFERENT.

                Now I need a regex syntax that takes all the words
                extracted from the PHRASE FedEx Shipment 734563383644
                Notification and matches if it finds at least 3 of
                the 4 words.

                Can someone help?


            I don't follow exactly what you are trying to do in the
            description above, but for that problem, I would start
            with something like this:

            header  __FEDEX_ADDR  From:addr =~ /\@fedex\.com/
            header  __FEDEX_SUBJ  Subject =~ /FedEx Shipment/
            meta    FEDEX_SPAM    __FEDEX_SUBJ && !__FEDEX_ADDR
            score   FEDEX_SPAM    2.0

            (Off the top of my head and completely untested.  Adjust
            score as required.)

            This will hit any email with "FedEx Shipment" in the
            subject that doesn't come from fedex.com.  Note that it
            will also hit on any legitimate FedEx emails that have
            been forwarded.  You could minimize this by constraining
            the subject match to be at the beginning of the line
            (/^FedEx Shipment/).  This may or may not have an effect
            on spam detection.  You could also do a test for non-FedEx
            URLs in the body rather than looking at the sender.

            You could use a simple subject line test for the "A
            package for you" emails, unless you know of a valid
            delivery service that uses that phrase.

-- Bowie

