If you're running AlmaLinux, then report it to AlmaLinux; they can usually
patch things faster than Red Hat can.

________________________________
From: Gerald Vogt <v...@spamcop.net>
Sent: Monday, June 24, 2024 9:59:35 AM
To: users@spamassassin.apache.org <users@spamassassin.apache.org>
Subject: Re: BayesStore MariaDB on EL9

Hi,

for your information, and for anyone who comes across this problem: I have
opened an issue with Red Hat.

https://issues.redhat.com/browse/RHEL-43418

It will probably be backported, but that may take some time: maybe in 9.5,
possibly later.

We'll see...

Regards,

Gerald

On 19.06.24 08:41, Gerald Vogt wrote:
> On 18.06.24 22:23, Bill Cole wrote:
>> On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
>> Gerald Vogt <v...@spamcop.net>
>> is rumored to have said:
>>
>>> Hi,
>>>
>>> for a test, I have increased the column length of token to binary(32)
>>> and used a test file to import containing a single token.
>>>
>>> This time it went through. However, as I suspected, the token length
>>> is not 5 bytes. Token line from backup:
>>>
>>> t    1    0    1718024618    027121926a
>>>
>>> Hex representation of content in database:
>>>
>>> MariaDB [spamassassin]> select hex(token) from bayes_token\G
>>> *************************** 1. row ***************************
>>> hex(token):
>>> 027121C2926A0000000000000000000000000000000000000000000000000000
>>> 1 row in set (0.000 sec)
>>>
>>> Compared:
>>>
>>> Original 02 71 21    92 6a
>>> Database 02 71 21 C2 92 6A
>>>
>>> C2 92 is the UTF-8 encoding of U+0092, thus basically the token is
>>> written in UTF-8 into the database.
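The byte comparison above can be reproduced with a short sketch. It assumes (this is an interpretation, not something the trace proves) that each raw token byte was treated as a Latin-1 character and then sent as UTF-8; the variable names are made up for illustration.

```python
# Token as it appears in the sa-learn backup line quoted above.
original = bytes.fromhex("027121926a")

# Assumption: each raw byte is reinterpreted as a character
# (byte 0x92 becomes code point U+0092) and the string is then
# transmitted as UTF-8.
as_text = original.decode("latin-1")
reencoded = as_text.encode("utf-8")

print(original.hex(" "))   # 02 71 21 92 6a
print(reencoded.hex(" "))  # 02 71 21 c2 92 6a

# A binary(32) column pads the stored value with zero bytes,
# which accounts for the trailing zeros in the hex(token) output.
padded = reencoded.ljust(32, b"\x00")
print(padded.hex().upper())
```

Only byte values of 0x80 and above grow to two bytes under this re-encoding, so a 5-byte token overflows a binary(5) column only when it contains at least one such byte.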
>>
>> That's odd... What is the character set of the database?
>
> It is standard DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci
> just like the table.
>
>>> Running sa-learn with DBI_TRACE=2 I can also see that it looks like
>>> it actually has the UTF-8 encoding already in there during parameter
>>> binding:
>>>
>>> Binding parameters: INSERT INTO bayes_token
>>>                (id, token, spam_count, ham_count, atime)
>>>                VALUES ('43','^Bq!<U+0092>j','1','0','1718024618')
>>>                ON DUPLICATE KEY UPDATE spam_count =
>>> GREATEST(spam_count + '1', 0),
>>>                                        ham_count = GREATEST(ham_count
>>> + '0', 0),
>>>                                        atime = GREATEST(atime,
>>> '1718024618')
>>>
>>> Thus, I would say it's not an issue with the database.
>>>
>>> Any idea?
>>>
>>> Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.
>>
>> First: upgrade to 4.0.1
>
> Well, it's the RHEL packaged version. I don't really want to upgrade to
> a manually handled version.
>
>> There were substantial changes in how encoding was handled between
>> 3.4.6 and 4.0, and there is a substantial likelihood that any problem
>> with encoding would not occur in 4.0 or later.
>
> Yes, you are right. It works with 4.0.1.
>
> I have looked into the source code and the reason became obvious pretty
> quickly, e.g. the part in _put_token in 3.4.6
>
> https://github.com/apache/spamassassin/blob/4a1fe99da9296364be0c50f02d2a73b5af74207a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L827
>
> compared with this in trunk
>
> https://github.com/apache/spamassassin/blob/8307bb22a7709125ab0f8e94fb7a271461944f61/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L997
>
> 4.0 specifically tags the token as BINARY, while the default is VARCHAR, I
> think. With the default, the token is automatically encoded.
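The difference between binding a value as text and binding it as binary can be illustrated with a stand-in. The sketch below uses Python's built-in sqlite3 module rather than MariaDB or Perl DBI, so it only shows the general effect, not the actual SpamAssassin code path; the table layout and the labels are hypothetical.

```python
import sqlite3

# The 5-byte token from the backup line quoted earlier in the thread.
token = bytes.fromhex("027121926a")

conn = sqlite3.connect(":memory:")
# Hypothetical two-column table, just enough to compare both bindings.
conn.execute("CREATE TABLE bayes_token (label TEXT, token BLOB)")

# Bound as a character string: the driver stores it as UTF-8 text,
# so U+0092 becomes the two bytes C2 92.
conn.execute(
    "INSERT INTO bayes_token VALUES (?, ?)",
    ("as_text", token.decode("latin-1")),
)
# Bound as raw bytes: the value is stored unchanged.
conn.execute(
    "INSERT INTO bayes_token VALUES (?, ?)",
    ("as_binary", token),
)

stored = dict(conn.execute("SELECT label, hex(token) FROM bayes_token"))
print(stored["as_text"])    # 027121C2926A
print(stored["as_binary"])  # 027121926A
```

The same query and the same column give different stored bytes depending only on how the parameter is typed at bind time, which matches the behaviour described for 3.4.6 versus 4.0.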
>
> This was added in
>
> https://github.com/apache/spamassassin/commit/3dd8ea4ff51d50a72212ac8cbb2f6f8d443c3489
>
> I'll open a bug with Red Hat and see if they either upgrade spamassassin
> in EL9 or backport something into 3.4.6.
>
> Just for the fun of it, I have replaced the packaged file with the 4.0.1
> MySQL.pm file and then it works. Looking at the commit and the commit
> history after, I think the 4.0.1 MySQL.pm should work just fine in 3.4.6.
>
> Anyway, we'll see what Red Hat does about this.
>
> Thanks a lot!
>
> Regards,
>
> Gerald
