If you're running AlmaLinux, report it to AlmaLinux as well; they can usually patch things faster than Red Hat can.
________________________________
From: Gerald Vogt <v...@spamcop.net>
Sent: Monday, June 24, 2024 9:59:35 AM
To: users@spamassassin.apache.org <users@spamassassin.apache.org>
Subject: Re: BayesStore MariaDB on EL9

Hi,

for your information and anyone who comes across this problem: I have opened an issue with RedHat.

https://issues.redhat.com/browse/RHEL-43418

It probably will be backported, but may take some time, maybe in 9.5 or possibly later. We'll see...

Regards,

Gerald

On 19.06.24 08:41, Gerald Vogt wrote:
> On 18.06.24 22:23, Bill Cole wrote:
>> On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
>> Gerald Vogt <v...@spamcop.net>
>> is rumored to have said:
>>
>>> Hi,
>>>
>>> for a test, I have increased the column length of token to binary(32)
>>> and used a test file to import containing a single token.
>>>
>>> This time it went through. However, as I suspected, the token length
>>> is not 5 byte. Token line from backup:
>>>
>>> t 1 0 1718024618 027121926a
>>>
>>> Hex representation of content in database:
>>>
>>> MariaDB [spamassassin]> select hex(token) from bayes_token\G
>>> *************************** 1. row ***************************
>>> hex(token):
>>> 027121C2926A0000000000000000000000000000000000000000000000000000
>>> 1 row in set (0.000 sec)
>>>
>>> Compared:
>>>
>>> Original 02 71 21 92 6a
>>> Database 02 71 21 C2 92 6A
>>>
>>> C2 92 is the UTF-8 encoding of U+0092, thus basically the token is
>>> written in UTF-8 into the database.
>>
>> That's odd... What is the character set of the database?
>
> It is standard DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci
> just like the table.
>
>>> Running sa-learn with DBI_TRACE=2 I can also see that it looks like
>>> it actually has the UTF-8 encoding already in there during parameter
>>> binding:
>>>
>>> Binding parameters: INSERT INTO bayes_token
>>>     (id, token, spam_count, ham_count, atime)
>>>     VALUES ('43','^Bq!<U+0092>j','1','0','1718024618')
>>>     ON DUPLICATE KEY UPDATE
>>>         spam_count = GREATEST(spam_count + '1', 0),
>>>         ham_count = GREATEST(ham_count + '0', 0),
>>>         atime = GREATEST(atime, '1718024618')
>>>
>>> Thus, I would say it's not an issue with the database.
>>>
>>> Any idea?
>>>
>>> Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.
>>
>> First: upgrade to 4.0.1
>
> Well, it's the RHEL packaged version. I don't really want to upgrade to
> a manually handled version.
>
>> There were substantial changes in how encoding was handled between
>> 3.4.6 and 4.0, and there is a substantial likelihood that any problem
>> with encoding would not occur in 4.0 or later.
>
> Yes, you are right. It works with 4.0.1.
>
> I have looked into the source code and the reason became obvious pretty
> quickly, e.g. the part in _put_token in 3.4.6
>
> https://github.com/apache/spamassassin/blob/4a1fe99da9296364be0c50f02d2a73b5af74207a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L827
>
> compared with this in trunk
>
> https://github.com/apache/spamassassin/blob/8307bb22a7709125ab0f8e94fb7a271461944f61/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L997
>
> 4.0 does specifically tag the token as BINARY while default is VARCHAR I
> think. Thus, it automatically encodes it.
>
> This was added in
>
> https://github.com/apache/spamassassin/commit/3dd8ea4ff51d50a72212ac8cbb2f6f8d443c3489
>
> I'll open a bug with redhat and see if they either upgrade spamassassin
> in EL9 or backport something into 3.4.6.
>
> Just for the fun of it, I have replaced the packaged file with the 4.0.1
> MySQL.pm file and then it works.
> Looking at the commit and the commit history after, I think the 4.0.1
> MySQL.pm should work just fine in 3.4.6.
>
> Anyway, we'll see what RedHat does about this.
>
> Thanks a lot!
>
> Regards,
>
> Gerald
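
For anyone finding this in the archive later: the byte comparison quoted above can be reproduced with a few lines. This is only a sketch in Python (not the Perl that SpamAssassin uses), and it assumes the token bytes are treated as Latin-1 characters and then re-encoded as UTF-8 somewhere on the way to the server. It does not model what DBI or the MariaDB driver actually do internally; the variable names are made up for illustration.

```python
# Token as it appears in the sa-learn backup line:
#   t 1 0 1718024618 027121926a
raw = bytes.fromhex("027121926a")

# Assumption: each byte is interpreted as one character (Latin-1 style),
# and the resulting string is then sent to the server as UTF-8.
as_text = raw.decode("latin-1")
reencoded = as_text.encode("utf-8")

# The test column was binary(32), which pads short values with 0x00.
stored = reencoded.ljust(32, b"\x00")

print(len(raw), raw.hex())              # original token: 5 bytes
print(len(reencoded), reencoded.hex())  # after re-encoding: 6 bytes
print(stored.hex().upper())             # compare with hex(token) from the database
```

Under that assumption, the byte 0x92 becomes the two bytes C2 92, so the token no longer fits a 5-byte column. That is consistent with the fix described in the quoted message, where 4.0 binds the token as binary instead of as a character string.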