On 18.06.24 22:23, Bill Cole wrote:
On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
Gerald Vogt <v...@spamcop.net>
is rumored to have said:
Hi,
for a test, I have increased the column length of token to binary(32)
and used a test file to import containing a single token.
This time it went through. However, as I suspected, the token length
is not 5 bytes. Token line from backup:
t 1 0 1718024618 027121926a
Hex representation of content in database:
MariaDB [spamassassin]> select hex(token) from bayes_token\G
*************************** 1. row ***************************
hex(token): 027121C2926A0000000000000000000000000000000000000000000000000000
1 row in set (0.000 sec)
Compared:
Original 02 71 21 92 6a
Database 02 71 21 C2 92 6A
C2 92 is the UTF-8 encoding of U+0092, so the byte 0x92 is apparently
treated as a character and the token is written to the database
UTF-8 encoded.
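
To illustrate the comparison, here is a minimal Python sketch (not SpamAssassin code). It assumes that each token byte is interpreted as a Latin-1 character and the resulting string is then serialized as UTF-8; the variable names are only for illustration.

```python
# Token bytes as they appear in the backup line: 02 71 21 92 6a
original = bytes.fromhex("027121926a")

# Assumption: each byte is treated as a Latin-1 character, and the
# resulting string is then serialized as UTF-8 on its way to the database.
as_text = original.decode("latin-1")
stored = as_text.encode("utf-8")

# A binary(32) column pads the value with zero bytes up to its width.
padded = stored.ljust(32, b"\x00")

print("Original", original.hex(" "))
print("Database", stored.hex(" "))
print("Padded  ", padded.hex().upper())
```

Under that assumption the 5-byte token grows to 6 bytes, which would explain why it no longer fits into a 5-byte column.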
That's odd... What is the character set of the database?
It is standard DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci
just like the table.
Running sa-learn with DBI_TRACE=2, I can see that the token already
appears to be UTF-8 encoded at parameter binding:
Binding parameters: INSERT INTO bayes_token
  (id, token, spam_count, ham_count, atime)
  VALUES ('43','^Bq!<U+0092>j','1','0','1718024618')
  ON DUPLICATE KEY UPDATE
    spam_count = GREATEST(spam_count + '1', 0),
    ham_count = GREATEST(ham_count + '0', 0),
    atime = GREATEST(atime, '1718024618')
Thus, I would say it's not an issue with the database.
Any idea?
Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.
First: upgrade to 4.0.1
Well, it's the RHEL packaged version. I don't really want to upgrade to
a manually handled version.
There were substantial changes in how encoding was handled between 3.4.6
and 4.0, and there is a substantial likelihood that any problem with
encoding would not occur in 4.0 or later.
Yes, you are right. It works with 4.0.1.
I have looked into the source code and the reason became obvious pretty
quickly, e.g. the part in _put_token in 3.4.6
https://github.com/apache/spamassassin/blob/4a1fe99da9296364be0c50f02d2a73b5af74207a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L827
compared with this in trunk
https://github.com/apache/spamassassin/blob/8307bb22a7709125ab0f8e94fb7a271461944f61/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L997
4.0 explicitly tags the token as BINARY when binding it, while the
default is VARCHAR, I think. Thus, 3.4.6 treats the token as a character
string and automatically encodes it.
This was added in
https://github.com/apache/spamassassin/commit/3dd8ea4ff51d50a72212ac8cbb2f6f8d443c3489
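
As an analogy only, the effect of binding the same value as text versus as binary can be sketched in Python with the standard sqlite3 module. This is not the Perl DBI code from MySQL.pm, and the table name bayes_token_demo is made up for the example; it merely shows how a text binding gets serialized as UTF-8 while a binary binding keeps the raw bytes.

```python
import sqlite3

token = bytes.fromhex("027121926a")

conn = sqlite3.connect(":memory:")
# Hypothetical demo table, not the real SpamAssassin schema.
conn.execute("CREATE TABLE bayes_token_demo (label TEXT, token BLOB)")

# Bound as a character string: the driver serializes it as UTF-8.
conn.execute(
    "INSERT INTO bayes_token_demo VALUES (?, ?)",
    ("as_text", token.decode("latin-1")),
)
# Bound as binary: the bytes are stored unchanged.
conn.execute(
    "INSERT INTO bayes_token_demo VALUES (?, ?)",
    ("as_binary", token),
)

rows = dict(conn.execute("SELECT label, hex(token) FROM bayes_token_demo"))
for label, hex_value in rows.items():
    print(label, hex_value)
```

The text-bound row comes back one byte longer than the binary-bound row, matching the difference seen in the MariaDB output.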
I'll open a bug with Red Hat and see if they either upgrade spamassassin
in EL9 or backport something into 3.4.6.
Just for the fun of it, I have replaced the packaged file with the 4.0.1
MySQL.pm file and then it works. Looking at the commit and the commit
history after, I think the 4.0.1 MySQL.pm should work just fine in 3.4.6.
Anyway, we'll see what Red Hat does about this.
Thanks a lot!
Regards,
Gerald