Hi all! We used to have postfix 2.6.11 in our systems, which was then updated with no problems to 3.3.2. However, during a benchmark, we realized 3.3.2 was 5 times slower than the version before. Even after disabling all mail filters the slowdown was still the same.
The benchmark consists of a large amount of emails being fetched and entering postfix via the smtpd daemon and then handed over to cyrus via lmtp and socket to be delivered mostly to the same recipient. At first, we thought maybe the concurrency limits have changed, or maybe now there are less processes running, but all the configuration variables regarding this have not changed. The benchmark uses 30000 emails, but generating 20 emails is sufficient to see the bug. After trials and errors and running out of ideas, I decided to bisect the code to see what made this change. It happens that version 2.11.1 behaves as slow as 3.3.2, and version 2.11.0 as fast as 2.6.11. All with the same hardware, same configuration, same load. With version >= 2.11.1, the first emails are received and processed somewhat instantly, after some time it seems that lmtp is not processing anything and just waiting while all emails are fetched, so afterwards it starts delivering to the recipients at the rate of about 2 per second. (See excerpt of the log file) Excerpt of maillog version > 2.11.1: Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 7079468: from=<d...@localdomain.com>, size=356, nrcpt=1 (queue active) Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61: to=<d...@localdomain.com>, orig_to=<dst@localhost>, relay=localdomain.com[/var/ imap/socket/lmtp], delay=0.02, delays=0.01/0/0.01/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694410-1>) Mar 27 14:46:50 localdomain postfix/smtpd[24746]: 718F46A: client=localdomain.com[127.0.0.1] Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6CEFF61: removed Mar 27 14:46:50 localdomain postfix/cleanup[24748]: 718F46A: message- id=<20190327134650.718f...@localdomain.com> Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 718F46A: from=<d...@localdomain.com>, size=356, nrcpt=1 (queue active) Mar 27 14:46:50 localdomain postfix/smtpd[24746]: 7273461: client=localdomain.com[127.0.0.1] Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63: to=<d...@localdomain.com>, orig_to=<dst@localhost>, relay=localdomain.com[/var/ imap/socket/lmtp], delay=0.02, delays=0/0/0.01/0.01, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-24753-1553694410-1>) Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6E8CE63: removed ... ... Mar 27 14:46:50 localdomain postfix/cleanup[24748]: 764A06A: message- id=<20190327134650.764a...@localdomain.com> Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 764A06A: from=<d...@localdomain.com>, size=356, nrcpt=1 (queue active) Mar 27 14:46:50 localdomain postfix/smtpd[24746]: 778DC61: client=localdomain.com[127.0.0.1] Mar 27 14:46:50 localdomain postfix/cleanup[24758]: 778DC61: message- id=<20190327134650.778d...@localdomain.com> Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 778DC61: from=<d...@localdomain.com>, size=356, nrcpt=1 (queue active) Mar 27 14:46:50 localdomain postfix/smtpd[24746]: 786A96B: client=localdomain.com[127.0.0.1] Mar 27 14:46:50 localdomain postfix/cleanup[24758]: 786A96B: message- id=<20190327134650.786a...@localdomain.com> ... ... Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A: to=<d...@localdomain.com>, orig_to=<dst@localhost>, relay=localdomain.com[/var/ imap/socket/lmtp], delay=2, delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-24757-1553694412-1>) Mar 27 14:46:52 localdomain postfix/qmgr[24339]: 764A06A: removed Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61: to=<d...@localdomain.com>, orig_to=<dst@localhost>, relay=localdomain.com[/var/ imap/socket/lmtp], delay=4, delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694414-1>) Mar 27 14:46:54 localdomain postfix/qmgr[24339]: 778DC61: removed ... ... Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72: to=<d...@localdomain.com>, orig_to=<dst@localhost>, relay=localdomain.com[/var/ imap/socket/lmtp], delay=8.1, delays=0/6/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694418-1>) Mar 27 14:46:58 localdomain postfix/qmgr[24339]: 7CFDE72: removed With version <= 2.11.0, the messages are delivered as soon as they arrive. Excerpt of maillog version <= 2.11.0: ... ... Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AA6CC67: removed Mar 27 15:00:35 localdomain postfix/smtpd[26446]: AC80663: client=localdomain.com[127.0.0.1] Mar 27 15:00:35 localdomain postfix/cleanup[26456]: AC80663: message- id=<20190327140035.ac80...@localdomain.com> Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AC80663: from=<d...@localdomain.com>, size=356, nrcpt=1 (queue active) Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AB58261: to=<d...@localdomain.com>, orig_to=<dst@localhost>, relay=localdomain.com[/var/ imap/socket/lmtp], conn_use=2, delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-24769-1553695235-7>) Mar 27 15:00:35 localdomain postfix/smtpd[26446]: AD4EB67: client=localdomain.com[127.0.0.1] Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AB58261: removed Mar 27 15:00:35 localdomain postfix/cleanup[26456]: AD4EB67: message- id=<20190327140035.ad4e...@localdomain.com> Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AD4EB67: from=<d...@localdomain.com>, size=356, nrcpt=1 (queue active) Mar 27 15:00:35 localdomain postfix/smtpd[26446]: disconnect from localdomain.com[127.0.0.1] Mar 27 15:00:35 localdomain postfix/lmtp[26453]: AC80663: to=<d...@localdomain.com>, orig_to=<dst@localhost>, relay=localdomain.com[/var/ imap/socket/lmtp], conn_use=2, delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-26455-1553695235-5>) Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AC80663: removed Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AD4EB67: to=<d...@localdomain.com>, orig_to=<dst@localhost>, relay=localdomain.com[/var/ imap/socket/lmtp], delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-26452-1553695235-7>) Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AD4EB67: removed Going deeper with bisecting the code, I found out that the following line created this behaviour change: diff --git a/postfix/src/smtp/smtp.h b/postfix/src/smtp/smtp.h index 336a4f47..5437088e 100644 --- a/postfix/src/smtp/smtp.h +++ b/postfix/src/smtp/smtp.h @@ -195,7 +195,7 @@ typedef struct SMTP_STATE { STR((state)->iterator->request_nexthop)[0] = 0; \ } -#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop) != 0) +#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop)[0] != 0) /* This changes the result of the conditions in a few places throughout the smtp code. Also, "request_nexthop" can only acquire a value different than 0 during the method "smtp_connect_inet", which probably shouldn't run during a unix socket connection. Questions: (1) Does it mean that the lmtp connections via socket are never cached or reused? (2) Is this expected behaviour? Is this a bug? Other configs and outputs: Excerpt from main.cf: (only the configs that I believe are pertinent) mailbox_transport = lmtp:unix:/var/imap/socket/lmtp lmtp_destination_recipient_limit = 1 local_destination_concurrency_limit = 2 Output from scache after 20 emails: version <= 2.11.0: postfix/scache[26740]: statistics: start interval Mar 27 15:20:50 postfix/scache[26740]: statistics: domain lookup hits=7 miss=8 success=46% postfix/scache[26740]: statistics: max simultaneous domains=1 addresses=1 connection=2 version >= 2.11.1: postfix/scache[29823]: statistics: start interval Mar 27 15:24:59 postfix/scache[29823]: statistics: domain lookup hits=0 miss=17 success=0% postfix/scache[29823]: statistics: max simultaneous domains=0 addresses=1 connection=4 Some help to make sense of all of it would be really high appreciated. Julie.