Re: relayd crashing some times

Kapetanakis Giannis Thu, 06 Jul 2023 03:56:02 -0700

On 06/07/2023 12:59, Alexandr Nedvedicky wrote:
> Hello,
>
>
> On Wed, Jul 05, 2023 at 03:33:36PM +0300, Kapetanakis Giannis wrote:
>> In any case we shouldn't get table statistics for tables that are down.
>>
>> log_debug are just added by for my debugging, but the
>> if (!rdr->table->up) continue;
>> should probably go in.
>>
>> G
>>
>> Index: pfe.c
>> ===================================================================
>> RCS file: /cvs/src/usr.sbin/relayd/pfe.c,v
>> retrieving revision 1.90
>> diff -u -p -r1.90 pfe.c
>> --- pfe.c       14 Sep 2020 11:30:25 -0000      1.90
>> +++ pfe.c       5 Jul 2023 12:27:37 -0000
>> @@ -790,6 +790,12 @@ pfe_statistics(int fd, short events, voi
>>         getmonotime(&tv_now);
>>
>>         TAILQ_FOREACH(rdr, env->sc_rdrs, entry) {
>> +               if (!rdr->table->up) {
>> +                       log_debug("%s: table: %s is down. continuing", 
>> __func__, rdr->conf.name);
>> +                       continue;
>> +               }
>> +               //bilias
>> +               log_debug("%s: table: %s, up: %d id: %d", __func__, 
>> rdr->conf.name, rdr->table->up, rdr->conf.table_id);
>>                 cnt = check_table(env, rdr, rdr->table);
>>                 if (rdr->conf.backup_id != EMPTY_TABLE)
>>                         cnt += check_table(env, rdr, rdr->backup);
>>
>>
> this change might be valid fix. however I'm afraid it will uncover new issues.
> in relayd. It just seems like a tip of iceberg.
>
> I think I've found explanation how table can disappear when host gets
> disabled. I hope someone who is more familiar with relayd can fill
> some details I'm still missing here.
>
> snippet here comes from disable_host() function in pfe.c:
>
> 602         if (host->up == HOST_UP) {
> 603                 if ((table = table_find(env, host->conf.tableid)) == NULL)
> 604                         fatalx("%s: invalid table id", __func__);
> 605                 table->up--;
> 606                 table->conf.flags |= F_CHANGED;
> 607         }
> 608
> 609         host->up = HOST_UNKNOWN;
> ...
> 624         if (!host->conf.parentid) {
> 625                 /* Disable all children */
> 626                 SLIST_FOREACH(h, &host->children, child)
> 627                         disable_host(c, id, h);
> 628                 pfe_sync();
> 629         }
>
>     at line 605 table->up counter may get to 0 if host is the only/last
>     host in table.
>
>     if there is no parent, we are going disable children too calling
>     to pfe_sync() at line 628.
>
> pfe_sync() function updates rdr rules in pf. The logic there is still
> kind of blurry to me.
>
> 687         TAILQ_FOREACH(rdr, env->sc_rdrs, entry) {
> 688                 rdr->conf.flags &= ~(F_BACKUP);
> 689                 rdr->conf.flags &= ~(F_DOWN);
> 690
> 691                 if (rdr->conf.flags & F_DISABLE ||
> 692                     (rdr->table->up == 0 && rdr->backup->up == 0)) {
> 693                         rdr->conf.flags |= F_DOWN;
> 694                         active = NULL;
> ...
> 714                 if (rdr->conf.flags & F_DOWN) {
> 715                         if (rdr->conf.flags & F_ACTIVE_RULESET) {
> 716                                 flush_table(env, rdr);
> 717                                 log_debug("%s: disabling ruleset", 
> __func__);
> 718                                 rdr->conf.flags &= ~(F_ACTIVE_RULESET);
> 719                                 id.id = rdr->conf.id;
> 720                                 imsg.hdr.type = IMSG_CTL_PULL_RULESET;
> 721                                 imsg.hdr.len = sizeof(id) + 
> IMSG_HEADER_SIZE;
> 722                                 imsg.data = &id;
> 723                                 sync_ruleset(env, rdr, 0);
> 724                                 control_imsg_forward(env->sc_ps, &imsg);
> 725                         }
>
> here we walk through the list of redirect rules. at line 693 we mark
> redirect to be down if there is no table and no back up.
>
> if redirect is down and has active rules we are going to flush
> those rules at line 723 by calling sync_ruleset(env, rdr, 0);
> the flush operation done by sync_ruleset() reads as follows:
>
> 332 void
> 333 sync_ruleset(struct relayd *env, struct rdr *rdr, int enable)
> ...
> 352         if (transaction_init(env, anchor) == -1) {
> 353                 log_warn("%s: transaction init failed", __func__);
> 354                 return;
> 355         }
> 356
> 357         if (!enable) {
> 358                 if (transaction_commit(env) == -1)
> 359                         log_warn("%s: remove rules transaction failed",
> 360                             __func__);
> 361                 else
> 362                         log_debug("%s: rules removed", __func__);
> 363                 return;
> 364         }
> 365
>
> as you can see the rules are removed by committing empty ryleset.
> committing empty ruleset makes anchor empty, because commit operation
> removes also tables which are not persistent. One can emulate that by using
> pfctl on command line.
>     we start by loading simple rdr rule which reference table bar:
>
>       pf# cat bar.conf
>       pass in on vio0 from 192.168.3.0/24 to 10.164.0.3 rdr-to <bar>
>       pf# pfctl -a relayd/bar -f bar.conf
>
>     the next step is to check if table bar got created.
>
>       pf# pfctl -a relayd/bar -sT -vg
>       ----r-- bar     relayd/bar
>
>     the 'r' flag means table is referenced only. Such table is
>     not reported unless '-sT' option comes with '-vg'. Now let's
>     add address to bar table and see what happens:
>
>       pf# pfctl -a relayd/bar -t bar -T add 192.168.1.10
>       1/1 addresses added.
>       pf# pfctl -a relayd/bar -sT -vg
>       --a-r-- bar     relayd/bar
>
>     there is 'a' flag which indicates table became active. The active
>     table is visible for '-sT' (show tables):
>
>       pf#  pfctl -a relayd/bar -sT
>       bar
>
>     now let's see what happens if we clear rules from relayd/bar anchor
>     by committing empty ruleset:
>
>       pf# echo '' | pfctl -a relayd/bar -f -
>       pf#  pfctl -a relayd/bar -sT -vg
>       pfctl: Anchor does not exist
>
>     both table and relayd/bar are gone. I thing this might be the mechanism
>     which creates surprise for relayd
>
> on the other hand if I create the bar table first before the
> rule which references the table is loaded, then table is created
> with persistent flag. So the order in which load rules and
> tables does matter:
>
>     pf#  pfctl -a relayd/bar -sT -vg
>     pfctl: Anchor does not exist
>
>     pf# pfctl -a relayd/bar -t bar -T add 192.168.1.10
>     1 table created.
>     1/1 addresses added.
>
>     pf# pfctl -a relayd/bar -sT -vg
>     -pa---- bar     relayd/bar
>
>     pf# pfctl -a relayd/bar -f bar.conf
>     pf# pfctl -a relayd/bar -sT -vg 
>     -pa-r-- bar     relayd/bar
>
>     pf# echo '' | pfctl -a relayd/bar -f -
>     pf# pfctl -a relayd/bar -sT -vg
>     -pa---- bar     relayd/bar
>
> looks like order matters. If we load/create table before rules,
> then table gets its 'persistent' flag. If table is loaded after
> rules, then it loses its persistent flag and may get lost when
> rules are flushed.
>
> can you check table flags on your relay box?
>       pfctl -a relayd/... -sT -vg
> just to see if my speculation is good enough or completely off.
>
> thanks and
> regards
> sashan


Hi sashan,

Your logic seems fine to me. Indeed it acts like they are persistent. They 
don't have the p flag.

# pfctl -a 'relayd/dir-imap' -sT -vg
--a-r-- dir-imap        relayd/dir-imap
# pfctl -a 'relayd/dir-sieve' -sT -vg
--a-r-- dir-sieve       relayd/dir-sieve
# pfctl -a 'relayd/dir-lmtp' -sT -vg
--a-r-- dir-lmtp        relayd/dir-lmtp

note that in pfe_filter.c in init_tables() is see

73: tables[i].pfrt_flags |= PFR_TFLAG_PERSIST;

Giannis
ps. I've send a diff about statistics in tech@
It does more than just checking table->up status, but your approach is much 
better.

Re: relayd crashing some times

Reply via email to