Re: relayd crashing some times

Alexandr Nedvedicky Thu, 06 Jul 2023 03:01:02 -0700

Hello,


On Wed, Jul 05, 2023 at 03:33:36PM +0300, Kapetanakis Giannis wrote:
> In any case we shouldn't get table statistics for tables that are down.
>
> log_debug are just added by for my debugging, but the
> if (!rdr->table->up) continue;
> should probably go in.
>
> G
>
> Index: pfe.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/relayd/pfe.c,v
> retrieving revision 1.90
> diff -u -p -r1.90 pfe.c
> --- pfe.c       14 Sep 2020 11:30:25 -0000      1.90
> +++ pfe.c       5 Jul 2023 12:27:37 -0000
> @@ -790,6 +790,12 @@ pfe_statistics(int fd, short events, voi
>         getmonotime(&tv_now);
>
>         TAILQ_FOREACH(rdr, env->sc_rdrs, entry) {
> +               if (!rdr->table->up) {
> +                       log_debug("%s: table: %s is down. continuing", 
> __func__, rdr->conf.name);
> +                       continue;
> +               }
> +               //bilias
> +               log_debug("%s: table: %s, up: %d id: %d", __func__, 
> rdr->conf.name, rdr->table->up, rdr->conf.table_id);
>                 cnt = check_table(env, rdr, rdr->table);
>                 if (rdr->conf.backup_id != EMPTY_TABLE)
>                         cnt += check_table(env, rdr, rdr->backup);
>
>

this change might be valid fix. however I'm afraid it will uncover new issues.
in relayd. It just seems like a tip of iceberg.

I think I've found explanation how table can disappear when host gets
disabled. I hope someone who is more familiar with relayd can fill
some details I'm still missing here.

snippet here comes from disable_host() function in pfe.c:

602         if (host->up == HOST_UP) {
603                 if ((table = table_find(env, host->conf.tableid)) == NULL)
604                         fatalx("%s: invalid table id", __func__);
605                 table->up--;
606                 table->conf.flags |= F_CHANGED;
607         }
608
609         host->up = HOST_UNKNOWN;
...
624         if (!host->conf.parentid) {
625                 /* Disable all children */
626                 SLIST_FOREACH(h, &host->children, child)
627                         disable_host(c, id, h);
628                 pfe_sync();
629         }

    at line 605 table->up counter may get to 0 if host is the only/last
    host in table.

    if there is no parent, we are going disable children too calling
    to pfe_sync() at line 628.

pfe_sync() function updates rdr rules in pf. The logic there is still
kind of blurry to me.

687         TAILQ_FOREACH(rdr, env->sc_rdrs, entry) {
688                 rdr->conf.flags &= ~(F_BACKUP);
689                 rdr->conf.flags &= ~(F_DOWN);
690
691                 if (rdr->conf.flags & F_DISABLE ||
692                     (rdr->table->up == 0 && rdr->backup->up == 0)) {
693                         rdr->conf.flags |= F_DOWN;
694                         active = NULL;
...
714                 if (rdr->conf.flags & F_DOWN) {
715                         if (rdr->conf.flags & F_ACTIVE_RULESET) {
716                                 flush_table(env, rdr);
717                                 log_debug("%s: disabling ruleset", 
__func__);
718                                 rdr->conf.flags &= ~(F_ACTIVE_RULESET);
719                                 id.id = rdr->conf.id;
720                                 imsg.hdr.type = IMSG_CTL_PULL_RULESET;
721                                 imsg.hdr.len = sizeof(id) + 
IMSG_HEADER_SIZE;
722                                 imsg.data = &id;
723                                 sync_ruleset(env, rdr, 0);
724                                 control_imsg_forward(env->sc_ps, &imsg);
725                         }

here we walk through the list of redirect rules. at line 693 we mark
redirect to be down if there is no table and no back up.

if redirect is down and has active rules we are going to flush
those rules at line 723 by calling sync_ruleset(env, rdr, 0);
the flush operation done by sync_ruleset() reads as follows:

332 void
333 sync_ruleset(struct relayd *env, struct rdr *rdr, int enable)
...
352         if (transaction_init(env, anchor) == -1) {
353                 log_warn("%s: transaction init failed", __func__);
354                 return;
355         }
356
357         if (!enable) {
358                 if (transaction_commit(env) == -1)
359                         log_warn("%s: remove rules transaction failed",
360                             __func__);
361                 else
362                         log_debug("%s: rules removed", __func__);
363                 return;
364         }
365

as you can see the rules are removed by committing empty ryleset.
committing empty ruleset makes anchor empty, because commit operation
removes also tables which are not persistent. One can emulate that by using
pfctl on command line.
    we start by loading simple rdr rule which reference table bar:

        pf# cat bar.conf
        pass in on vio0 from 192.168.3.0/24 to 10.164.0.3 rdr-to <bar>
        pf# pfctl -a relayd/bar -f bar.conf

    the next step is to check if table bar got created.

        pf# pfctl -a relayd/bar -sT -vg
        ----r-- bar     relayd/bar

    the 'r' flag means table is referenced only. Such table is
    not reported unless '-sT' option comes with '-vg'. Now let's
    add address to bar table and see what happens:

        pf# pfctl -a relayd/bar -t bar -T add 192.168.1.10
        1/1 addresses added.
        pf# pfctl -a relayd/bar -sT -vg
        --a-r-- bar     relayd/bar

    there is 'a' flag which indicates table became active. The active
    table is visible for '-sT' (show tables):

        pf#  pfctl -a relayd/bar -sT
        bar

    now let's see what happens if we clear rules from relayd/bar anchor
    by committing empty ruleset:

        pf# echo '' | pfctl -a relayd/bar -f -
        pf#  pfctl -a relayd/bar -sT -vg
        pfctl: Anchor does not exist

    both table and relayd/bar are gone. I thing this might be the mechanism
    which creates surprise for relayd

on the other hand if I create the bar table first before the
rule which references the table is loaded, then table is created
with persistent flag. So the order in which load rules and
tables does matter:

    pf#  pfctl -a relayd/bar -sT -vg
    pfctl: Anchor does not exist

    pf# pfctl -a relayd/bar -t bar -T add 192.168.1.10
    1 table created.
    1/1 addresses added.

    pf# pfctl -a relayd/bar -sT -vg
    -pa---- bar     relayd/bar

    pf# pfctl -a relayd/bar -f bar.conf
    pf# pfctl -a relayd/bar -sT -vg 
    -pa-r-- bar     relayd/bar

    pf# echo '' | pfctl -a relayd/bar -f -
    pf# pfctl -a relayd/bar -sT -vg
    -pa---- bar     relayd/bar

looks like order matters. If we load/create table before rules,
then table gets its 'persistent' flag. If table is loaded after
rules, then it loses its persistent flag and may get lost when
rules are flushed.

can you check table flags on your relay box?
        pfctl -a relayd/... -sT -vg
just to see if my speculation is good enough or completely off.

thanks and
regards
sashan

Re: relayd crashing some times

Reply via email to