Hello,

I will try your diff, but since I have to completely turn off the mail service 
to do so, it might take a while.

Meanwhile, just a wild guess from my side, although I'm not a dev:

It seems to me that a table is being removed, specifically the table that has 
the hosts for the redirect.
It's like after some active sessions expire (1-2 min delay), the table is being 
removed as if it's not persistent. Why was the table removed in the first 
place? Maybe because there was no active host inside that table (table empty).

Then statistics are requested for that table, and the process exits since the 
table is not there.

If that's the case, then it should indeed not call statistics on the disabled 
table.
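
To make the idea concrete, here is a rough sketch in C. It is not relayd code: 
struct table, get_table_stats() and check_table_safe() are hypothetical names 
I made up, and get_table_stats() only imitates a stats lookup that fails with 
ENOENT ("No such file or directory") when the table is gone. The point is just 
to skip a disabled or missing table instead of treating it as fatal.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for a table as the daemon tracks it. */
struct table {
	const char	*name;
	int		 enabled;	/* 0 once the table has been disabled */
	int		 present;	/* 0 once the table no longer exists */
};

/*
 * Hypothetical stand-in for the stats lookup: fails with ENOENT
 * when the table is gone, like the log message in this thread.
 */
static int
get_table_stats(const struct table *t, unsigned long *cnt)
{
	if (!t->present) {
		errno = ENOENT;
		return (-1);
	}
	*cnt = 42;	/* dummy counter value */
	return (0);
}

/*
 * Returns 1 if stats were read, 0 if the table was skipped, and
 * -1 on any other error (where real code might still call fatal()).
 */
int
check_table_safe(const struct table *t, unsigned long *cnt)
{
	assert(t != NULL && cnt != NULL);

	if (!t->enabled)
		return (0);	/* disabled: do not ask for stats at all */

	if (get_table_stats(t, cnt) == -1) {
		if (errno == ENOENT) {
			fprintf(stderr, "table %s is gone, skipping\n",
			    t->name);
			return (0);
		}
		return (-1);
	}
	return (1);
}
```

Whether skipping is actually the right behaviour in relayd, as opposed to 
stopping the periodic statistics altogether, is for the developers to judge.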

regards,

Giannis
PS. Without the new diff, I cannot reproduce the problem if the load balancer 
does not have active sessions on the redirect when I disable it. It also does 
not happen on the backup load balancer.

On 03/07/2023 19:18, Alexandr Nedvedicky wrote:
> Hello,
>
> I took a closer look at relayd logs you obtained and tried to locate
> the origin of those messages in source code. The exact mechanism is
> still a mystery to me.
>
> I'll start with a brief summary:
>
>> Jun 30 01:47:46 ll1 relayd[61766]: pfe: check_table: cannot get table stats: No such file or directory
>     line above is what pfe process reports right before it exits:
>
>         if (ioctl(env->sc_pf->dev, DIOCRGETTSTATS, &io) == -1)
>                 fatal("%s: cannot get table stats for %s@%s", __func__,
>                     io.pfrio_table.pfrt_name, io.pfrio_table.pfrt_anchor);
>
>     snippet above comes from pfe_filter.c:check_table() function,
>     which is called from pfe.c:pfe_statistics(). pfe_statistics()
>     function is being called periodically from a timer. I've noticed
>     there is a function pfe_disable_events() which should disable
>     the timer (stops calls to pfe_statistics()). However
>     pfe_disable_events() is unused; we never call it.
>
>> Jun 30 01:45:59 ll1 relayd[52103]: incremented the demote state of group '0relay'
>     line above comes from carp_demote_ioctl() here:
>
>         else
>                 log_info("%s the demote state of group '%s'",
>                     (demote > 0) ? "incremented" : "decremented", group);
>
>     carp_demote_ioctl() is being called from carp_demote_shutdown() which
>     itself is being called from parent_shutdown():
>
> void
> parent_shutdown(struct relayd *env)
> {
>         config_purge(env, CONFIG_ALL);
>
>         proc_kill(env->sc_ps);
>         control_cleanup(&env->sc_ps->ps_csock);
>         carp_demote_shutdown();
>
>         free(env->sc_ps);
>         free(env);
>
>         log_info("parent terminating, pid %d", getpid());
>
>         exit(0);
> }
>
>     the relayd parent process is going to exit peacefully anyway. The parent
>     exit is confirmed by those lines in the log:
>> Jun 30 01:47:46 ll1 relayd[52103]: decremented the demote state of group '0relay'
>> Jun 30 01:47:46 ll1 relayd[52103]: parent terminating, pid 52103
>     the only way we could arrive at the parent_shutdown() function
>     is after receiving IMSG_CTL_SHUTDOWN, which is sent on behalf
>     of the command 'relayctl stop'. I have no idea where that command
>     got called from.
>
>     anyway, I suspect we must disable the periodic event in the `pfe`
>     process to avoid an unexpected exit via the call to fatal().
>
> can you give the diff below a try?
>
> thanks and
> regards
> sashan
>
> --------8<---------------8<---------------8<------------------8<--------
> diff --git a/usr.sbin/relayd/pfe.c b/usr.sbin/relayd/pfe.c
> index 3a97b749c4b..ad9c9cdc0cc 100644
> --- a/usr.sbin/relayd/pfe.c
> +++ b/usr.sbin/relayd/pfe.c
> @@ -93,6 +93,7 @@ pfe_init(struct privsep *ps, struct privsep_proc *p, void *arg)
>  void
>  pfe_shutdown(void)
>  {
> +     pfe_disable_events();
>       flush_rulesets(env);
>       config_purge(env, CONFIG_ALL);
>  }
