Hello,

I'm not familiar enough with relayd, so perhaps other folks here can
suggest a better way to troubleshoot the issue.

On Fri, Jun 30, 2023 at 11:10:44AM +0300, Kapetanakis Giannis wrote:
> Hello,
>
> This happened to me twice.
> OpenBSD 7.3 with syspatches.
>
> I have a pair of carp/pfsync/pf/relayd firewall-load balancers with many
> redirects (only) on them.
>
> I wanted to do maintenance of some hosts bellow load balancers.
> After a while relayd crashed on Master firewall only.

When you say crash, do you mean that relayd was terminated by the system
because of a memory/stack/program violation? If that is the case, is there
any chance to collect a core file? Or was it rather a voluntary exit, where
relayd called its fatal() function?

The 'No such file or directory' error code returned by the DIOCRGETTSTATS
ioctl() comes from line 1746 in sys/net/pf_table.c:

    1731 int
    1732 pfr_get_tstats(struct pfr_table *filter, struct pfr_tstats *tbl, int *size,
    1733         int flags)
    1734 {
    1735         struct pfr_ktable       *p;
    1736         struct pfr_ktableworkq   workq;
    1737         int                      n, nn;
    1738         time_t                   tzero = gettime();
    1739
    1740         /* XXX PFR_FLAG_CLSTATS disabled */
    1741         ACCEPT_FLAGS(flags, PFR_FLAG_ALLRSETS);
    1742         if (pfr_fix_anchor(filter->pfrt_anchor))
    1743                 return (EINVAL);
    1744         n = nn = pfr_table_count(filter, flags);
    1745         if (n < 0)
    1746                 return (ENOENT);

The pfr_table_count() function fails if and only if the desired ruleset
(the anchor named in the request) does not exist:

    2177 int
    2178 pfr_table_count(struct pfr_table *filter, int flags)
    2179 {
    2180         struct pf_ruleset *rs;
    2181
    2182         if (flags & PFR_FLAG_ALLRSETS)
    2183                 return (pfr_ktable_cnt);
    2184         if (filter->pfrt_anchor[0]) {
    2185                 rs = pf_find_ruleset(filter->pfrt_anchor);
    2186                 return ((rs != NULL) ? rs->tables : -1);
    2187         }
    2188         return (pf_main_ruleset.tables);
    2189 }

I wonder if it would help to adjust the fatal() call in relayd so that it
also reports the table name and the anchor it is trying to find. A diff
which adjusts the call to fatal() is below. If you don't want to build the
whole tree and prefer an in-place build, you will need to adjust CFLAGS
and LDFLAGS.
Something like this will be needed:

    cd /path/to/your/src/usr.sbin/relayd
    export CFLAGS='-I/path/to/your/src/sys -I/path/to/your/src/lib/libutil'
    export LDFLAGS='-L /path/to/your/src/lib/libutil'
    make

</snip>
>
> same logs on Backup firewall so far, but after a minute or so:
>
> Jun 30 01:47:46 ll1 relayd[61766]: pfe: check_table: cannot get table stats:
> No such file or directory

This is where I'd like to see which table relayd is trying to look up.
Process 61766 then exits through the exit(1) call made by fatal().

> Jun 30 01:47:46 ll1 relayd[94434]: ca exiting, pid 94434
> Jun 30 01:47:46 ll1 relayd[83189]: ca exiting, pid 83189
> Jun 30 01:47:46 ll1 relayd[9023]: ca exiting, pid 9023
> Jun 30 01:47:46 ll1 relayd[89820]: ca exiting, pid 89820
> Jun 30 01:47:46 ll1 relayd[94676]: ca exiting, pid 94676
> Jun 30 01:47:46 ll1 relayd[1820]: hce exiting, pid 1820
> Jun 30 01:47:46 ll1 relayd[52103]: lost child: pid 61766 exited abnormally

The parent relayd process noticed that the child called exit(1) because it
could not find the table.

Once you are able to run the patched relayd, can you try to reproduce the
issue? It will also help if you collect some additional data:

    pfctl -vsA > anchors-before
    # reproduce the issue, wait for relayd to exit/crash
    pfctl -vsA > anchors-after

That data, together with the output from the adjusted call to fatal(),
should help us better understand what's going on.

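To compare the two snapshots, something along these lines might do. It is
only a sketch: it assumes each snapshot file holds one anchor name per line
(possibly indented), and the function name missing_anchors is made up for
illustration.

```shell
#!/bin/sh
# Print anchors that appear in the first snapshot but not in the second.
# Assumption: one anchor name per line; leading whitespace is stripped
# before the two lists are compared.
missing_anchors() {
	before=$1
	after=$2
	tmp_before=$(mktemp) || return 1
	tmp_after=$(mktemp) || { rm -f "$tmp_before"; return 1; }

	# comm(1) needs sorted input, so normalize and sort both lists first.
	sed 's/^[[:space:]]*//' "$before" | sort -u > "$tmp_before"
	sed 's/^[[:space:]]*//' "$after" | sort -u > "$tmp_after"

	# -23 suppresses lines unique to the second file and lines common
	# to both, leaving only the anchors that disappeared.
	comm -23 "$tmp_before" "$tmp_after"

	rm -f "$tmp_before" "$tmp_after"
}
```

Running `missing_anchors anchors-before anchors-after` would then list any
anchor that went away between the two snapshots, which is the case where
pfr_table_count() returns -1.
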
thanks for your help

regards
sashan

--------8<---------------8<---------------8<------------------8<--------
diff --git a/usr.sbin/relayd/pfe_filter.c b/usr.sbin/relayd/pfe_filter.c
index 347048ece56..e1ae050b768 100644
--- a/usr.sbin/relayd/pfe_filter.c
+++ b/usr.sbin/relayd/pfe_filter.c
@@ -632,7 +632,8 @@ check_table(struct relayd *env, struct rdr *rdr, struct table *table)
 		goto toolong;
 
 	if (ioctl(env->sc_pf->dev, DIOCRGETTSTATS, &io) == -1)
-		fatal("%s: cannot get table stats", __func__);
+		fatal("%s: cannot get table stats for %s@%s", __func__,
+		    io.pfrio_table.pfrt_name, io.pfrio_table.pfrt_anchor);
 
 	return (tstats.pfrts_match);