On Fri, Jun 30, 2023 at 11:57:06AM +0200, Alexandr Nedvedicky wrote:
> Hello,
>
> I'm not familiar enough with relayd, so perhaps other folks
> here might provide a better way to troubleshoot the issue.
>
> On Fri, Jun 30, 2023 at 11:10:44AM +0300, Kapetanakis Giannis wrote:
> > Hello,
> >
> > This happened to me twice.
> > OpenBSD 7.3 with syspatches.
> >
> > I have a pair of carp/pfsync/pf/relayd firewall-load balancers with many
> > redirects (only) on them.
> >
> > I wanted to do maintenance of some hosts below the load balancers.
> > After a while relayd crashed on Master firewall only.
>
> when you say crash: does it mean relayd was terminated
> by the system because of a memory/stack/program violation?
> if that is the case, is there any chance to collect a core file?
>
> or was it rather a voluntary exit, where relayd called its function fatal()?
>
> the 'No such file or directory' error, returned by the DIOCRGETTSTATS
> ioctl(), comes from line 1746 in sys/net/pf_table.c:
>
> 1731 int
> 1732 pfr_get_tstats(struct pfr_table *filter, struct pfr_tstats *tbl, int *size,
> 1733     int flags)
> 1734 {
> 1735         struct pfr_ktable *p;
> 1736         struct pfr_ktableworkq workq;
> 1737         int n, nn;
> 1738         time_t tzero = gettime();
> 1739
> 1740         /* XXX PFR_FLAG_CLSTATS disabled */
> 1741         ACCEPT_FLAGS(flags, PFR_FLAG_ALLRSETS);
> 1742         if (pfr_fix_anchor(filter->pfrt_anchor))
> 1743                 return (EINVAL);
> 1744         n = nn = pfr_table_count(filter, flags);
> 1745         if (n < 0)
> 1746                 return (ENOENT);
>
>
> the pfr_table_count() function fails if and only if the desired ruleset
> does not exist.
>
> 2177 int
> 2178 pfr_table_count(struct pfr_table *filter, int flags)
> 2179 {
> 2180         struct pf_ruleset *rs;
> 2181
> 2182         if (flags & PFR_FLAG_ALLRSETS)
> 2183                 return (pfr_ktable_cnt);
> 2184         if (filter->pfrt_anchor[0]) {
> 2185                 rs = pf_find_ruleset(filter->pfrt_anchor);
> 2186                 return ((rs != NULL) ? rs->tables : -1);
> 2187         }
> 2188         return (pf_main_ruleset.tables);
> 2189 }
>
> I wonder if it would help to adjust the fatal() line in relayd
> to also capture the table name and the anchor it is trying to find.
> a diff which adjusts the call to fatal() is below.
>
> if you don't want to build the whole tree and would rather do an
> in-place build, you will need to adjust CFLAGS and LDFLAGS. Something
> like this will be needed:
>
>     cd /path/to/your/src/usr.sbin/relayd
>     export CFLAGS='-I/path/to/your/src/sys -I/path/to/your/src/lib/libutil'
>     export LDFLAGS='-L /path/to/your/src/lib/libutil'
>     make
>
>
> </snip>
> >
> >
> > same logs on Backup firewall so far, but after a minute or so:
> >
> > Jun 30 01:47:46 ll1 relayd[61766]: pfe: check_table: cannot get table stats: No such file or directory
>
> this is where I'd like to see what table relayd is trying
> to look up. The process 61766 then exits by calling `exit(1)`
> from function fatal()
>
> > Jun 30 01:47:46 ll1 relayd[94434]: ca exiting, pid 94434
> > Jun 30 01:47:46 ll1 relayd[83189]: ca exiting, pid 83189
> > Jun 30 01:47:46 ll1 relayd[9023]: ca exiting, pid 9023
> > Jun 30 01:47:46 ll1 relayd[89820]: ca exiting, pid 89820
> > Jun 30 01:47:46 ll1 relayd[94676]: ca exiting, pid 94676
> > Jun 30 01:47:46 ll1 relayd[1820]: hce exiting, pid 1820
> > Jun 30 01:47:46 ll1 relayd[52103]: lost child: pid 61766 exited abnormally
>
> the parent relayd process noticed the child took exit(1)
> because it could not find the table.
>
> once you are able to run the patched relayd, can you try to reproduce
> the issue?
>
> it will also help if you collect additional data:
>
>     pfctl -vsA > anchors-before
>     # reproduce the issue, wait for relayd to exit/crash
>     pfctl -vsA > anchors-after
>
> that data, together with the output from the adjusted call
> to fatal(), should help us to better understand
> what's going on.
>
> thanks for your help
> regards
> sashan
>
> --------8<---------------8<---------------8<------------------8<--------
> diff --git a/usr.sbin/relayd/pfe_filter.c b/usr.sbin/relayd/pfe_filter.c
> index 347048ece56..e1ae050b768 100644
> --- a/usr.sbin/relayd/pfe_filter.c
> +++ b/usr.sbin/relayd/pfe_filter.c
> @@ -632,7 +632,8 @@ check_table(struct relayd *env, struct rdr *rdr, struct table *table)
>                  goto toolong;
>
>          if (ioctl(env->sc_pf->dev, DIOCRGETTSTATS, &io) == -1)
> -                fatal("%s: cannot get table stats", __func__);
> +                fatal("%s: cannot get table stats for %s@%s", __func__,
> +                    io.pfrio_table.pfrt_name, io.pfrio_table.pfrt_anchor);
>
>          return (tstats.pfrts_match);
>
I agree printing this info is useful.
OK claudio@ to improve the error message.
-- 
:wq Claudio