Probably the latest libutil, because of ibuf_data(). I will test and report.

thanx,

G

On 30/06/2023 13:31, Kapetanakis Giannis wrote:
> The program is not terminated by the system.
>
> It indeed exits from that fatal() function in check_table()
>
> I will add the table print and try to reproduce at some time. It might take a
> while because I have to completely shut down our mail to test this.
>
> Will report back.
>
> Does it need to be built with the latest libutil, or can I compile with the
> system's version?
>
> G
>
> On 30/06/2023 12:57, Alexandr Nedvedicky wrote:
>> Hello,
>>
>> I'm not familiar enough with relayd, so perhaps other folks
>> here might provide a better way to troubleshoot the issue.
>>
>> On Fri, Jun 30, 2023 at 11:10:44AM +0300, Kapetanakis Giannis wrote:
>>> Hello,
>>>
>>> This happened to me twice.
>>> OpenBSD 7.3 with syspatches.
>>>
>>> I have a pair of carp/pfsync/pf/relayd firewall-load balancers with many
>>> redirects (only) on them.
>>>
>>> I wanted to do maintenance of some hosts below the load balancers.
>>> After a while relayd crashed on Master firewall only.
>> when you say crash: does it mean that relayd was terminated
>> by the system because of a memory/stack/program violation?
>> if that is the case, is there any chance to collect a core file?
>>
>> or was it rather a voluntary exit, when relayd called its function fatal()?
>>
>> the 'No such file or directory' error, which is returned by the
>> DIOCRGETTSTATS ioctl(), comes from line 1746 in sys/net/pf_table.c:
>>
>>     1731 int
>>     1732 pfr_get_tstats(struct pfr_table *filter, struct pfr_tstats *tbl, int *size,
>>     1733         int flags)
>>     1734 {
>>     1735         struct pfr_ktable       *p;
>>     1736         struct pfr_ktableworkq   workq;
>>     1737         int                      n, nn;
>>     1738         time_t                   tzero = gettime();
>>     1739
>>     1740         /* XXX PFR_FLAG_CLSTATS disabled */
>>     1741         ACCEPT_FLAGS(flags, PFR_FLAG_ALLRSETS);
>>     1742         if (pfr_fix_anchor(filter->pfrt_anchor))
>>     1743                 return (EINVAL);
>>     1744         n = nn = pfr_table_count(filter, flags);
>>     1745         if (n < 0)
>>     1746                 return (ENOENT);
>>
>>
>> the pfr_table_count() function fails if and only if the desired ruleset
>> does not exist.
>>
>>     2177 int
>>     2178 pfr_table_count(struct pfr_table *filter, int flags)
>>     2179 {
>>     2180         struct pf_ruleset *rs;
>>     2181
>>     2182         if (flags & PFR_FLAG_ALLRSETS)
>>     2183                 return (pfr_ktable_cnt);
>>     2184         if (filter->pfrt_anchor[0]) {
>>     2185                 rs = pf_find_ruleset(filter->pfrt_anchor);
>>     2186                 return ((rs != NULL) ? rs->tables : -1);
>>     2187         }
>>     2188         return (pf_main_ruleset.tables);
>>     2189 }
>>
>> I wonder if it would help to adjust the fatal() line in relayd
>> to also capture the table name and anchor it is trying to find.
>> a diff which adjusts the call to fatal() is below.
>>
>> if you don't want to build the whole tree and do an in-place
>> build, you will need to adjust CFLAGS and LDFLAGS.
>> Something like this will be needed:
>>
>>     cd /path/to/your/src/usr.sbin/relayd
>>     export CFLAGS='-I/path/to/your/src/sys -I/path/to/your/src/lib/libutil'
>>     export LDFLAGS='-L /path/to/your/src/lib/libutil'
>>     make
>>
>>
>> </snip>
>>
>>> same logs on Backup firewall so far, but after a minute or so:
>>>
>>> Jun 30 01:47:46 ll1 relayd[61766]: pfe: check_table: cannot get table stats: No such file or directory
>> this is where I'd like to see what table relayd is trying
>> to look up. The process 61766 then exits by calling `exit(1)`
>> from function fatal()
>>
>>> Jun 30 01:47:46 ll1 relayd[94434]: ca exiting, pid 94434
>>> Jun 30 01:47:46 ll1 relayd[83189]: ca exiting, pid 83189
>>> Jun 30 01:47:46 ll1 relayd[9023]: ca exiting, pid 9023
>>> Jun 30 01:47:46 ll1 relayd[89820]: ca exiting, pid 89820
>>> Jun 30 01:47:46 ll1 relayd[94676]: ca exiting, pid 94676
>>> Jun 30 01:47:46 ll1 relayd[1820]: hce exiting, pid 1820
>>> Jun 30 01:47:46 ll1 relayd[52103]: lost child: pid 61766 exited abnormally
>> the parent relayd process noticed the child took exit(1)
>> because it could not find the table.
>>
>> once you are able to run the patched relayd, can you try to reproduce
>> the issue?
>>
>> it will also help if you collect additional data:
>>
>>     pfctl -vsA > anchors-before
>>     # reproduce the issue, wait for relayd to exit/crash
>>     pfctl -vsA > anchors-after
>>
>> those data, together with the output from the adjusted call
>> to fatal(), should help us to better understand
>> what's going on.
>>
>> thanks for your help
>> regards
>> sashan
>>
>> --------8<---------------8<---------------8<------------------8<--------
>> diff --git a/usr.sbin/relayd/pfe_filter.c b/usr.sbin/relayd/pfe_filter.c
>> index 347048ece56..e1ae050b768 100644
>> --- a/usr.sbin/relayd/pfe_filter.c
>> +++ b/usr.sbin/relayd/pfe_filter.c
>> @@ -632,7 +632,8 @@ check_table(struct relayd *env, struct rdr *rdr, struct table *table)
>>  		goto toolong;
>>
>>  	if (ioctl(env->sc_pf->dev, DIOCRGETTSTATS, &io) == -1)
>> -		fatal("%s: cannot get table stats", __func__);
>> +		fatal("%s: cannot get table stats for %s@%s", __func__,
>> +		    io.pfrio_table.pfrt_name, io.pfrio_table.pfrt_anchor);
>>
>>  	return (tstats.pfrts_match);
>>
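
For anyone following along, here is a small userland sketch of the error path
described in the quoted analysis: a lookup for an anchor that no longer exists
yields -1 from the table count, which the caller turns into ENOENT. It is only
an illustration, not the kernel code. The struct, the sketch_* function names
and the anchor name "relayd/www" are all made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a pf ruleset; not the real kernel type. */
struct sketch_ruleset {
	const char	*anchor;	/* anchor path */
	int		 tables;	/* tables attached to this anchor */
};

/* Hypothetical set of anchors that currently exist. */
const struct sketch_ruleset sketch_rulesets[] = {
	{ "relayd/www", 2 },
};

/*
 * Simplified analogue of pfr_table_count(): returns the table count for
 * the anchor, or -1 when no ruleset exists for a non-empty anchor name.
 */
int
sketch_table_count(const char *anchor, int main_tables)
{
	size_t	 i, n = sizeof(sketch_rulesets) / sizeof(sketch_rulesets[0]);

	assert(anchor != NULL);
	if (anchor[0] == '\0')
		return (main_tables);	/* empty anchor: main ruleset */
	for (i = 0; i < n; i++)
		if (strcmp(sketch_rulesets[i].anchor, anchor) == 0)
			return (sketch_rulesets[i].tables);
	return (-1);
}

/*
 * Simplified analogue of the start of pfr_get_tstats(): a negative
 * count (missing anchor) is reported as ENOENT.
 */
int
sketch_get_tstats(const char *anchor)
{
	if (sketch_table_count(anchor, 0) < 0)
		return (ENOENT);
	return (0);
}
```

Calling sketch_get_tstats() with an anchor that is not in the list returns
ENOENT, and strerror(ENOENT) is the "No such file or directory" text seen in
the log line. So the message most likely points at a missing anchor, not at a
missing file on disk.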