Re: relayd crashing some times

Kapetanakis Giannis Fri, 30 Jun 2023 03:41:44 -0700

Probably the latest libutil, because of ibuf_data().

will test and report.


thanx,

G

On 30/06/2023 13:31, Kapetanakis Giannis wrote:
> The program is not terminated by the system.
>
> It does indeed exit via the fatal() call in check_table().
>
> I will add the table name to the message and try to reproduce it at some
> point. It might take a while because I have to shut down our mail service
> completely to test this.
>
> Will report back.
>
> Does it need to be built against the latest libutil, or can I compile with
> the system's version?
>
> G
>
> On 30/06/2023 12:57, Alexandr Nedvedicky wrote:
>> Hello,
>>
>> I'm not familiar enough with relayd, so perhaps other folks
>> here might provide better way to troubleshoot the issue.
>>
>> On Fri, Jun 30, 2023 at 11:10:44AM +0300, Kapetanakis Giannis wrote:
>>> Hello,
>>>
>>> This happened to me twice.
>>> OpenBSD 7.3 with syspatches.
>>>
>>> I have a pair of carp/pfsync/pf/relayd firewall-load balancers with many 
>>> redirects (only) on them.
>>>
>>> I wanted to do maintenance on some hosts behind the load balancers.
>>> After a while relayd crashed on Master firewall only.
>>     when you say crash: does it mean relayd was terminated
>>     by the system because of a memory/stack/program violation?
>>     if that is the case, is there any chance to collect a core file?
>>
>>     or was it rather a voluntary exit, where relayd called its fatal() function?
>>
>>     the 'No such file or directory' error (ENOENT) returned by the
>>     DIOCRGETTSTATS ioctl() comes from line 1746 in sys/net/pf_table.c:
>>
>> 1731 int
>> 1732 pfr_get_tstats(struct pfr_table *filter, struct pfr_tstats *tbl, int *size,
>> 1733         int flags)
>> 1734 {
>> 1735         struct pfr_ktable       *p;
>> 1736         struct pfr_ktableworkq   workq;
>> 1737         int                      n, nn;
>> 1738         time_t                   tzero = gettime();
>> 1739
>> 1740         /* XXX PFR_FLAG_CLSTATS disabled */
>> 1741         ACCEPT_FLAGS(flags, PFR_FLAG_ALLRSETS);
>> 1742         if (pfr_fix_anchor(filter->pfrt_anchor))
>> 1743                 return (EINVAL);
>> 1744         n = nn = pfr_table_count(filter, flags);
>> 1745         if (n < 0)
>> 1746                 return (ENOENT);
>>
>>
>>     the pfr_table_count() function fails if and only if the desired ruleset
>>     does not exist.
>>
>> 2177 int
>> 2178 pfr_table_count(struct pfr_table *filter, int flags)
>> 2179 {
>> 2180         struct pf_ruleset *rs;
>> 2181
>> 2182         if (flags & PFR_FLAG_ALLRSETS)
>> 2183                 return (pfr_ktable_cnt);
>> 2184         if (filter->pfrt_anchor[0]) {
>> 2185                 rs = pf_find_ruleset(filter->pfrt_anchor);
>> 2186                 return ((rs != NULL) ? rs->tables : -1);
>> 2187         }
>> 2188         return (pf_main_ruleset.tables);
>> 2189 }
>>
>>     I wonder if it would help to adjust the fatal() call in relayd
>>     to also report the table name and anchor it is trying to find.
>>     a diff which adjusts the call to fatal() is below.
>>
>>     if you don't want to build the whole tree and prefer an in-place
>>     build, you will need to adjust CFLAGS and LDFLAGS. Something
>>     like this will be needed:
>>
>>      cd /path/to/your/src/usr.sbin/relayd
>>      export CFLAGS='-I/path/to/your/src/sys -I/path/to/your/src/lib/libutil'
>>      export LDFLAGS='-L /path/to/your/src/lib/libutil'
>>      make
>>
>>
>> </snip>
>>
>>> same logs on Backup firewall so far, but after a minute or so:
>>>
>>> Jun 30 01:47:46 ll1 relayd[61766]: pfe: check_table: cannot get table 
>>> stats: No such file or directory
>>     this is where I'd like to see which table relayd is trying
>>     to look up. The process 61766 then exits via `exit(1)`,
>>     called from fatal()
>>
>>> Jun 30 01:47:46 ll1 relayd[94434]: ca exiting, pid 94434
>>> Jun 30 01:47:46 ll1 relayd[83189]: ca exiting, pid 83189
>>> Jun 30 01:47:46 ll1 relayd[9023]: ca exiting, pid 9023
>>> Jun 30 01:47:46 ll1 relayd[89820]: ca exiting, pid 89820
>>> Jun 30 01:47:46 ll1 relayd[94676]: ca exiting, pid 94676
>>> Jun 30 01:47:46 ll1 relayd[1820]: hce exiting, pid 1820
>>> Jun 30 01:47:46 ll1 relayd[52103]: lost child: pid 61766 exited abnormally
>>     the parent relayd process noticed that the child called exit(1)
>>     because it could not find the table.
>>
>>     once you are able to run the patched relayd, can you try to reproduce
>>     the issue?
>>
>>     it will also help if you collect some additional data.
>>
>>      pfctl -vsA > anchors-before
>>      # reproduce the issue, wait for relayd to exit/crash
>>      pfctl -vsA > anchors-after
>>
>>     that data, together with the output from the adjusted call
>>     to fatal(), should help us better understand
>>     what's going on.
>>
>> thanks for your help
>> regards
>> sashan
>>
>> --------8<---------------8<---------------8<------------------8<--------
>> diff --git a/usr.sbin/relayd/pfe_filter.c b/usr.sbin/relayd/pfe_filter.c
>> index 347048ece56..e1ae050b768 100644
>> --- a/usr.sbin/relayd/pfe_filter.c
>> +++ b/usr.sbin/relayd/pfe_filter.c
>> @@ -632,7 +632,8 @@ check_table(struct relayd *env, struct rdr *rdr, struct table *table)
>>              goto toolong;
>>  
>>      if (ioctl(env->sc_pf->dev, DIOCRGETTSTATS, &io) == -1)
>> -            fatal("%s: cannot get table stats", __func__);
>> +            fatal("%s: cannot get table stats for %s@%s", __func__,
>> +                io.pfrio_table.pfrt_name, io.pfrio_table.pfrt_anchor);
>>  
>>      return (tstats.pfrts_match);
>>
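
To illustrate the error path discussed above, here is a minimal userland sketch in C. It is not kernel code: the model_* names, the anchor names and the table counts are made up for illustration. It only mirrors the logic quoted from pf_table.c, where a lookup for an anchor that does not exist yields -1 from the count function, and the stats function turns that into ENOENT. The PFR_FLAG_ALLRSETS case is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a pf ruleset; not the real kernel type. */
struct model_ruleset {
	const char	*anchor;	/* anchor path (made-up example) */
	int		 tables;	/* number of tables in the anchor */
};

/* Assumed example state: one known anchor plus the main ruleset. */
static const struct model_ruleset model_rulesets[] = {
	{ "relayd/mail", 2 },
};
static const int model_main_tables = 1;

/* Models pf_find_ruleset(): returns NULL when the anchor is unknown. */
const struct model_ruleset *
model_find_ruleset(const char *anchor)
{
	size_t i;

	for (i = 0; i < sizeof(model_rulesets) / sizeof(model_rulesets[0]); i++)
		if (strcmp(model_rulesets[i].anchor, anchor) == 0)
			return (&model_rulesets[i]);
	return (NULL);
}

/* Models pfr_table_count(): -1 only when a named anchor is missing. */
int
model_table_count(const char *anchor)
{
	const struct model_ruleset *rs;

	if (anchor[0] != '\0') {
		rs = model_find_ruleset(anchor);
		return ((rs != NULL) ? rs->tables : -1);
	}
	return (model_main_tables);
}

/* Models the start of pfr_get_tstats(): a negative count becomes ENOENT. */
int
model_get_tstats(const char *anchor)
{
	if (model_table_count(anchor) < 0)
		return (ENOENT);
	return (0);
}
```

In this model, model_get_tstats("relayd/gone") returns ENOENT, the errno behind the "No such file or directory" message logged by check_table(), while the empty anchor (main ruleset) and the known anchor return 0.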
