Hello. This is my first time requesting help and I initially sent this message 
last week, but it did not make it onto the mailing list.
I think that the message content may have been too long due to some packet 
traces which I had pasted into the message body,
so I've moved those to online pastes. I can also attach them as files if that 
is more appropriate.
I hope that my post and my question are formatted and formulated appropriately.

I have a vnet jail which sends and receives UDP packets with other hosts on the 
internet on ports 500 and 4500.
The jail has a private address on one side of an epair, and the other side is 
connected to a bridge on the host.
NAT is performed on the host with respect to the host's external interface and 
public IP address, which we can take to be 1.1.1.1.
Please see below for excerpts of the firewall rules and packet traces. NAT is 
configured to forward incoming UDP on 500 and 4500 to the jail.
This redirection and the other parts of the ruleset work, as have other port 
redirections (though none of them were UDP). Currently, there are no other port 
redirections.

One caveat with the description of the problem below: I am describing it as NAT 
failing to redirect the packets, but it's also possible that they are somehow 
being dropped.
Unfortunately I do not have a working IPFW log interface on this host (such as 
ipfw0). The log interface works on many other hosts, which are configured 
identically as far as I can tell, but on certain hosts it has never worked.
Getting the log interface working might help diagnose this issue, but 
ultimately I'm not sure it's necessary, since there is presumably another 
underlying issue at hand.
In any case, I'm describing the issue as NAT failing to redirect packets.

Problem statement:
The problem is that on this specific host, on occasion (usually once every few 
weeks), NAT begins failing to redirect some or all UDP packets, but only those 
from specific external host(s).
Other hosts are not affected at all. The affected external host(s) sometimes 
change from one occurrence to the next.
Sometimes, looking at packet traces, it's clear that there are specific types 
of underlying application messages that are not being redirected, while some 
are (see example packet traces, below).
Unfortunately, there does not seem to be anything that identifies these packets 
except from an application standpoint. For example, they are often the same 
size as other packets that make it through.
Other times, all incoming UDP packets on those ports (from the specific hosts 
in question) are not redirected.
If certain packets do make it through, this usually corresponds to the 
application, an IKE client, getting to a certain specific point in its 
conversation with the IKE server; this seems to basically be the first 
step/round of messages.
For example, it appears that none of the packets marked as ikev2_auth[R] by 
tcpdump make it through.

Current workarounds:
- Identical rules:
One way to fix this temporarily is to issue another identical NAT statement 
(numbered 2) and create IPFW rules at 445, 446, and 447 which are identical 
to those at 450, 451, and 452, but specified relative to the external host in 
question.
This usually fixes the issue, but only temporarily because eventually the issue 
will appear again; at that point, it is usually (but not always) fixed by 
removing those rules 445-447.

- Reboot:
Rebooting sometimes helps, or just moves the issue to a different host. 
Sometimes the external host in question stays the same across several reboots.
Typically, after several reboots, and possibly waiting several minutes, the 
issue will go away, at least temporarily.
I have not measured how much time is typically required to wait after 
rebooting, but it seems to be about 5-10 minutes. After that point, if it is 
not working, another reboot is required.
Sometimes, upon reboot, the issue will not be present for a short period of 
time, usually less than a minute, but then it will appear.
In the most recent case, for the external host in question which is considered 
in the packet traces below, it worked for 6 seconds before failing, two reboots 
in a row.
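
For reference, the "identical rules" workaround above can be sketched roughly 
as follows. This is a minimal sketch, not the exact commands: em0 is a 
placeholder for $extif, 2.2.2.2 stands in for the affected external host, and 
it assumes that rules 445-447 point at NAT instance 2 and simply replace "any" 
on the remote side with that host. It prints the commands by default; setting 
IPFW=ipfw would apply them.

```shell
#!/bin/sh
# Sketch of the "identical rules" workaround for one affected peer.
# Dry run by default: prints the ipfw commands. Set IPFW=ipfw to apply them.
ipfw_cmd="${IPFW:-echo ipfw}"
extif="${extif:-em0}"      # placeholder interface name (assumption)
peer="${peer:-2.2.2.2}"    # affected external host, as in the traces

workaround_rules() {
    # Second NAT instance, configured like instance 1.
    $ipfw_cmd -q nat 2 config if "$extif" unreg_only reset \
        redirect_port udp 192.168.21.4:500 500 \
        redirect_port udp 192.168.21.4:4500 4500
    # Peer-specific copies of rules 450-452 (assumed here to use nat 2).
    $ipfw_cmd add 445 nat 2 udp from "$peer" to any 500,4500 in via "$extif"
    $ipfw_cmd add 446 nat 2 udp from any to "$peer" 500,4500 out via "$extif"
    $ipfw_cmd add 447 allow udp from "$peer" to any 500,4500 via "$extif"
}

workaround_rules
```

Anything these rules do not match falls through to the unchanged rules at 
450-452.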

Notes about the firewall:
- This system has net.inet.ip.fw.one_pass=0
- The firewall rules are set at boot time by rc and only change if I change 
them explicitly.
- I have tried removing the reassembly rule, and I have also tried putting a 
skipto rule before the reassembly so that it does not apply to the UDP packets 
in question.
- I think that I've tried in the past to work around this by using natd with 
divert rules. I've just tried this, and natd fails to start with "aliasing 
address not given". I think this may be the same error I encountered in the 
past, which would explain why I had not set up natd as a workaround. We could 
try investigating this error, but it may not be necessary. If desired, I can 
provide my natd.conf.
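
To illustrate the skipto attempt mentioned in the notes above, it was along 
these lines. This is only a sketch: the rule numbers are placeholders (the 
posted ruleset has the reassembly rule at 1 and the next rule at 10), and the 
commands are printed rather than applied unless IPFW=ipfw is set.

```shell
#!/bin/sh
# Sketch: let the IKE/NAT-T UDP packets bypass the reassembly rule.
# Dry run by default: prints the ipfw commands. Set IPFW=ipfw to apply them.
ipfw_cmd="${IPFW:-echo ipfw}"

reass_bypass_rules() {
    # Jump UDP 500/4500 past reass to the first rule after it (10 here).
    # Rule numbers are placeholders chosen for this sketch.
    $ipfw_cmd add 1 skipto 10 udp from any to any 500,4500 in
    $ipfw_cmd add 2 reass all from any to any in
}

reass_bypass_rules

# The one_pass setting noted above can be confirmed with:
#   sysctl net.inet.ip.fw.one_pass
```

Only packets whose UDP header is visible (unfragmented packets and first 
fragments) can match on the port numbers in the skipto rule.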

Note regarding NAT on some external hosts:
This issue has also happened to the packets of one specific external host 
that's behind a NAT while not happening to the packets of another specific 
external host that is behind the same NAT.
In other words, they share the same IP address. In this case, even though the 
packets are arriving at this system from the same IP address, somehow only the 
packets from that specific host face this issue.
The two hosts are doing basically the same thing and are largely configured the 
same, so I would expect the conversation with one host to be indistinguishable 
from the conversation with another host,
except possibly for certain identifiers that appear in the packets; in other 
words, the specific host behind this NAT that the packets arrive from should 
only be distinguishable from these packets by examining the application layer.
Furthermore, sometimes it is one host that works, sometimes it is the other; 
indeed usually both work.
This might suggest that the underlying issue involves NAT or IPFW state, and 
so ultimately has something to do with timing.
However, I have noticed that flushing the IPFW rules and re-applying them does 
not fix the issue.

Requested assistance:
- Some kind of DTrace script that I can use to trace what is happening to these 
packets when this issue occurs. Currently it is not happening, and it could be 
a week or more before it happens again, though I could possibly induce it by 
rebooting.
- Any other ideas for what could be happening or suggestions for diagnosing 
this.

Miscellaneous facts:
- The MTU on the external interface, the bridge, and the epair is 1500 in each 
case.

I am completely mystified by this issue, so I appreciate any help.
Thanks

KERNEL/SYSTEM CONFIGURATION:

I'm not sure if a custom kernel is needed anymore. This configuration has 
stayed mostly the same since more or less FreeBSD 11. It is the same as the 
configuration on all other hosts in question.
I don't remember when this issue first started, but it may have been after 
upgrading to 12.3, about a year ago.
I recall reading that there were some changes to some parts of the networking 
code, which would have taken effect on this system in that update.
This system is running 12.3-RELEASE-p2 r371548, and unfortunately I don't have 
a record of what revision it was upgraded from last year, but it was something 
in the 360s.
All other hosts are running the same version and revision. I have experienced 
this issue on a few of the others, but the issue takes place overwhelmingly on 
this particular host.

include GENERIC
ident CUSTOM

options IPSEC
options VIMAGE
options RACCT
options RCTL
options IPFIREWALL
options IPFIREWALL_NAT
options LIBALIAS
options IPFIREWALL_VERBOSE
options IPFIREWALL_VERBOSE_LIMIT=0
options IPDIVERT

options AUDIT

options DUMMYNET
options NETGRAPH
options MROUTING
nooptions SCTP

device enc
device gre
device crypto
device netmap

+++++++++++++++++++++++++++++++++++++++++
FIREWALL RULES (ON HOST)

192.168.21.4 is the IP address of the jail in question.
I've removed the major part of the config that comes after the main NAT 
section, and also a few rules from before 540 which are unrelated.

++++

ipfw -q -f flush
ipfw table all destroy
add="ipfw add"

set -x

$add 1 reass all from any to any in

# default rules
$add 10 allow ip from any to any via lo0
$add 20 deny ip from any to 127.0.0.1/8
$add 30 deny ip from 127.0.0.1/8 to any
$add 40 deny ip from any to ::1
$add 50 deny ip from ::1 to any
$add 60 allow ipv6-icmp from :: to ff02::/16
$add 70 allow ipv6-icmp from fe80::/10 to fe80::/10
$add 80 allow ipv6-icmp from fe80::/10 to ff02::/16
$add 90 allow ipv6-icmp from any to any ip6 icmp6types 1
$add 100 allow ipv6-icmp from any to any ip6 icmp6types 2,135,136
$add 101 allow ipv6-icmp from any to any ip6 icmp6types 128,129

#$add 200 allow ip6 from any to any out via $extif keep-state

ipfw -q nat 1 config if $extif unreg_only reset \
redirect_port udp 192.168.21.4:500 500 \
redirect_port udp 192.168.21.4:4500 4500

$add 300 allow tcp from any to $exthost 22 via $extif setup keep-state

# $DENY_SKIP is a special rule which is a skipto to an area near the end of the 
# ruleset where packets are diverted to a logging system and then dropped.
$add 400 $DENY_SKIP ip from any to 192.168.0.0/16 out via $extif
$add 401 $DENY_SKIP ip from 192.168.0.0/16 to any in via $extif
$add 402 $DENY_SKIP ip from any to 10.0.0.0/8 out via $extif
$add 403 $DENY_SKIP ip from 10.0.0.0/8 to any in via $extif

$add 450 nat 1 udp from any to any 500,4500 in via $extif
$add 451 nat 1 udp from any to any 500,4500 out via $extif
$add 452 allow udp from any to any 500,4500 via $extif

$add 500 nat 1 ip from any to any via $extif in

$add 505 check-state :default

$add 510 skipto 65000 tcp from any to any out via $extif keep-state
$add 520 skipto 65000 udp from any to any out via $extif keep-state
$add 530 skipto 65000 icmp from any to any out via $extif keep-state
$add 540 $DENY_SKIP ip from any to any out via $extif

[...]

$add 64900 deny log all from any to any
$add 65000 nat 1 ip from any to any via $extif out
$add 65534 allow ip from any to any

++++

+++++++++++++++++++++++++++++++++++++++++

In the following packet traces, 2.2.2.2 is the external host whose packets were 
experiencing this issue on the most recent occasion.
I've tried to be careful about making sure it's all correct, but there could be 
errors.

+++++++++++++++++++++++++++++++++++++++++
EXAMPLE PACKET TRACE - ALL INCOMING PACKETS (TRACE FROM EXTERNAL INTERFACE)

https://bsd.to/n75k

+++++++++++++++++++++++++++++++++++++++++
EXAMPLE PACKET TRACE - SUCCESSFULLY REDIRECTED/NON-DROPPED PACKETS (TRACE FROM 
JAIL SIDE OF EPAIR INTERFACE)

https://bsd.to/T8uz

+++++++++++++++++++++++++++++++++++++++++
EXAMPLE PACKET TRACE - DROPPED/NON-REDIRECTED PACKETS ONLY (DIFFERENCE BETWEEN 
THE TRACES)

https://bsd.to/ycVW
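
For reference, captures like the ones linked above can be taken with tcpdump 
at the two points described. The following is a rough sketch rather than the 
exact invocations used: em0 and epair0b are placeholder interface names, the 
output file names are made up, and 2.2.2.2 is the external host from the 
traces. The script only prints the command lines.

```shell
#!/bin/sh
# Prints tcpdump command lines for the two capture points; run them as root.
extif="${extif:-em0}"         # placeholder: host external interface
jailif="${jailif:-epair0b}"   # placeholder: jail side of the epair
peer="${peer:-2.2.2.2}"       # external host from the traces

ike_filter() {
    # BPF filter for IKE (500) and NAT-T (4500) traffic with one peer.
    printf 'udp and host %s and (port 500 or port 4500)' "$1"
}

capture_cmds() {
    f=$(ike_filter "$peer")
    printf "tcpdump -n -i %s -w ext.pcap '%s'\n" "$extif" "$f"
    printf "tcpdump -n -i %s -w jail.pcap '%s'\n" "$jailif" "$f"
}

capture_cmds
```

The second command would be run inside the jail (for example via jexec), since 
the jail side of the epair lives in the jail's vnet. Because redirect_port only 
rewrites the destination, the same source-host filter applies on both sides.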
