On 2018-01-23 08:26:43, danial wrote:
> I had something similar happening. In my case I solved it by disabling NAT-T
> on one end.

Unfortunately disabling NAT-T did not have any effect, the issue has
occurred 2 additional times since restarting with -T. There is no NAT or
firewall between these machines aside from PF with everything permitted
to/from peer IPs (confirmed with pflog that nothing legit is getting
blocked). Thanks though.

I was able to get iked running in debug mode outputting with timestamps
to a file. I'll try to summarize and if anyone is willing to look at
the full debug logs I'm happy to send, but it is a lot.

There seems to be two slightly different symptoms, one where iked does
not gracefully restart and has to be kill -9'd, and another where
everything recovers on its own after an ikesa rekey. In both cases an 
ikev2_msg_retransmit_timeout is eventually logged for one msgid that is 
not received on the other side, even though the underlying connectivity
between the machines is OK. Sometimes tunnel traffic drops minutes
before that retransmit_timeout and flow removal, other times tunnel
traffic passes until the flows are freed.

Example 1 - "Site A" initiates a rekey to "Site B":

Feb 15 23:28:48 pfkey_process: SA 0xb241ab32 is expired, pending rekeying
Feb 15 23:28:48 ikev2_send_create_child_sa: rekeying ESP spi 0xb241ab32
Feb 15 23:28:48 ikev2_pld_parse: header ispi 0x922bc52299efed0e rspi 
0x81f6d03c457134b7 nextpayload SK version 0x20 exchange CREATE_CHILD_SA flags 
0x00 msgid 80 length 720 response 0
Feb 15 23:28:48 ikev2_msg_send: CREATE_CHILD_SA request from a.a.a.a:500 to 
b.b.b.b:500 msgid 80, 720 bytes
Feb 15 23:28:48 ikev2_recv: CREATE_CHILD_SA response from initiator b.b.b.b:500 
to a.a.a.a:500 policy 'siteb' id 80, 736 bytes
Feb 15 23:28:48 ikev2_recv: ispi 0x922bc52299efed0e rspi 0x81f6d03c457134b7
Feb 15 23:28:48 ikev2_recv: updated SA to peer b.b.b.b:500 local a.a.a.a:500
[...]
Feb 15 23:28:48 ikev2_msg_send: INFORMATIONAL request from a.a.a.a:500 to 
b.b.b.b:500 msgid 81, 80 bytes
[...]
Feb 15 23:30:55 ikev2_msg_retransmit_timeout: retransmit limit reached for 
msgid 81
Feb 15 23:30:55 sa_free: ispi 0x922bc52299efed0e rspi 0x81f6d03c457134b7
Feb 15 23:30:55 config_free_proposals: free 0x13b9a0591300
Feb 15 23:30:55 config_free_proposals: free 0x13b9950e9180
Feb 15 23:30:55 config_free_childsas: free 0x13b9a98b3e00
Feb 15 23:30:55 config_free_childsas: free 0x13b9798f7000
Feb 15 23:30:55 sa_free_flows: free 0x13b9afcdec00
[... all flows removed, traffic drops ...]

iked on the other side ("Site B") receives and responds to the
CREATE_CHILD_SA request (msgid 80) but the INFORMATIONAL request msgid
81 above is not seen on the other side. A ctrl-c of iked running in the
foreground stopped the control and ca processes, but parent and ikev2
processes hung and both needed a kill -9 like before. I attached ktrace
to the parent and ikev2 PIDs but no operations were logged on either. Is
there something else I should do in this scenario to debug the hung
processes?

--------------------------------------------------------------

Example 2 - This one eventually recovered on its own with a full ikesa
rekey. Does the "pfkey_process is busy" point to anything? This is only
a very small portion of the debug log, but tunnel traffic stopped
passing at at 05:44 which is minutes before the retransmit timeout and
the flows were actually removed.

Site A:
Feb 19 05:44:51 pfkey_process: SA 0x1f00e21b is expired, pending rekeying
Feb 19 05:45:19 pfkey_process: SA 0x3382f769 is expired, pending rekeying
Feb 19 05:45:35 pfkey_process: SA 0x7b59c4c8 is expired, pending rekeying
Feb 19 05:45:49 pfkey_process: SA 0x985fd6fa is expired, pending rekeying
Feb 19 05:46:03 pfkey_process: SA 0xbc92b3b5 is expired, pending rekeying
Feb 19 05:46:21 ikev2_msg_send: INFORMATIONAL request from a.a.a.a:500 to 
b.b.b.b:500 msgid 649, 80 bytes
[...]
Feb 19 05:47:49 ikev2_msg_send: INFORMATIONAL request from a.a.a.a:500 to 
b.b.b.b:500 msgid 650, 80 bytes
[...]
Feb 19 05:49:55 ikev2_msg_retransmit_timeout: retransmit limit reached for 
msgid 650

Site B:
Again, msgid 649 is received and responded to but not the very next
msgid 650:

Feb 19 05:46:21 ikev2_pld_parse: header ispi 0x43855e1590d3e29a rspi 
0x8a04b23198062451 nextpayload SK version 0x20 exchange INFORMATIONAL flags 
0x08 msgid 649 length 80 response 0
Feb 19 05:46:21 ikev2_pld_parse: header ispi 0x43855e1590d3e29a rspi 
0x8a04b23198062451 nextpayload SK version 0x20 exchange INFORMATIONAL flags 
0x20 msgid 649 length 80 response 1
Feb 19 05:46:21 ikev2_msg_send: INFORMATIONAL response from b.b.b.b:500 to 
a.a.a.a:500 msgid 649, 80 bytes

# grep "Feb 19" iked_debug.log | grep "msgid 650"
#
# grep "Feb 19" iked_debug.log | grep -B1 "is busy"
Feb 19 05:44:50 pfkey_process: SA 0x1f00e21b is expired, pending rekeying
Feb 19 05:44:50 pfkey_dispatch: pfkey_process is busy, retry later
--
Feb 19 05:45:19 pfkey_process: SA 0x3382f769 is expired, pending rekeying
Feb 19 05:45:19 pfkey_dispatch: pfkey_process is busy, retry later
--
Feb 19 05:45:34 pfkey_process: SA 0x7b59c4c8 is expired, pending rekeying
Feb 19 05:45:34 pfkey_dispatch: pfkey_process is busy, retry later
--
Feb 19 05:45:48 pfkey_process: SA 0x985fd6fa is expired, pending rekeying
Feb 19 05:45:48 pfkey_dispatch: pfkey_process is busy, retry later
--
Feb 19 05:46:02 pfkey_process: SA 0xbc92b3b5 is expired, pending rekeying
Feb 19 05:46:02 pfkey_dispatch: pfkey_process is busy, retry later
#

Reply via email to