On 2018-01-23 08:26:43, danial wrote: > I had something similar happening. In my case I solved it by disabling NAT-T > on one end.
Unfortunately disabling NAT-T did not have any effect, the issue has occurred 2 additional times since restarting with -T. There is no NAT or firewall between these machines aside from PF with everything permitted to/from peer IPs (confirmed with pflog that nothing legit is getting blocked). Thanks though. I was able to get iked running in debug mode outputting with timestamps to a file. I'll try to summarize and if anyone is willing to look at the full debug logs I'm happy to send, but it is a lot. There seems to be two slightly different symptoms, one where iked does not gracefully restart and has to be kill -9'd, and another where everything recovers on its own after an ikesa rekey. In both cases an ikev2_msg_retransmit_timeout is eventually logged for one msgid that is not received on the other side, even though the underlying connectivity between the machines is OK. Sometimes tunnel traffic drops minutes before that retransmit_timeout and flow removal, other times tunnel traffic passes until the flows are freed. Example 1 - "Site A" initiates a rekey to "Site B": Feb 15 23:28:48 pfkey_process: SA 0xb241ab32 is expired, pending rekeying Feb 15 23:28:48 ikev2_send_create_child_sa: rekeying ESP spi 0xb241ab32 Feb 15 23:28:48 ikev2_pld_parse: header ispi 0x922bc52299efed0e rspi 0x81f6d03c457134b7 nextpayload SK version 0x20 exchange CREATE_CHILD_SA flags 0x00 msgid 80 length 720 response 0 Feb 15 23:28:48 ikev2_msg_send: CREATE_CHILD_SA request from a.a.a.a:500 to b.b.b.b:500 msgid 80, 720 bytes Feb 15 23:28:48 ikev2_recv: CREATE_CHILD_SA response from initiator b.b.b.b:500 to a.a.a.a:500 policy 'siteb' id 80, 736 bytes Feb 15 23:28:48 ikev2_recv: ispi 0x922bc52299efed0e rspi 0x81f6d03c457134b7 Feb 15 23:28:48 ikev2_recv: updated SA to peer b.b.b.b:500 local a.a.a.a:500 [...] Feb 15 23:28:48 ikev2_msg_send: INFORMATIONAL request from a.a.a.a:500 to b.b.b.b:500 msgid 81, 80 bytes [...] Feb 15 23:30:55 ikev2_msg_retransmit_timeout: retransmit limit reached for msgid 81 Feb 15 23:30:55 sa_free: ispi 0x922bc52299efed0e rspi 0x81f6d03c457134b7 Feb 15 23:30:55 config_free_proposals: free 0x13b9a0591300 Feb 15 23:30:55 config_free_proposals: free 0x13b9950e9180 Feb 15 23:30:55 config_free_childsas: free 0x13b9a98b3e00 Feb 15 23:30:55 config_free_childsas: free 0x13b9798f7000 Feb 15 23:30:55 sa_free_flows: free 0x13b9afcdec00 [... all flows removed, traffic drops ...] iked on the other side ("Site B") receives and responds to the CREATE_CHILD_SA request (msgid 80) but the INFORMATIONAL request msgid 81 above is not seen on the other side. A ctrl-c of iked running in the foreground stopped the control and ca processes, but parent and ikev2 processes hung and both needed a kill -9 like before. I attached ktrace to the parent and ikev2 PIDs but no operations were logged on either. Is there something else I should do in this scenario to debug the hung processes? -------------------------------------------------------------- Example 2 - This one eventually recovered on its own with a full ikesa rekey. Does the "pfkey_process is busy" point to anything? This is only a very small portion of the debug log, but tunnel traffic stopped passing at at 05:44 which is minutes before the retransmit timeout and the flows were actually removed. Site A: Feb 19 05:44:51 pfkey_process: SA 0x1f00e21b is expired, pending rekeying Feb 19 05:45:19 pfkey_process: SA 0x3382f769 is expired, pending rekeying Feb 19 05:45:35 pfkey_process: SA 0x7b59c4c8 is expired, pending rekeying Feb 19 05:45:49 pfkey_process: SA 0x985fd6fa is expired, pending rekeying Feb 19 05:46:03 pfkey_process: SA 0xbc92b3b5 is expired, pending rekeying Feb 19 05:46:21 ikev2_msg_send: INFORMATIONAL request from a.a.a.a:500 to b.b.b.b:500 msgid 649, 80 bytes [...] Feb 19 05:47:49 ikev2_msg_send: INFORMATIONAL request from a.a.a.a:500 to b.b.b.b:500 msgid 650, 80 bytes [...] Feb 19 05:49:55 ikev2_msg_retransmit_timeout: retransmit limit reached for msgid 650 Site B: Again, msgid 649 is received and responded to but not the very next msgid 650: Feb 19 05:46:21 ikev2_pld_parse: header ispi 0x43855e1590d3e29a rspi 0x8a04b23198062451 nextpayload SK version 0x20 exchange INFORMATIONAL flags 0x08 msgid 649 length 80 response 0 Feb 19 05:46:21 ikev2_pld_parse: header ispi 0x43855e1590d3e29a rspi 0x8a04b23198062451 nextpayload SK version 0x20 exchange INFORMATIONAL flags 0x20 msgid 649 length 80 response 1 Feb 19 05:46:21 ikev2_msg_send: INFORMATIONAL response from b.b.b.b:500 to a.a.a.a:500 msgid 649, 80 bytes # grep "Feb 19" iked_debug.log | grep "msgid 650" # # grep "Feb 19" iked_debug.log | grep -B1 "is busy" Feb 19 05:44:50 pfkey_process: SA 0x1f00e21b is expired, pending rekeying Feb 19 05:44:50 pfkey_dispatch: pfkey_process is busy, retry later -- Feb 19 05:45:19 pfkey_process: SA 0x3382f769 is expired, pending rekeying Feb 19 05:45:19 pfkey_dispatch: pfkey_process is busy, retry later -- Feb 19 05:45:34 pfkey_process: SA 0x7b59c4c8 is expired, pending rekeying Feb 19 05:45:34 pfkey_dispatch: pfkey_process is busy, retry later -- Feb 19 05:45:48 pfkey_process: SA 0x985fd6fa is expired, pending rekeying Feb 19 05:45:48 pfkey_dispatch: pfkey_process is busy, retry later -- Feb 19 05:46:02 pfkey_process: SA 0xbc92b3b5 is expired, pending rekeying Feb 19 05:46:02 pfkey_dispatch: pfkey_process is busy, retry later #