Section 2, which describes RFC 4306 crash recovery, is not complete. It
does not include the normal processing happening on the peer that is not
rebooting.
I suggest adding the following text:

----------------------------------------------------------------------
When one peer loses state or reboots, it might not be able to recover
immediately (especially in the case of a reboot). This means that at
first the peer just goes silent, i.e. it does not send or respond to
any messages. A conforming IKEv2 implementation will detect this
situation and follow the rules given in section 2.4:

   "If there has only been outgoing traffic on all of the SAs
   associated with an IKE SA, it is essential to confirm liveness of
   the other endpoint to avoid black holes. If no cryptographically
   protected messages have been received on an IKE SA or any of its
   Child SAs recently, the system needs to perform a liveness check in
   order to prevent sending messages to a dead peer."

I.e. the peer will usually start liveness checks even before the other
end sends an INVALID_SPI notification, because it has detected that the
other end is no longer sending any packets while it is still rebooting
or recovering from the situation.

This means that the several-minute recovery period overlaps the actual
recovery time of the other peer, i.e. if the security gateway requires
several minutes to boot up after the crash, then the other peers have
already finished their liveness checks before the crashing peer even
has a chance to send INVALID_SPI notifications.

There are cases where the peer loses state and is able to recover
immediately; in those cases recovery might take several minutes.

Note that the IKEv2 specification specifically leaves the number of
retries and the lengths of timeouts out of the specification, as they
do not affect interoperability. This means that implementations are
allowed to use the INVALID_SPI messages as hints that shorten those
timeouts (i.e. different environments and situations require different
rules). Good existing IKEv2 implementations already do that (i.e. they
shorten timeouts or limit the number of retries based on such hints)
and also start liveness checks quickly after the other end goes silent.
----------------------------------------------------------------------

The final paragraph saying:

   Those "at least several minutes" are a time during which both peers
   are active, but IPsec cannot be used.

is incorrect, as it is only true when the crashed peer recovered
instantaneously. In the normal case most of that time actually overlaps
the recovery time of the peer.
--
The protocol currently says that:

   Supporting implementations will send a notification, called a "QCD
   token", as described in Section 4.1 in the last IKE_AUTH exchange
   messages. These are the final IKE_AUTH request and final IKE_AUTH
   response that contain the AUTH payloads.

This is very different from all other processing; usually this kind of
payload is put in the same packet that contains the traffic selectors
etc. Is there some reason why it is done this way?
--
Also, do we really need the QCD token for the initiator too? The
initiator has already proven to be able to create the IKE SA on its
own, and it will have enough information to recreate the IKE SA after
the boot.

The responder usually does not have enough information to recreate the
IKE SA on its own after a reboot; for example, when it just has an IP
packet it needs to forward to the peer, it might no longer know the
peer address the IKE SA was connected to. The initiator must already
have that information, as it was able to trigger IPsec SA creation in
the first place based on the IP packet.

I think it would simplify the implementations and the protocol to allow
only responders to be token makers, without losing any of the
functionality.
--
Section 7.4 is mostly wrong. The default retransmission policy needed
for MOBIKE cases is much, much longer than what is needed in the normal
case.
When MOBIKE switches from one interface to another there might be very
long delays because of this (for example, the device first needs to
notice that the old interface does not work anymore, and then perhaps
it needs to run DHCP and other link-related protocols on the new
interface before it can even try it, and all of those take a long
time). For example, in our implementation MOBIKE uses MUCH longer
timeouts just to make sure we do not time out the IKE exchanges while
we are trying to go through all possible interfaces etc.

Because of those even longer timeouts there is a very good reason to
shorten the timeouts when we get any feedback from the other end (i.e.
INVALID_SPI notifications).

The timeouts used in different situations, even in the same
implementation, need to be different. In our case, when you enable
MOBIKE the number of retries used is more than 2 times what it is if
you do not turn MOBIKE on.

Also, the last paragraph again assumes that the peer staying up didn't
start a liveness check almost immediately when the crashing peer
crashed. This is already part of the standard IKEv2 specification, so
implementations need to do that. This means the timeout starts from the
time of the crash, not from the time when the gateway is up again.

Anyway, as all this is standard IKEv2 already, it does not belong here
in the alternative solutions section, but belongs in section 2.
--
Section 8 again ignores the IKEv2 text saying:

   "If there has only been outgoing traffic on all of the SAs
   associated with an IKE SA, it is essential to confirm liveness of
   the other endpoint to avoid black holes. If no cryptographically
   protected messages have been received on an IKE SA or any of its
   Child SAs recently, the system needs to perform a liveness check in
   order to prevent sending messages to a dead peer."
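
To make the point concrete, here is a rough sketch of the rule quoted
above, together with the idea of treating INVALID_SPI as a hint that
shortens the retry budget. It is illustrative only: the IkeSaState
record, the function names, the 10 second quiet period and the retry
counts are all my assumptions, since the IKEv2 specification
deliberately leaves timers and retry counts unspecified.

```python
from dataclasses import dataclass


@dataclass
class IkeSaState:
    """Hypothetical per-IKE-SA bookkeeping (not from any real implementation)."""
    last_protected_rx: float  # last time a cryptographically protected message arrived
    last_tx: float            # last time we sent on the IKE SA or any of its Child SAs


def needs_liveness_check(sa: IkeSaState, now: float, quiet_period: float = 10.0) -> bool:
    """True when there has recently been only outgoing traffic on the SAs.

    quiet_period is an assumed local policy value, not a value from the spec.
    """
    sent_recently = (now - sa.last_tx) <= quiet_period
    heard_recently = (now - sa.last_protected_rx) <= quiet_period
    return sent_recently and not heard_recently


def retry_budget(default_retries: int, got_invalid_spi_hint: bool,
                 shortened_retries: int = 2) -> int:
    """Pick the number of liveness retransmissions to attempt.

    An INVALID_SPI notification only shortens the budget; the SA is still
    torn down only after the (shortened) liveness check fails.
    shortened_retries is an assumed policy value.
    """
    if got_invalid_spi_hint:
        return min(default_retries, shortened_retries)
    return default_retries
```

With this kind of check the surviving peer starts probing as soon as it
has been sending into silence for the quiet period, i.e. the clock
starts at the crash and not when the gateway comes back up.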
This applies especially to the following text in section 8:

   A failed gateway may go undetected for as long as the lifetime of a
   child SA, because IPsec does not have packet acknowledgement, and
   applications cannot signal the IPsec layer that the tunnel "does not
   work".

If the gateway has failed and there is ANY traffic on any of the IPsec
SAs, then from the other peer's point of view there is only outgoing
traffic, so it needs to do a liveness check to verify that the other
end is alive. Thus the failed gateway cannot really go undetected for
as long as the lifetime of a child SA, unless the lifetime is on the
order of a few minutes :-)

I know there are implementations that do not implement that part of the
IKEv2 specification, but that does not mean that the part is not there.
We should not write our specifications to cover broken implementations,
but should assume that implementations follow the IKEv2 specification.

Note that the IKEv2 text does not have any conditionals there; it says
that "...the system needs to perform a liveness check...". It does not
say it may, or even should, do it; it says it needs to be done.

Also, I think the picture itself is a bit incorrect; the exchange after
the reboot should probably be:

                    ---- Reboot -----

   HDR, SK {}  -->
               <--  HDR, N(INVALID_IKE_SPI), N(QCD_TOKEN)

I.e. I assume the first packet is a normal liveness check, and the
reply is a normal INVALID_IKE_SPI with QCD_TOKEN.
--
Section 9.1 says that inter-domain VPN gateways should do both, but I
think that inter-domain VPN gateways do not really need this
specification at all, as they know the other end's IP addresses etc.
from configuration; thus when the inter-domain VPN gateway comes up, it
can immediately create the IKE SAs needed based on the configuration.
This is the case where either end of the inter-domain VPN can act as an
initiator, i.e. no EAP is used and neither end is behind a NAT.
If one of the inter-domain VPN gateways is behind a restricted NAT,
then the situation is more or less similar to the remote-access client
case (i.e. only that end can initiate connections), and as the other
peer cannot initiate connections to the gateway behind the NAT, there
is no point in supporting the token taker role on that end.
--
kivi...@iki.fi