[IPsec] Issue #197 - More text needed to describe RFC4306^H^H^H^H5996 recovery

Yoav Nir Thu, 21 Oct 2010 04:19:37 -0700

Hi all. Tero Kivinen sent the message included below to the mailing list on 
September 8th.


I am fine with this text.

Please read it thoroughly, and if there are no objections, I will incorporate 
it into the next version of the draft (which I intend to publish at the last 
possible moment on Monday).

Yoav

Begin forwarded message:

> From: Tero Kivinen <kivi...@iki.fi>
> Date: September 8, 2010 1:07:03 PM GMT+03:00
> To: "ipsec@ietf.org" <ipsec@ietf.org>
> Subject: [IPsec] Comments to draft-ietf-ipsecme-failure-detection-00
> 
> Section 2, describing RFC4306 crash recovery, is not complete. It
> does not include the normal processing happening on the peer that is
> not rebooting.
> 
> I suggest adding the following text:
> ----------------------------------------------------------------------
> When one peer loses state or reboots, it might not be able to
> recover immediately (especially in the case of a reboot). This means
> that at first the peer just goes silent, i.e. does not send or respond
> to any messages. A conforming IKEv2 implementation will detect this
> situation and follow the rules given in section 2.4:
> 
>              "If there has only been outgoing traffic on all of
>   the SAs associated with an IKE SA, it is essential to confirm
>   liveness of the other endpoint to avoid black holes.  If no
>   cryptographically protected messages have been received on an IKE SA
>   or any of its Child SAs recently, the system needs to perform a
>   liveness check in order to prevent sending messages to a dead peer."
> 
> I.e. the peer will usually start liveness checks even before the other
> end sends an INVALID_SPI notification, as it has detected that the
> other end is no longer sending any packets while it is still rebooting
> or recovering from the situation.
> 
> This means that the recovery period of several minutes overlaps the
> actual recovery time of the other peer, i.e. if the security gateway
> requires several minutes to boot up after the crash, then the other
> peers have already finished their liveness checks before the crashed
> peer even has a chance to send INVALID_SPI notifications.
> 
> There are cases where the peer loses state and is able to recover
> immediately; in those cases recovery might still take several minutes.
> 
> Note that the IKEv2 specification specifically leaves the number of
> retries and the lengths of timeouts unspecified, as they do
> not affect interoperability. This means that implementations are
> allowed to treat INVALID_SPI messages as hints
> that shorten those timeouts (i.e. different environments and
> situations require different rules).
> 
> Good existing IKEv2 implementations already do that (i.e. either
> shorten timeouts or limit the number of retries) based on such hints,
> and also start liveness checks quickly after the other end goes silent.
> ----------------------------------------------------------------------
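
A minimal sketch of the two behaviours the suggested text describes: starting a liveness check when only outgoing traffic has been seen, and treating INVALID_SPI messages as hints that shorten retransmission. The function names, the idle threshold, and the "two retries per hint" rule are illustrative assumptions; RFC 5996 leaves all timing values to the implementation.

```python
def needs_liveness_check(last_inbound, last_outbound, now, idle_threshold):
    """Return True when a liveness check should be started.

    last_inbound:  time of the last cryptographically protected message
                   received on the IKE SA or any of its Child SAs.
    last_outbound: time of the last message sent on those SAs.
    idle_threshold is a hypothetical local policy value, not from the RFC.
    """
    only_outgoing = last_outbound > last_inbound
    silent_for = now - last_inbound
    return only_outgoing and silent_for >= idle_threshold


def retries_left(base_retries, sent, invalid_spi_hints, cut_per_hint=2):
    """Return how many retransmissions remain.

    Each INVALID_SPI hint trims the retry budget (assumed rule), but the
    budget never drops below one: the hint is unauthenticated, so it only
    shortens the wait and never deletes the SA by itself.
    """
    budget = max(1, base_retries - cut_per_hint * invalid_spi_hints)
    return max(0, budget - sent)
```

With these assumed values, a peer that last heard from the other end at t=100, last sent at t=125, and uses a 20-second threshold starts probing at t=130; with a base of 8 retries and 3 already sent, two hints leave one retry instead of five.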
> 
> The final paragraph saying:
> 
>   Those "at least several minutes" are a time during which both peers
>   are active, but IPsec cannot be used.
> 
> is incorrect, as it is only true when the crashed peer recovered
> instantaneously. In the normal case most of that time actually
> overlaps the recovery time of the peer.
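
The overlap argument can be written as a small calculation. This is only a sketch, assuming the surviving peer starts its liveness check at the moment of the crash; the function name and the sample durations are hypothetical.

```python
def unusable_window(detect_seconds, reboot_seconds):
    """Seconds during which both peers are up but IPsec cannot be used.

    detect_seconds: time the surviving peer needs to declare the IKE SA
                    dead, measured from the crash (assumed).
    reboot_seconds: time the crashed peer needs to come back up.
    """
    return max(0.0, detect_seconds - reboot_seconds)
```

With a 180-second detection time, an instantaneous recovery gives `unusable_window(180.0, 0.0)` = 180 seconds, which is the case the draft describes. A gateway that takes 240 seconds to reboot gives 0, because detection has finished before the gateway is back.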
> 
> --
> 
> The protocol currently says that:
> 
>   Supporting implementations will send a notification, called a "QCD
>   token", as described in Section 4.1 in the last IKE_AUTH exchange
>   messages.  These are the final IKE_AUTH request and final IKE_AUTH
>   response that contain the AUTH payloads.
> 
> This is very different from all other processing; usually these
> kinds of payloads are put in the same packet that contains the traffic
> selectors etc. Is there some reason why it is done this way?
> 
> --
> 
> Also, do we really need the QCD token for the initiator too? The
> initiator has already proven to be able to create the IKE SA on its
> own, and it will have enough information to recreate the IKE SA after
> the boot. The responder usually does not have enough information to be
> able to recreate the IKE SA on its own after reboot, as it might not,
> for example, know anymore which peer address the IKE SA
> was connected to when all it has is an IP packet it needs to forward to
> that peer. The initiator must already have that information, as it was
> able to trigger IPsec SA creation in the first place based on the IP
> packet.
> 
> I think it would simplify the implementations and the protocol to
> specify that only responders can be token makers, without losing any
> of the functionality.
> 
> --
> 
> Section 7.4 is mostly wrong.
> 
> The default retransmission policy needed for MOBIKE cases is much,
> much longer than what is needed in the normal case. When MOBIKE switches
> from one interface to another there might be very long delays
> because of this (for example, the device first needs to notice that the
> old interface does not work anymore, and then perhaps it needs to run
> DHCP and other link-related protocols on the new interface before it
> can even try it, and all of those take a long time).
> 
> For example, in our implementation MOBIKE uses MUCH longer timeouts
> just to make sure we do not time out the IKE exchanges while we are
> trying to go through all possible interfaces etc. Because of those
> even longer timeouts there is a very good reason to shorten those
> timeouts in case we get any feedback from the other end (i.e.
> INVALID_SPI notifications).
> 
> The timeouts used in different situations, even in the same
> implementation, need to be different. In our case, when you enable
> MOBIKE the number of retries used is more than 2 times what it is if
> you do not turn MOBIKE on.
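
To illustrate how much the retry count alone changes the time before an exchange is abandoned, here is a sketch of a retransmission schedule. Exponential backoff, the 1-second first timeout, the 60-second cap, and the retry counts 5 and 11 are all assumptions made for the example; they are not taken from the message above or from any specific implementation.

```python
def retransmit_schedule(retries, first_timeout=1.0, backoff=2.0, cap=60.0):
    """Return the list of per-attempt timeouts in seconds.

    Each timeout is the previous one multiplied by `backoff`,
    limited to `cap`. All defaults are hypothetical.
    """
    timeouts = []
    t = first_timeout
    for _ in range(retries):
        timeouts.append(min(t, cap))
        t *= backoff
    return timeouts


def give_up_after(retries, **kwargs):
    """Total seconds before the exchange is declared failed."""
    return sum(retransmit_schedule(retries, **kwargs))
```

Under these assumptions, 5 retries give up after 31 seconds, while 11 retries (slightly more than twice as many) give up after 363 seconds. That gap is why shortening the schedule on feedback such as INVALID_SPI matters more when MOBIKE is enabled.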
> 
> Also, the last paragraph again assumes that the peer staying up didn't
> start a liveness check almost immediately when the crashing peer
> crashed. This is something that is already part of the standard IKEv2
> specification, so implementations need to do that. This means the
> timeout starts from the time of the crash, not from the time when the
> gateway is up again.
> 
> Anyway, as all this is standard IKEv2 already, it does not belong here
> in the alternative solutions section, but belongs in
> section 2.
> 
> --
> 
> Section 8 again ignores the IKEv2 text saying:
> 
>              "If there has only been outgoing traffic on all of
>   the SAs associated with an IKE SA, it is essential to confirm
>   liveness of the other endpoint to avoid black holes.  If no
>   cryptographically protected messages have been received on an IKE SA
>   or any of its Child SAs recently, the system needs to perform a
>   liveness check in order to prevent sending messages to a dead peer."
> 
> Especially the text:
> 
>                                  A failed gateway may go undetected
>   for as long as the lifetime of a child SA, because IPsec does not
>   have packet acknowledgement, and applications cannot signal the IPsec
>   layer that the tunnel "does not work".  
> 
> If the gateway has failed and there is ANY traffic on any of the
> IPsec SAs, then from the other peer's point of view
> there is only outgoing traffic, thus it needs to do a liveness check to
> verify that the other end is alive. Thus the failed gateway cannot
> really go undetected for as long as the lifetime of a child SA, unless
> the lifetime is on the order of a few minutes :-)
> 
> I know there are implementations that do not implement that part of the
> IKEv2 specification, but that does not mean that the part is not
> there. We should not write our specifications to cover broken
> implementations, but try to assume that implementations are following
> the IKEv2 specification.
> 
> Note that the IKEv2 text does not have any conditionals there, it says
> that "...the system needs to perform a liveness check...". It does not
> say it may, or even should do it, it says it needs to be done.
> 
> Also, I think the picture itself is a bit incorrect; the exchange after
> the reboot should probably be:
> 
>                ---- Reboot -----
> 
>       HDR, SK {}          -->
> 
>                        <--  HDR, N(INVALID_IKE_SPI), N(QCD_TOKEN)
> 
> 
> I.e. I assume the first packet is a normal liveness check, and the reply
> is a normal INVALID_IKE_SPI with QCD_TOKEN.
> 
> --
> 
> In section 9.1 it says that inter-domain VPN gateways should do both,
> but I think that inter-domain VPN gateways do not really need this
> specification at all, as by configuration they know the other end's
> IP addresses etc., thus when the inter-domain VPN gateway comes up, it
> can immediately create the IKE SAs needed based on the configuration.
> This is in the case where either inter-domain VPN gateway
> can act as an initiator, i.e. no EAP is used, and neither is behind a
> NAT.
> 
> If one of the inter-domain VPN gateways is behind a restricted NAT, then
> it is more or less similar to the remote-access client case (i.e. only
> that end can initiate connections), and as the other peer cannot
> initiate connections to the gateway behind the NAT, there is no point in
> supporting token taker on that end.
> -- 
> kivi...@iki.fi
> _______________________________________________
> IPsec mailing list
> IPsec@ietf.org
> https://www.ietf.org/mailman/listinfo/ipsec

