[IPsec] Comments to draft-ietf-ipsecme-failure-detection-00

Tero Kivinen Wed, 08 Sep 2010 03:07:20 -0700

Section 2, which describes RFC 4306 crash recovery, is not complete. It
does not include the normal processing happening on the peer that is
not rebooting.


I suggest adding the following text:
----------------------------------------------------------------------
When one peer loses state or reboots, it might not be able to
recover immediately (especially in the case of a reboot). This means
that at first the peer just goes silent, i.e. does not send or respond
to any messages. A conforming IKEv2 implementation will detect this
situation and follow the rules given in section 2.4:

              "If there has only been outgoing traffic on all of
   the SAs associated with an IKE SA, it is essential to confirm
   liveness of the other endpoint to avoid black holes.  If no
   cryptographically protected messages have been received on an IKE SA
   or any of its Child SAs recently, the system needs to perform a
   liveness check in order to prevent sending messages to a dead peer."

I.e. the peer will usually start liveness checks even before the other
end sends an INVALID_SPI notification, as it has detected that the
other end is no longer sending any packets while it is still rebooting
or recovering from the situation.

This means that the recovery period of several minutes overlaps the
actual recovery time of the other peer, i.e. if the security gateway
requires several minutes to boot up after the crash, then the other
peers have already finished their liveness checks before the crashed
peer even has a chance to send INVALID_SPI notifications.

There are cases where the peer loses state and is able to recover
immediately; in those cases it might take several minutes to recover.

Note that the IKEv2 specification specifically leaves the number of
retries and the lengths of timeouts unspecified, as they do
not affect interoperability. This means that implementations are
allowed to treat INVALID_SPI messages as hints
that shorten those timeouts (i.e. different environments and
situations require different rules).

Good existing IKEv2 implementations already do that (i.e. shorten
timeouts or limit the number of retries) based on such hints, and
also start liveness checks quickly after the other end goes silent.
----------------------------------------------------------------------

The final paragraph saying:

   Those "at least several minutes" are a time during which both peers
   are active, but IPsec cannot be used.

is incorrect, as it is only true when the crashed peer recovered
instantaneously. In the normal case most of that time actually
overlaps the recovery time of the peer.
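
As a rough worked illustration of that overlap (the function name and
all numbers below are assumptions made up for this example, they are
not taken from the draft or from RFC 4306): if the surviving peer
starts its liveness check at about the time of the crash, the time
during which both peers are up but IPsec is unusable is only the part
of the detection time that is left over after the reboot has finished.

```python
def unusable_time_with_both_up(reboot_seconds, detection_seconds):
    """Seconds during which both peers are up but IPsec is still unusable.

    Assumption: the surviving peer starts its liveness check at (about)
    the moment of the crash and needs detection_seconds to give up on the
    old IKE SA, while the crashed peer needs reboot_seconds to come back.
    """
    return max(0, detection_seconds - reboot_seconds)


# Hypothetical numbers: the liveness check gives up after 3 minutes.
for reboot in (0, 120, 300):
    print(reboot, unusable_time_with_both_up(reboot, 180))
```

Only the instantaneous-recovery case (reboot time 0) gives the full
detection time as dead time; a reboot longer than the detection time
gives none.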

--

The protocol currently says that:

   Supporting implementations will send a notification, called a "QCD
   token", as described in Section 4.1 in the last IKE_AUTH exchange
   messages.  These are the final IKE_AUTH request and final IKE_AUTH
   response that contain the AUTH payloads.

This is very different from all other processing; usually payloads of
this kind are put in the same packet that contains the traffic
selectors etc. Is there some reason why it is done this way?

--

Also, do we really need the QCD token for the initiator too? The
initiator has already proven to be able to create the IKE SA on its
own, and it will have enough information to recreate the IKE SA after
the boot. The responder usually does not have enough information to be
able to recreate the IKE SA on its own after a reboot, as it might not,
for example, know anymore which peer address the IKE SA
was connected to when all it has is an IP packet it needs to forward to
that peer. The initiator must already have that information, as it was
able to trigger IPsec SA creation in the first place based on the IP
packet.

I think it would simplify the implementations and the protocol to
specify that only responders can be token makers, without losing any
of the functionality.

--

Section 7.4 is mostly wrong.

The default retransmission policy needed for MOBIKE cases is much,
much longer than what is needed in the normal case. When MOBIKE switches
from one interface to another there might be very long delays
because of this (for example, the device first needs to notice that
the old interface does not work anymore, and then perhaps it needs to
run DHCP and other link-related protocols on the new interface before
it can even try it, and all of those take a long time).

For example, in our implementation MOBIKE uses MUCH longer timeouts
just to make sure we do not time out the IKE exchanges while we are
trying to go through all possible interfaces etc. Because of those
even longer timeouts there is very good reason to shorten them
if we get any feedback from the other end (i.e.
INVALID_SPI notifications).

The timeouts used in different situations, even in the same
implementation, need to be different. In our case, when you enable
MOBIKE the number of retries used is more than 2 times what it is if
you do not turn MOBIKE on.
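
A minimal sketch of what such situation-dependent retransmission
policies could look like. All names and numbers here are hypothetical
(IKEv2 does not fix retry counts or timeouts, and this is not our
actual implementation); the point is only that the INVALID_SPI hint
selects a shorter policy instead of tearing down the SA by itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetransmitPolicy:
    max_retries: int
    initial_timeout: float  # seconds before the first retransmission
    backoff: float = 2.0    # multiplier applied after each attempt

    def schedule(self):
        """Return the per-attempt timeouts in seconds."""
        return [self.initial_timeout * self.backoff ** i
                for i in range(self.max_retries)]


# Hypothetical values, chosen only for illustration.
NORMAL = RetransmitPolicy(max_retries=5, initial_timeout=1.0)
MOBIKE = RetransmitPolicy(max_retries=11, initial_timeout=1.0)  # > 2x retries
AFTER_HINT = RetransmitPolicy(max_retries=3, initial_timeout=1.0)


def pick_policy(mobike_enabled, invalid_spi_seen):
    """Choose a policy; an INVALID_SPI hint only shortens the wait."""
    if invalid_spi_seen:
        return AFTER_HINT
    return MOBIKE if mobike_enabled else NORMAL


print(sum(pick_policy(True, False).schedule()))  # total wait, no hint
print(sum(pick_policy(True, True).schedule()))   # total wait after a hint
```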

Also, the last paragraph again assumes that the peer staying up didn't
start a liveness check almost immediately when the crashing peer
crashed. This is something that is already part of the standard IKEv2
specification, so implementations need to do that. This means the
timeout starts from the time of the crash, not from the time when the
gateway is up again.

Anyway, as all this is standard IKEv2 already, it does not belong here
in the alternative solutions section, but in section 2.

--

Section 8 again ignores the IKEv2 text saying:

              "If there has only been outgoing traffic on all of
   the SAs associated with an IKE SA, it is essential to confirm
   liveness of the other endpoint to avoid black holes.  If no
   cryptographically protected messages have been received on an IKE SA
   or any of its Child SAs recently, the system needs to perform a
   liveness check in order to prevent sending messages to a dead peer."

Especially the text:

                                  A failed gateway may go undetected
   for as long as the lifetime of a child SA, because IPsec does not
   have packet acknowledgement, and applications cannot signal the IPsec
   layer that the tunnel "does not work".  

If the gateway has failed and there is ANY traffic on any of the
IPsec SAs, then from the other peer's point of view
there is only outgoing traffic, thus it needs to do a liveness check to
verify that the other end is alive. Thus the failed gateway cannot
really go undetected for as long as the lifetime of a child SA, unless
the lifetime is on the order of a few minutes :-)
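
The rule quoted above can be sketched roughly as follows. This is only
an illustration: the class and method names are made up, and the
30 second threshold is an assumption, since the specification says
"recently" without giving a number.

```python
import time


class IkeSaActivity:
    """Tracks traffic on an IKE SA and all of its Child SAs."""

    def __init__(self, idle_threshold=30.0, now=time.monotonic):
        self.idle_threshold = idle_threshold  # hypothetical "recently"
        self._now = now
        self.last_protected_rx = now()  # SA creation counts as a receive
        self.last_tx = None

    def on_protected_receive(self):
        # Any cryptographically protected message on the IKE SA or a Child SA.
        self.last_protected_rx = self._now()

    def on_send(self):
        self.last_tx = self._now()

    def needs_liveness_check(self):
        # Only outgoing traffic: we sent after the last protected receive,
        # and nothing protected has arrived for idle_threshold seconds.
        silent = self._now() - self.last_protected_rx >= self.idle_threshold
        sending = (self.last_tx is not None
                   and self.last_tx > self.last_protected_rx)
        return silent and sending


# Usage with a fake clock so the example does not have to sleep.
clock = [0.0]
sa = IkeSaActivity(now=lambda: clock[0])
clock[0] = 10.0
sa.on_send()
clock[0] = 40.0
print(sa.needs_liveness_check())
```

With any outgoing traffic at all, the check fires as soon as the
threshold passes, which is far sooner than a typical child SA
lifetime.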

I know there are implementations that do not implement that part of the
IKEv2 specification, but that does not mean that the part is not
there. We should not write our specifications to cover broken
implementations, but try to assume that implementations are following
the IKEv2 specification.

Note that the IKEv2 text does not have any conditionals there, it says
that "...the system needs to perform a liveness check...". It does not
say it may, or even should do it, it says it needs to be done.

Also, I think the picture itself is a bit incorrect; the exchange after
the reboot should probably be:

                ---- Reboot -----

       HDR, SK {}          -->

                        <--  HDR, N(INVALID_IKE_SPI), N(QCD_TOKEN)


I.e. I assume the first packet is a normal liveness check, and the
reply is a normal INVALID_IKE_SPI with QCD_TOKEN.

--

Section 9.1 says that inter-domain VPN gateways should do both,
but I think that inter-domain VPN gateways do not really need this
specification at all, as they know the other end's IP addresses etc.
from configuration; thus when the inter-domain VPN gateway comes up, it
can immediately create the needed IKE SAs based on the configuration.
This is the case where either of the inter-domain VPN gateways
can act as an initiator, i.e. no EAP is used, and neither is behind
NAT.

If one of the inter-domain VPN gateways is behind a restricted NAT, then
it is more or less similar to the remote-access client case (i.e. only
that end can initiate connections), and as the other peer cannot
initiate connections to the gateway behind NAT, there is no point in
supporting token taker on that end.
-- 
kivi...@iki.fi