Over the previous year, I was the Project Leader for a Windows Vista implementation of the RSA SecurID OTP EAP methods (15 & 32 - Protected OTP). During this project we had a number of setbacks, and I gained new insight into where some long-standing problems with our user-interactive EAP methods actually come from.

The primary goal of this project was to port our XP client to Vista and use the new Windows EAP API, EAPHost. The first problem we found is that Vista's EAPHost doesn't support RAS connections (PPP and VPN), only 802.1X connections. This has been "fixed" in Windows 7. However, we needed to support those connection methods, so we shelved the EAPHost version in favor of upgrading the existing RasEap API implementation (which all Microsoft methods use anyway).

A major problem we have run into with SecurID EAP is timeouts by access points during the authentication process. As I've discussed before (http://www.ietf.org/proceedings/66/slides/emu-4/sld1.htm), some access point implementations have a very short timer on EAP authentication turnaround. A typical SecurID token generates a new One Time Password (OTP) value once every minute. A row of indicators shows where the token is in the cycle, and some users prefer to wait for a new token code instead of starting to type one and having it change before they are done. Furthermore, if the user's token is more than 3 minutes out of sync with the server, they may also have to wait for a second code and enter it. This leads to user input times that are easily on the order of a minute, and possibly 3 minutes in total. Independently, the system admin may also force a PIN change (similar to a password change), and this process needs PIN input as well as another token value before authentication is granted.

Given our experience, both of our EAP methods have designed-in protocol messages to generate EAP keep-alive traffic while the user is interacting with their token.

Our next major problem was that Windows Vista changed the way that EAP user interfaces were supported, and our existing implementation was utterly broken by the change. In order to both display a dialog to the user and send network keep-alive messages, we had to get two things going at once.
This is not trivially possible inside the Windows EAP environment, as most of the code is of a single-threaded, callback-and-return nature. Our XP client would create an extra thread in the EAP UI environment and support the user dialog from there. When a keep-alive message was needed, it would return to the network code to send it. When the response came back, the network code would call back into the UI host code, which was still running in the same multithreaded server process, and re-establish a shared context. Vista broke this by moving the UI host environment to a single-threaded COM apartment and terminating the entire process when it returns status to the network component.

I first tried to do an end run around the Windows EAP UI and invoke my own UI process. That worked just fine until I let 60 seconds go by. Then the session was shut down by the Windows 802.1X state machine. This is not even an inactivity timer: it would kill the session after one minute even if it had sent a response to the server just seconds before. Further tracing and an open support ticket revealed that the Windows 802.1X state machine would not let an authentication stay open for longer than 1 minute unless it had a UI active. With a UI active, it would allow 5 such periods. I asked, but was not offered any interfaces that could manipulate this timer or alter its sense of the UI state.

Microsoft's recommendation was to rebuild our XP solution, but using external processes instead of multi-threading their new UI infrastructure. I ended up implementing a separate UI service process that ran the input dialogs. This process was started from the EAP UI context when an authentication began, and would be contacted on every keep-alive cycle until the user input was complete. Debugging the interprocess communication and state machines took a little bit more time.

While I was trying to understand the issues involved in the client, I started experimenting with alternate approaches in the server.
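
As an aside on the client side before turning to the server: the arrangement described above (the input dialog runs somewhere else, and the EAP method checks on it once per keep-alive cycle) can be sketched roughly as follows. This is only an illustration in Python, not our code. A thread stands in for the separate UI process, and the names ui_worker and authenticate, the passcode value, and the timings are all invented for the example.

```python
import queue
import threading
import time


def ui_worker(result_q, typing_time_s):
    """Stand-in for the separate UI process (hypothetical).

    Pretends the user needs typing_time_s seconds to read the token
    and type a passcode, then hands the result back.
    """
    time.sleep(typing_time_s)
    result_q.put("123456")  # made-up passcode, for illustration only


def authenticate(typing_time_s, keepalive_interval_s):
    """Wait for user input, counting one keep-alive per idle interval.

    Returns (passcode, keepalives_sent). A real EAP method would send a
    keep-alive protocol message where the counter is incremented.
    """
    result_q = queue.Queue()
    worker = threading.Thread(
        target=ui_worker, args=(result_q, typing_time_s), daemon=True
    )
    worker.start()

    keepalives_sent = 0
    while True:
        try:
            # Block for at most one keep-alive interval waiting for input.
            passcode = result_q.get(timeout=keepalive_interval_s)
        except queue.Empty:
            # No input yet: this is where a keep-alive would go out.
            keepalives_sent += 1
            continue
        worker.join()
        return passcode, keepalives_sent


if __name__ == "__main__":
    # Scaled-down timings so the example finishes quickly.
    print(authenticate(typing_time_s=0.35, keepalive_interval_s=0.1))
```

The point of the structure is that the waiting loop never blocks for longer than one keep-alive interval, so the network side always gets a chance to send something, however long the user takes.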
The RasEap API offers two transmit message calls to the EAP server (sorry - Authenticator): EAPACTION_SendWithTimeoutInteractive, which was what the code was using, and EAPACTION_SendWithTimeout, which might have offered a way for the server to catch a timeout and act on it. To get more information, I sent Microsoft a support request asking them to explain the differences between the two. The answer I got was that EAPACTION_SendWithTimeoutInteractive had a 30 second timeout, and EAPACTION_SendWithTimeout had a 60 second timeout. What I was looking for was any way the server could catch the timeout and resend the previous message. I noticed that the code to do that was there, but the event never happened. So I tried SendWithTimeout to see if an extra 30 seconds would get me some more breathing room. Instead, the authentication failed immediately.

I finally busted out Kismet and Wireshark to find out WTF was going on. I wanted to know how an internal EAP server call was affecting the client EAPOL AP authentication in the air. I discovered that the Windows IAS server was using the API action code to set a Session-Timeout value in the RADIUS message carrying the EAP request message. For SendWithTimeoutInteractive it was setting the value to 30. For SendWithTimeout it was setting it to 6, not 60 as I had been told.

Furthermore, the access point I use, the Cisco Aironet 1230 [v12.3(8)JEC2], would take those timeout values and treat them as a session expiration timer instead of a retransmit timer. If a message wasn't received in that period, it would abort the authentication. I do not think this is what is intended by RFC 3580, sect 3.17, page 11:

   When sent in an Access-Challenge, this attribute represents the maximum number of seconds that an IEEE 802.1X Authenticator should wait for an EAP-Response before retransmitting.

An explanation of the 1200 AP's behavior was difficult to find.
At one time I found a Meetinghouse website that mentioned that you should increase the "Session timeout" for interactive protocols, but that has since disappeared. On my AP's management GUI, this particular parameter does not show up on that page. I was finally able to find in a Cisco command reference that recent versions of IOS let you extend the default timeout to a maximum of 120 seconds and (on some platforms) ignore the RADIUS Session-Timeout value:

   dot1x timeout supp-response 120 local

This command doesn't seem to have a GUI equivalent on my AP. 120 seconds, 2 minutes, is not as much as I would like, but it applies only to one EAP request/response exchange. Normally a user won't take that long for one step of the authentication. I still don't know why the AP doesn't retransmit, or whether it can at all. The 802.1X specs are extremely vague as to what the values in this area should be, but it seems that retransmission would be highly desirable for better reliability while authenticating.

Eventually we were able to put the whole thing together. There are Release Notes that discuss the issues and the parameters that allow you to tune it as desired. Our new EAP UI process can ride out the keep-alive cycles in its own execution context, and it tries to keep the MS dialog transitions happy by generating clicks on the annoying “a response is required” dialogs or balloons. An extra feature is a short-term username/identity cache to allow a quick and silent IdentityResponse if the first didn't get the session started.

I am now testing this on Windows 7, and have found only one EAP interface problem to date.

Design/Architecture problems I see:

- Windows 802.1X has a fixed timeout (5x60s); there really should be some sort of override on a per-method basis somewhere. A one-size-fits-all timer seems rather presumptuous.
- The timeout should be sensitive to activity (or progress), not just time or UI state. Our worst case authentication could have both a PIN change and a token resync, which would require 3 token codes (3 minutes), not including user typing and reaction time.
- The Windows EAP infrastructure doesn't easily support methods that need to do two things at once. It would be useful to queue events or wake up threads based on timers.
- Access Points (Cisco in this case) should not be setting short timers for user-interactive methods. Human input really requires timeouts on the order of several minutes.
- Why is the AP killing sessions instead of retransmitting?
- Does any AP retransmit? Why not? It's in the 802.1X state machine. The MS PPTP tunnel server does it frequently and my code tolerates it well.
- Note that given the AP behavior, if the RADIUS server gave the EAP method authenticator a timeout event (which I've yet to observe) for a potential retransmission, the AP would not accept the message, because it's using the value for a different purpose.

Another potential way to solve the keep-alive problem would have been for the client to use a retransmitted request (and the network context thread that receives and processes it) as a way to send a keep-alive message. But given the AP and RADIUS behavior, I was not able to follow that path.

Dave Mitton,
RSA Security, Division of EMC
Bedford, MA

PS: I forgot about the EAP Notification message bug in Vista. If you use the RasEap API method, wireless EAP Notification messages you might receive are discarded silently and without a response to the authenticator. (For some reason, RAS EAP connections actually give you the message and let you respond.) If you have a server that sends Notification messages, your protocol will break at this point. On XP, even if the wireless method didn't receive the message, the underlying code would send a response. There is now a patch for this: KB967802. Evidently EAPHost will let you get your message.
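
To put rough numbers on the timing points in the design list above, here is a small back-of-the-envelope model in Python. It uses only figures quoted in this message (one token code per minute, the 60 s and 5x60 s Windows limits, the 30 and 6 second Session-Timeout values). The function names and the assumption that keep-alives arrive at a fixed interval are mine, purely for illustration.

```python
TOKEN_PERIOD_S = 60            # a new token code once every minute
WINDOWS_LIMIT_NO_UI_S = 60     # observed 802.1X limit with no UI active
WINDOWS_LIMIT_UI_S = 5 * 60    # observed 802.1X limit with a UI active


def worst_case_wait_s(pin_change, resync):
    """Seconds spent just waiting for token codes (typing not included)."""
    codes = 1 + (1 if pin_change else 0) + (1 if resync else 0)
    return codes * TOKEN_PERIOD_S


def fits_windows_limit(total_s, ui_active):
    """True if the whole authentication ends before Windows kills it."""
    limit = WINDOWS_LIMIT_UI_S if ui_active else WINDOWS_LIMIT_NO_UI_S
    return total_s < limit


def step_survives(user_delay_s, ap_timeout_s, keepalive_every_s=None):
    """Model an AP that aborts (rather than retransmits) on timeout.

    The step survives only if the longest silence between a request and
    the next response stays under the AP's timer. Keep-alives, if any,
    cap that silence at the keep-alive interval (assumed fixed).
    """
    if keepalive_every_s is None:
        longest_silence_s = user_delay_s
    else:
        longest_silence_s = min(user_delay_s, keepalive_every_s)
    return longest_silence_s < ap_timeout_s


if __name__ == "__main__":
    worst = worst_case_wait_s(pin_change=True, resync=True)
    print(worst)                                        # 180
    print(fits_windows_limit(worst, ui_active=True))    # True
    print(fits_windows_limit(worst, ui_active=False))   # False
    print(step_survives(45, ap_timeout_s=30))           # False
    print(step_survives(45, 30, keepalive_every_s=20))  # True
    print(step_survives(45, 6, keepalive_every_s=20))   # False
```

Under this model, a Session-Timeout of 30 is survivable only with keep-alives, and a value of 6 is not survivable at any keep-alive interval longer than that, which matches the immediate failure I saw with SendWithTimeout.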
_______________________________________________
Emu mailing list
Emu@ietf.org
https://www.ietf.org/mailman/listinfo/emu