Interestingly, your point about an overloaded network segment dropping packets is a likely candidate: the network is certainly more heavily utilised at that site than at any other. As we do not control any of the platform configurations, network schematics, etc., it is beyond our scope to do anything about it.

I am adding extra trace to four additional events, logging the socket states and error codes in the OnSessionAvailable, OnDebugAvailable, OnClientCreate and OnSessionClosed events. Previously we did not connect any handlers to these events, so we were unable to log the state.

Analysis by the prime contractor at the customer's site appeared to suggest that an FD_ACCEPT message was being processed, but with an error code: they reported that, according to a network analysis trace, the socket initialization was started correctly but the log entries written in OnClientConnect did not appear.

I looked through the source and can see in the TriggerSessionAvailable handler the line

    If Error <> 0 then Exit;

This is done before the construction of a 'client socket' object with which to handle the connection, and also before the point where the OnClientConnect method is called. So I am guessing that an error is being passed in the LParam of the message. Hopefully, by attaching the OnSessionAvailable event, we might be able to capture what this error is and then be able to understand why this site has a particular problem.

If and when I receive these additional logs I will post any conclusions here.

Best regards,
Damien.

-----Original Message-----
From: twsocket-boun...@elists.org [mailto:twsocket-boun...@elists.org] On Behalf Of Francois PIETTE
Sent: Tuesday, August 03, 2010 5:19 PM
To: ICS support mailing
Subject: Re: [twsocket] TWSocketServer OnConnection event

> I have been asked to investigate a strange issue we are encountering at a
> customer site in Mexico. I am a contractor for a company which supplied
> surveillance and monitoring software based on the ICS component set. The
> software runs fine on other sites with no problems encountered for over 8
> months but on the site in Mexico after a matter of hours or days the
> software (and or server) crashes.
> The servers are all identical HP Blade servers running Windows Server 2003
> vanilla installs.
> This is true of sites that are functioning and the ones in
> Mexico that are not.

If the software runs fine on several identical systems and fails on a single system, I would concentrate on what makes the failing site different, because it has to be different. First, check the service pack level. Next, I suggest verifying that no malware is intercepting winsock calls; malware does this to capture traffic. Then I would check whether any suspect LSP is installed on the system. Also check whether any security products are interfering with winsock: they frequently intercept winsock calls to block some kinds of traffic, and those security products could be buggy.

> My analysis of the problem to date suggests that an OnClientConnect is
> firing but the passed Client object is incomplete or invalid. The code for
> the OnClientConnect event does not check the ErrorCode and accepts the
> connection but traffic appears not to flow correctly between client and
> server.

I suggest checking the error code and reporting it in the log file for analysis.

> if I run
> NetStat on the server it appears a windows socket object is left in
> FIN-WAIT 1 or FIN-WAIT2 state. Eventually the system fails as all windows socket
> objects are expended and there is a catastrophic failure of the software
> and/or server.
> the steps that should be taken when an error does occur to ensure that
> the windows sockets are correctly 'cleaned
> up' and released back to the Operating System ?

FIN-WAIT-1 and FIN-WAIT-2 mean that the orderly shutdown sequence is occurring but the remote side does not answer (have a look here: http://www.tcpipguide.com/free/t_TCPConnectionTermination-2.htm). An orderly shutdown is a multi-step sequence between client and server. What is strange here is that FIN-WAIT-1 and FIN-WAIT-2 are client-side states, not server-side. So it is possible that the sockets you see in that state are NOT the ones failing.

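
One way to check that hypothesis could be to save the output of `netstat -an` on the affected server and separate the FIN-WAIT sockets whose local port is the server's listening port (accepted connections) from all the others (outgoing, client-side connections). The following is only a minimal sketch: the sample output, the addresses and the listening port 5000 are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical sample of `netstat -an` output; all addresses and ports are invented.
SAMPLE = """\
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:5000           0.0.0.0:0              LISTENING
  TCP    10.0.0.5:5000          10.0.0.21:1045         ESTABLISHED
  TCP    10.0.0.5:1301          10.0.0.30:8080         FIN_WAIT_1
  TCP    10.0.0.5:1302          10.0.0.30:8080         FIN_WAIT_2
  TCP    10.0.0.5:1303          10.0.0.30:8080         FIN_WAIT_2
  UDP    0.0.0.0:161            *:*
"""


def count_tcp_states(netstat_text):
    """Count TCP rows per state (LISTENING, ESTABLISHED, FIN_WAIT_1, ...)."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows have four columns: protocol, local, foreign, state.
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            counts[fields[3]] += 1
    return counts


def split_by_listen_port(netstat_text, listen_port,
                         states=("FIN_WAIT_1", "FIN_WAIT_2")):
    """Split sockets in the given states into (accepted, outgoing).

    A connection accepted by the server keeps the listening port as its
    local port; any other local port is treated here as an outgoing one.
    """
    accepted, outgoing = [], []
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        if fields[3] not in states:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1])
        entry = fields[1] + " -> " + fields[2]
        if local_port == listen_port:
            accepted.append(entry)
        else:
            outgoing.append(entry)
    return accepted, outgoing


if __name__ == "__main__":
    print(dict(count_tcp_states(SAMPLE)))
    accepted, outgoing = split_by_listen_port(SAMPLE, listen_port=5000)
    print("accepted-side FIN-WAIT sockets:", accepted)
    print("outgoing FIN-WAIT sockets:", outgoing)
```

If most of the FIN-WAIT entries turn out to have a local port other than the listening port, that would be consistent with the stuck sockets being outgoing connections rather than the ones accepted by the server.
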
Maybe something else is failing (maybe in the same software), causing those sockets to be in those states and consume all available sockets, which causes trouble for the software when accepting a new connection, because accepting a new connection means creating a new socket.

So I see the possibility that some other software, or another part of your software, has an issue with /client/ connection close. This results in a lot of sockets in the FIN-WAIT-1 or FIN-WAIT-2 state, consuming all available sockets and making new connection acceptance fail.

Why could those client connections have problems with their server not answering? This could be caused by malware sending forged IP packets to break existing connections, by a misconfigured security product (firewall) dropping packets, or simply by an overloaded network segment that is dropping packets because traffic is too high. An overloaded layer 2 switch may simply drop packets when it is not able to switch them fast enough.

--
francois.pie...@overbyte.be
The author of the freeware multi-tier middleware MidWare
The author of the freeware Internet Component Suite (ICS)
http://www.overbyte.be
--
To unsubscribe or change your settings for TWSocket mailing list
please goto http://lists.elists.org/cgi-bin/mailman/listinfo/twsocket
Visit our website at http://www.overbyte.be