RE: [SA-list] First line of check fails after some modifications

Dirk Mon, 09 Jun 2008 10:01:57 -0700

Yes that's what I'm saying.
On a system where the handle leak can be spotted, it might be a good idea to try
to determine which check type is causing that leak (on that system). This can be
done by running only one type of check for a while; if no leak is spotted, add
checks of another type, and so on.
Dirk Bulinckx.
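
A minimal Python sketch of how each isolation run described above might be
judged. It assumes handle counts are noted by hand (for example from Task
Manager) while only one check type is enabled, fits a straight line through
them, and reports the growth per hour. The function names and the threshold of
50 handles per hour are illustrative assumptions, not Servers Alive settings.

```python
from statistics import mean


def handle_growth_per_hour(samples):
    """Least-squares slope of handle count over time.

    samples: list of (hours_since_start, handle_count) tuples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    hours = [t for t, _ in samples]
    counts = [h for _, h in samples]
    t_bar, h_bar = mean(hours), mean(counts)
    spread = sum((t - t_bar) ** 2 for t in hours)
    if spread == 0:
        raise ValueError("samples must be taken at different times")
    covariance = sum((t - t_bar) * (h - h_bar) for t, h in zip(hours, counts))
    return covariance / spread


def looks_like_leak(samples, threshold_per_hour=50.0):
    """True if the handle count grows faster than the (assumed) threshold."""
    return handle_growth_per_hour(samples) > threshold_per_hour
```

A run in which the count stays roughly flat suggests the enabled check type is
not the source; the next check type can then be added and the measurement
repeated.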


-----Original Message-----
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of
GLENN GASPAR
Sent: Monday, June 09, 2008 6:40 PM
To: Servers Alive Discussion List
Subject: RE: [SA-list] First line of check fails after some modifications

Dirk,

     Are you saying that 2226 should not have any handle leaks?

     Also, what do you mean by "Can you bring that back to a certain check
type?"

Thanks,
Glenn

>>> "Dirk" <[EMAIL PROTECTED]> 6/9/2008 11:15 AM >>>
With the latest builds (those after 2212) we're no longer able to get that
handle leak.
Can you bring that back to a certain check type?

Dirk Bulinckx. 

-----Original Message-----
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of
GLENN GASPAR
Sent: Monday, June 09, 2008 6:05 PM
To: Servers Alive Discussion List
Subject: RE: [SA-list] First line of check fails after some modifications

Dirk,

     We've done a full uninstall of SA and reinstalled 6.1.2226. That appeared
to solve the problems we were having with the ping checks failing.

     However, after a few problem-free days, the log file stopped being updated
(the SA service is still running but it is "Not Responding"). The number of
handles has grown to 8,000+. This is the same type of problem we had a few
months ago.

     Looking at previous correspondence, this seems to be a problem that other
people have encountered. A couple of people said that they have written scripts
to restart SA when the log files have not been updated for a few hours.

Glenn
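
A minimal Python sketch of the kind of watchdog script mentioned above, under
stated assumptions: the log path, the age limit, and the service name are all
placeholders, and the actual scripts written by other list members are not
shown in this thread. It only checks how long ago the log file was last
written and, if that is too long, restarts a Windows service with the standard
`net stop` / `net start` commands.

```python
import os
import subprocess
import time


def log_is_stale(log_path, max_age_seconds, now=None):
    """True if the log file was last modified more than max_age_seconds ago."""
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(log_path)) > max_age_seconds


def restart_service(service_name):
    """Stop and start a Windows service; service_name is a placeholder."""
    # check=False: a hung service may refuse to stop cleanly.
    subprocess.run(["net", "stop", service_name], check=False)
    subprocess.run(["net", "start", service_name], check=False)


def watchdog(log_path, service_name, max_age_seconds=3 * 3600):
    """Restart the service when the log has been silent for too long."""
    if log_is_stale(log_path, max_age_seconds):
        restart_service(service_name)
        return True
    return False
```

Something like this would be run periodically from the Windows Task Scheduler.
A process that is "Not Responding" may not react to `net stop`, so a real
script might need a harder fallback.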

>>> "Dirk" <[EMAIL PROTECTED]> 5/30/2008 12:20 AM >>>
Running SA without the threading is not supported.
The logging won't show a reason for a slow down.
Are you using a lot of URL checks?


Dirk Bulinckx. 

-----Original Message-----
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of
GLENN GASPAR
Sent: Friday, May 30, 2008 12:40 AM
To: Servers Alive Discussion List
Subject: Re: [SA-list] First line of check fails after some modifications

I could try. We've always had minimum logging turned on and it slowed down. I
just turned on maximum logging so we could get more info.

My boss wanted me to ask if there's a way to turn off multi-threading? 

I just checked Task Manager and the thread count seems to have doubled between
when it was running at normal speed and when it slowed down (~80 to 199). The
number of handles also grew significantly.

I can send you the dbgview log file if that will help.

Thanks,
Glenn

>>> "Dirk Bulinckx" <[EMAIL PROTECTED]> 5/29/2008 5:00 PM >>>
if you disable file based logging does it then slow down too?


Dirk Bulinckx
----- Original Message ----- 
From: "GLENN GASPAR" <[EMAIL PROTECTED]>
To: "Servers Alive Discussion List" <[email protected]>
Sent: Thursday, May 29, 2008 11:50 PM
Subject: Re: [SA-list] First line of check fails after some modifications


We don't have db logging on; it's not available.

Glenn

>>> "Dirk Bulinckx" <[EMAIL PROTECTED]> 5/29/2008 4:35 PM >>>
So it's still doing the pings and logging them to dbgview.
Are you also using db logging? If so, could you check whether the db is almost
full? (A typical problem with Access or SQL Server Express.)



Dirk Bulinckx
----- Original Message ----- 
From: "GLENN GASPAR" <[EMAIL PROTECTED]>
To: "Servers Alive Discussion List" <[email protected]>
Sent: Thursday, May 29, 2008 11:20 PM
Subject: RE: [SA-list] First line of check fails after some modifications


We restarted Servers Alive and started dbgview a few hours ago. It started
to slow down about half an hour ago. Here's an excerpt of the
dbgview output:

[2872] Created new instance of wsPingThr!
[2872] thPingStart - exit loop forced after more then 30 seconds (16:07:13)
[2872] SA VB ping event START for www.ggc.com 316- 403
[2872] DEBUG: thpingstart pingEVENT 316-www.ggc.com
[2872] SA VB entry_thr_check_post start  316- 403
[2872] SA DEBUG start of alerting engine  316- 403
[2872] SA DEBUG end AlertEngines_DoAlerts: none defined for this entry  316- 403
[2872] SA DEBUG end of alerting engine  316- 403
[2872] SA DEBUG adapt icon  316- 403
[2872] SA DEBUG end entry_thr_check_post  316- 403
[2872] SA VB ping event END for www.ggc.com 316- 403


>>> "Dirk" <[EMAIL PROTECTED]> 5/29/2008 10:50 AM >>>
Not sure what I should see in this log....
If you run dbgview next to SA
(http://beta.woodstone.nu/soft/temp/debugview.zip), does this still show the
ping loggings?


Dirk Bulinckx.

-----Original Message-----
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of
GLENN GASPAR
Sent: Thursday, May 29, 2008 5:40 PM
To: Servers Alive Discussion List
Subject: RE: [SA-list] First line of check fails after some modifications

Dirk,

     We turned on full logging and discovered that PING DEBUG stopped appearing
in the log file right about the time the check cycle started slowing down
considerably. Below is an excerpt of the log. (I can send you more of it if
necessary.) Any ideas?

Thanks,
Glenn


Wednesday, May 28, 2008 10:24:19 PM NETWARE took 2469 ms(110)
Wednesday, May 28, 2008 10:24:19 PM Westlake Router 72.1
Wednesday, May 28, 2008 10:24:20 PM PING DEBUG: event start for ID= 0 pingID= 34
Wednesday, May 28, 2008 10:24:20 PM Westlake Router 72.1 OK with a successrate of 100% and an average roundtriptime of 104ms
Wednesday, May 28, 2008 10:24:20 PM PING DEBUG: REMOVE for ID= 0 pingID=  34
Wednesday, May 28, 2008 10:24:20 PM PING DEBUG: REMOVED for ID= 0 pingID= 34
Wednesday, May 28, 2008 10:24:20 PM Lake Charles VCM PIMS Server
Wednesday, May 28, 2008 10:24:20 PM Lake Charles chemstation server
Wednesday, May 28, 2008 10:24:20 PM Lake Charles Server WLADM1
Wednesday, May 28, 2008 10:24:21 PM PING DEBUG: event start for ID= 0 pingID= 73
Wednesday, May 28, 2008 10:24:21 PM Lake Charles VCM PIMS Server OK with a successrate of 100% and an average roundtriptime of 166ms
Wednesday, May 28, 2008 10:24:21 PM PING DEBUG: REMOVE for ID= 0 pingID=  73
Wednesday, May 28, 2008 10:24:21 PM PING DEBUG: REMOVED for ID= 0 pingID= 73
Wednesday, May 28, 2008 10:24:21 PM PING DEBUG: event start for ID= 1 pingID= 35
Wednesday, May 28, 2008 10:24:21 PM Lake Charles chemstation server OK with a successrate of 100% and an average roundtriptime of 170ms
Wednesday, May 28, 2008 10:24:21 PM PING DEBUG: REMOVE for ID= 1 pingID=  35
Wednesday, May 28, 2008 10:24:21 PM PING DEBUG: REMOVED for ID= 1 pingID= 35
Wednesday, May 28, 2008 10:24:35 PM NETWARE took 14391 ms(126)
Wednesday, May 28, 2008 10:24:35 PM Madison Router 100.1
Wednesday, May 28, 2008 10:25:06 PM Madison Router 100.1 OK with a successrate of 100% and an average roundtriptime of 35ms
Wednesday, May 28, 2008 10:25:06 PM Madison Server MDADM1
Wednesday, May 28, 2008 10:25:08 PM NETWARE took 2578 ms(106)
Wednesday, May 28, 2008 10:25:08 PM Madison Router 100.1 - 2nd check
Wednesday, May 28, 2008 10:25:39 PM Madison Router 100.1 - 2nd check OK with a successrate of 100% and an average roundtriptime of 35ms
Wednesday, May 28, 2008 10:25:39 PM Ross D/R Database Server
Wednesday, May 28, 2008 10:26:10 PM Ross D/R Database Server OK with a successrate of 100% and an average roundtriptime of 35ms
Wednesday, May 28, 2008 10:26:10 PM Ross D/R Application Server
Wednesday, May 28, 2008 10:26:41 PM Ross D/R Application Server OK with a successrate of 100% and an average roundtriptime of 36ms
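
The slowdown in the excerpt above can be made visible by measuring, per entry,
the time between the line that announces the check and the line that reports
its result. The Python sketch below is only an illustration: it assumes one log
entry per line (mail-wrapped lines rejoined), the English timestamp format seen
in the excerpt, and that a check's start line consists of the timestamp
followed by the entry name. The function name `check_durations` is made up for
this example.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the excerpt (English day/month names).
TS_FORMAT = "%A, %B %d, %Y %I:%M:%S %p"
LINE_RE = re.compile(
    r"^(\w+, \w+ \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) (.+)$"
)
RESULT_MARKERS = (" OK with a successrate", " failed due to a successrate")


def check_durations(lines):
    """Return (entry_name, seconds) for each check with a start and a result."""
    started = {}
    durations = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), TS_FORMAT)
        text = match.group(2)
        # Skip debug and timing lines; they are not check start/result lines.
        if text.startswith(("PING DEBUG", "NETWARE took")):
            continue
        for marker in RESULT_MARKERS:
            if marker in text:
                name = text.split(marker, 1)[0]
                if name in started:
                    elapsed = (stamp - started.pop(name)).total_seconds()
                    durations.append((name, elapsed))
                break
        else:
            # No result marker: treat the line as the start of a check.
            started[text] = stamp
    return durations
```

In the excerpt, the Westlake Router check finishes within about a second,
while each of the later checks (Madison Router, Ross D/R servers) takes 31
seconds from start line to result line, even though the reported average
roundtrip time is only around 35 ms.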


>>> "Dirk" <[EMAIL PROTECTED]> 5/23/2008 4:55 PM >>>
Without a sniffer trace of what is happening, it will be difficult to say for
sure whether there is a problem or not.

It's normal that if the remote systems don't respond to the ping, the
checking itself is slower.

With the 15s timeout, the check itself can take 15s, but if the remote system
does respond then the check may be done in 100ms.


Dirk Bulinckx.

-----Original Message-----
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of
GLENN GASPAR
Sent: Friday, May 23, 2008 11:35 PM
To: Servers Alive Discussion List
Subject: RE: [SA-list] First line of check fails after some modifications

Well, never mind... As soon as I logged out of the server that SA is running
on (I was logged in for about an hour) it started slowing down again.

Glenn

>>> "GLENN GASPAR" <[EMAIL PROTECTED]> 5/23/2008 4:25 PM >>>
Dirk,

     I don't have the tools/know-how to run a sniffer... However, I observed
something that is quite strange...
Before restarting SA about an hour ago, I turned off the primary & alternate
SMTP mail. I then restarted the SA service and for about an hour now there
hasn't been any slowdown. All the ping checks seem to be running at normal
speed.

Glenn

>>> "Dirk" <[EMAIL PROTECTED]> 5/23/2008 2:25 PM >>>
This means that the frames are sent and are not coming back within the given
timeout.

Example:
10 frames
15 seconds timeout
=> frame 1 is sent and we wait a max of 1.5 seconds; if we get a response back
from the pinged IP then we flag it as a GOOD frame, else as a BAD frame
=> frame 2 is sent ...
...
at the end we see how many GOOD frames we have and calculate the %
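
The example above can be sketched in a few lines of Python. This is only an
illustration of the arithmetic as described, not Servers Alive's actual code:
the function names are made up, and treating a reply that arrives exactly at
the per-frame limit as GOOD is an assumption.

```python
def per_frame_timeout(total_timeout_s, frame_count):
    """Split the check's total timeout evenly over its frames."""
    return total_timeout_s / frame_count


def success_rate(round_trip_times_s, total_timeout_s):
    """Percentage of GOOD frames.

    round_trip_times_s holds one entry per frame: the reply time in
    seconds, or None if no reply came back at all.
    """
    frame_count = len(round_trip_times_s)
    if frame_count == 0:
        raise ValueError("at least one frame is required")
    limit = per_frame_timeout(total_timeout_s, frame_count)
    good = sum(
        1 for rtt in round_trip_times_s if rtt is not None and rtt <= limit
    )
    return 100.0 * good / frame_count


def worst_case_duration(total_timeout_s, frame_count):
    """Time spent when every frame has to wait out its full limit."""
    return frame_count * per_frame_timeout(total_timeout_s, frame_count)
```

With 10 frames and a 15 second timeout, each frame gets 1.5 seconds, and a
check in which no frame is answered takes the full 15 seconds and ends with a
success rate of 0%.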


so for some reason your pinged hosts start to fail. If you get that 0%, try
running a sniffer (ethereal/wireshark/netmon/...) to see if the frames are sent
and if they come back too.


Dirk Bulinckx.

-----Original Message-----
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of
GLENN GASPAR
Sent: Friday, May 23, 2008 9:00 PM
To: Servers Alive Discussion List
Subject: RE: [SA-list] First line of check fails after some modifications

It is taking more time...

Here's an excerpt of the log:

Friday, May 23, 2008 1:13:55 PM Atlanta Router 20.1 failed due to a successrate of only 0%
Friday, May 23, 2008 1:13:55 PM Houston Router 36.1 failed due to a successrate of only 0%

Both of these routers have a timeout value of 15 seconds and have "second
knock" checked.

Most if not all of the entries in the log after I restarted the server are
saying "x failed due to a successrate of only 0%".

Glenn
>>> "Dirk" <[EMAIL PROTECTED]> 5/23/2008 1:36 PM >>>
it "seems" or it "is" taking more time?
The roundtrip time is in the GUI so you can see it by that value

Dirk Bulinckx.

-----Original Message-----
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of
GLENN GASPAR
Sent: Friday, May 23, 2008 8:30 PM
To: Servers Alive Discussion List
Subject: RE: [SA-list] First line of check fails after some modifications

It seems that the ping checks would take more time than usual. We give our
router checks a timeout value of either 5, 10, 15 seconds (depending on
location).

About an hour or so after the restart, the ping checks to the routers are
so slow that it looks like they're timing out.

Glenn
>>> "Dirk" <[EMAIL PROTECTED]> 5/23/2008 12:36 PM >>>
Define "slow"

Dirk Bulinckx.

-----Original Message-----
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of
GLENN GASPAR
Sent: Friday, May 23, 2008 7:05 PM
To: Servers Alive Discussion List
Subject: RE: [SA-list] First line of check fails after some modifications

Dirk,

    Just want to add additional observations... I just restarted the server that
runs SA and initially it ran fine. After the 1st cycle of checks, however, it
seems to slow down for some reason (the checks came back OK, but it's
really slow).

Glenn

>>> "Dirk" <[EMAIL PROTECTED]> 5/23/2008 11:05 AM >>>
What type of checks are you using (that give the 'false' downs)?
What does SA show as reason for the down?
What exact version of Servers Alive are you using?


Dirk Bulinckx.

-----Original Message-----
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of
GLENN GASPAR
Sent: Friday, May 23, 2008 5:55 PM
To: Servers Alive Discussion List
Subject: [SA-list] First line of check fails after some modifications

Hello,

      In the past we've noticed that every now and then Servers Alive would
report that some of our routers (first line of checks) are down when in reality
they are still up and running. We would restart SA and then the problem seems to
go away.

      However, this past Tuesday I made some SNMP check modifications, as some of
our servers' Open Manager versions were updated, and I also added entries in
the People group for email notifications. After I made these modifications,
SA would seem to act normally and then after a few hours it would report that
most if not all of our routers are down. Any suggestions?

Thanks,
Glenn Gaspar

To unsubscribe send a message with UNSUBSCRIBE in the subject line to
[email protected] 
If you use auto-responders (like out-of-the-office messages), make sure that
they are not sent to the list nor to individual members.  Doing so will cause
you to be automatically removed from the list.
