Sorry, I was not saying don't look at logs, just that logs are only reactive
and only see things you're logging (if the server crashes you may log nada, but
that's definitely an issue!). I also personally find correlation easier when I
have graphical data, though something like the ELK stack could help here. I
have checks that look at ELK and alert when they find pertinent data. They
could also watch the logs directly, but this way everything is in a single
place and the checks can also look for negatives (e.g. no one logging in for
15 minutes is an error even if everything else is "green").
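
A minimal sketch of such a negative check, assuming the login timestamps have
already been fetched from the log store. The function name, its arguments, and
the way the timestamps are obtained are hypothetical; this is not the actual
check described above, only an illustration of alerting on the absence of an
event:

```python
from datetime import datetime, timedelta


def login_gap_alert(login_times, now, max_gap=timedelta(minutes=15)):
    """Return True when no login has been seen within max_gap of now.

    login_times is any iterable of datetime objects (hypothetical input,
    e.g. the result of a query against the log store).
    """
    login_times = list(login_times)
    if not login_times:
        # No logins at all is the extreme case of a gap.
        return True
    return now - max(login_times) > max_gap


if __name__ == "__main__":
    now = datetime(2015, 8, 24, 9, 0)
    recent = [now - timedelta(minutes=5)]
    stale = [now - timedelta(minutes=20)]
    print(login_gap_alert(recent, now))  # a recent login exists
    print(login_gap_alert(stale, now))   # last login is older than 15 minutes
```

The point of the design is that the check fires on silence, so it still works
when the failing component is no longer writing anything to the logs.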

"looking at logs is 100% accurate at detecting logged problems :-)" - I'm 
stealing this.



Jeremy M Page | SOA DevOps Engineer| Gilbarco Veeder-Root
Office: 336-547-5399 | Office: 336-601-7274 | www.gilbarco.com | www.veeder.com


________________________________________
From: David Lang [da...@lang.hm]
Sent: Monday, August 24, 2015 9:03 AM
To: Page, Jeremy
Cc: Edward Ned Harvey (lopser); Adam Moskowitz; tech@lists.lopsa.org
Subject: Re: [lopsa-tech] Server Overload and Log Processing

On Mon, 24 Aug 2015, Page, Jeremy wrote:

> I think the problem is looking at it as a binary up/down issue when in fact
> you should be able to determine the problem is occurring when the check takes
> longer than a specific threshold. A page taking a second or two to load is
> going to cause close to the havoc a 404 does.

The problem is that when you don't have a site-wide problem, but rather an
intermittent one, your external test may or may not see it.

If you have 10 systems behind a load balancer, and one of those 10 systems has a
problem, at best your external test is going to get an error 1 out of 10 times
(at worst, your load balancer is going to tend to put your external test to one
server instead of rotating it across all 10, in which case you may never see the
problem)
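
To put rough numbers on this, here is a small sketch of the arithmetic. It
assumes each probe independently has the same chance of landing on the broken
server (i.e. the load balancer rotates evenly, which is the best case above);
the function name is made up for illustration:

```python
def detection_probability(failure_rate, probes):
    """Chance that at least one of `probes` independent checks hits a failure.

    failure_rate is the fraction of requests that fail (0.1 for one bad
    server out of ten behind an evenly rotating load balancer).
    """
    return 1 - (1 - failure_rate) ** probes


if __name__ == "__main__":
    # One bad server out of ten: how many probes until detection is likely?
    for n in (1, 5, 10, 22):
        print(n, round(detection_probability(0.1, n), 3))
```

If the load balancer pins the test to a single healthy server, the effective
failure rate seen by the probe is zero and no number of probes will help.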

looking at logs is 100% accurate at detecting logged problems :-)

logs won't detect problems that aren't logged (which is why I think you should
log how long it took to service the request), so you need the external test as
well. But there is a LOT of stuff the external test won't detect.
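
As an illustration of logging how long each request took, here is a minimal
sketch using only the Python standard library. The wrapper name, the logger
name, and the one-second "slow" threshold are assumptions made for the
example, not anything from a particular server or framework:

```python
import logging
import time

log = logging.getLogger("request_timing")  # hypothetical logger name


def timed(handler, slow_threshold=1.0):
    """Wrap a request handler so every call logs how long it took.

    Calls slower than slow_threshold seconds are logged at WARNING so they
    stand out when the logs are searched later.
    """
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            # Runs on success and on exceptions, so failures are timed too.
            elapsed = time.perf_counter() - start
            level = logging.WARNING if elapsed > slow_threshold else logging.INFO
            log.log(level, "handler=%s elapsed=%.3fs", handler.__name__, elapsed)
    return wrapper


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    def hello(name):
        return "hello " + name

    print(timed(hello)("world"))
```

With the elapsed time in every log line, slow responses from a single backend
show up in the logs even when the external test never lands on that backend.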

David Lang

> As far as false positives go the same is true for the failed attempt. This is
> why Nagios and most other monitoring systems offer the ability to confirm a
> failed check. Personally I like to check at moderately long intervals but
> recheck quickly if I discover a possible failure.
>
> Finally, as Adam pointed out, just because the page returns does not mean
> that it's functional. Acceptable user experience is (or should be) the thing
> you are trying to verify.
>
>
> ________________________________________
> From: tech-boun...@lists.lopsa.org [tech-boun...@lists.lopsa.org] on behalf 
> of Edward Ned Harvey (lopser) [lop...@nedharvey.com]
> Sent: Monday, August 24, 2015 6:51 AM
> To: Adam Moskowitz; tech@lists.lopsa.org
> Subject: Re: [lopsa-tech] Server Overload and Log Processing
>
>> From: tech-boun...@lists.lopsa.org [mailto:tech-boun...@lists.lopsa.org]
>> On Behalf Of Adam Moskowitz
>>
>> I don't see how that can be true: If "a bunch of users" will get errors,
>> I believe your page download tester will also see those same errors. If
>> it's not seeing those errors, what good is it?
>
> If your server can handle 100,000 requests per minute, and you get 101,000 
> requests a minute, then 1% of your users will get "Page cannot be displayed" 
> or something similar. You have a 99% chance that your download tester will 
> fail to detect the problem. If it's sustained, you'll probably detect the 
> problem after 100 minutes, but you really should have detected it sooner, and 
> if you detect the problem only as "page failed to download" by your download 
> test, then you don't know why it failed, and the problem doesn't persist, and 
> you'll probably brush it off as a false alarm.
>
>
>> Yes, you should still be looking at your logs, but I believe that what's
>> more critical is that you monitor the service *from the user's point of
>> view*, and that monitoring should reflect the users' experiences as
>> closely as possible.
>
> Agreed.
> _______________________________________________
> Tech mailing list
> Tech@lists.lopsa.org
> https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
> This list provided by the League of Professional System Administrators
> http://lopsa.org/
> Please be advised that this email may contain confidential information. If 
> you are not the intended recipient, please notify us by email by replying to 
> the sender and delete this message. The sender disclaims that the content of 
> this email constitutes an offer to enter into, or the acceptance of, any 
> agreement; provided that the foregoing does not invalidate the binding effect 
> of any digital or other electronic reproduction of a manual signature that is 
> included in any attachment.
