Hello all,
As someone who has worked a lot on logging in this project, I feel like
a set of log guidelines are a good addition. I generally agree with
Daan's ideas; however, I think that the error level was oversimplified.
As Daan already said, there are two main reasons for error logs: "the
system can not execute a user request or part of the system is degraded".
When the system cannot execute a user request, a simple message
explaining why it's not possible to execute the request is often enough.
However, when the error happens because of system degradation, there are
times when having a stacktrace can save hours of debugging, as the error
might be hard to reproduce. Also, if the error is occurring in a
production environment, changing the log level to debug means restarting
ACS's components, possibly leading to downtime and/or more errors. The
second issue will be solved with the log4j2 PR, but the first one can
still be a pain.
In summary, a hard rule would not address all situations, but a
guideline or good practices guide would be a great addition to ACS.
Best regards,
João Jandre
On 12/4/23 10:51, Daan Hoogland wrote:
There have been some discussions on github PRs about the subject of
proper logging and I want to use this to start a discussion and come
to some guidelines that hopefully everybody can agree to.
My idea is the following
Fatal:
- include a reason why the system can not function anymore,
- be followed by a System.exit() call,
- include a stack trace if it is reasonable to expect this would give
properly helpful information.
I did a quick search and we have 2 fatal calls in the system. Neither
is followed by a System.exit() and I think they can be replaced by
error()/debug() combinations
Error:
- include a reason why the system can not execute a user request or
why part of the system is degraded.
- should *not* include a stack trace
Warn:
- include a reason why the system might not behave as the user is
expecting. (configuration or integration missing for example,)
- should *not* include a stack trace
Info:
- is for informationals to the operator like resource lifecycle
messages or deployment planning considderations
- should *not* include a stack trace
Debug:
- information to be turned on when trouble shooting or developing.
This information should not be needed for normal operation
- if a thrown exception is caught and not rethrown, the stack trace of
such exception
Trace:
- whatever you feel helps you during development of a
feature/enhancement/fix/improvement that is not needed in any kind of
operational way.
As I think the two fatal()s we have are not genuine, I think stack
traces are only for Debug and Trace level, where Debug can be
temporarily turned on in production and Trace is fro developers
exclusively,
I realise the current state of the system is far from this behaviour,
but i like to think we can grow in the direction as I describe and I
hope we can refine this to a generic guideline to be used in reviews.