Alexey, how are you going to deal with the distributed nature of an Ignite cluster? And how do you propose to handle node restarts / stops?
On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <alexeykukush...@yahoo.com.invalid> wrote:

> Hi Denis,
>
> Monitoring tools simply watch event logs for patterns (regexes in the case of
> unstructured logs like text files). A stable event ID (one that does not change in
> new releases) identifying a specific issue would be such a pattern.
>
> We need to introduce such event IDs according to the principles I described in my
> previous mail.
>
> Best regards, Alexey
>
>
> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <dma...@apache.org> wrote:
>
> Hello Alexey,
>
> Thanks for the detailed input.
>
> Assuming that Ignite supported the suggested event-based model, how can it be
> integrated with the mentioned tools like DynaTrace or Nagios? Is this all we need?
>
> —
> Denis
>
>
> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin <alexeykukush...@yahoo.com.INVALID> wrote:
>
> > Igniters,
> >
> > While preparing some Ignite materials for administrators I found that Ignite is
> > not friendly to such a critical DevOps practice as monitoring.
> >
> > TL;DR: I think Ignite misses structured descriptions of abnormal events, with
> > references to event IDs in the logs that do not change as new versions are
> > released.
> >
> > MORE DETAILS
> >
> > I call an application “monitoring friendly” if it allows DevOps to:
> >
> > 1. immediately receive a notification (email, SMS, etc.)
> > 2. understand what the problem is without involving developers
> > 3. provide an automated recovery action.
> >
> > Large enterprises do not implement custom solutions. They usually use tools like
> > DynaTrace, Nagios, SCOM, etc. to monitor all apps in the enterprise consistently.
> > All such tools have a similar architecture: a dashboard showing apps as
> > “green/yellow/red”, and numerous “connectors” to look for events in text logs,
> > ESBs, database tables, etc.
> >
> > For each app DevOps build a “health model”: a diagram displaying the app’s
> > “manageable” components and the app boundaries. A “manageable” component is
> > something that can be started/stopped/configured in isolation. The “system
> > boundary” is the list of external apps that the monitored app interacts with.
> >
> > The main attribute of a manageable component is a list of “operationally
> > significant events”. Those are the events that DevOps can do something about. For
> > example, “failed to connect to cache store” is significant, while “user input
> > validation failed” is not.
> >
> > Events shall be as specific as possible so that DevOps do not spend time on
> > further analysis. For example, a “database failure” event is not good. There
> > should be “database connection failure”, “invalid database schema”, “database
> > authentication failure”, etc. events.
> >
> > An “event” is NOT the same as an exception that occurred in the code. Events
> > identify a specific problem from the DevOps point of view. For example, even if a
> > “connection to cache store failed” exception might be thrown from several places
> > in the code, it is still the same event. On the other hand, even if
> > SqlServerConnectionTimeout and OracleConnectionTimeout exceptions might be caught
> > in the same place, those are different events, since MS SQL Server and Oracle are
> > usually handled by different DevOps groups in large enterprises!
> >
> > The operationally significant event IDs must be stable: they must not change from
> > one release to another. This is like a contract between developers and DevOps.
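> >
> > As a rough sketch, such a contract could even be captured in code. Here is a
> > minimal Java example, with all names hypothetical (nothing below is an existing
> > Ignite API, and the “OPS-EVENT” log prefix is made up for illustration):
> >
> >     /** Severity levels: Critical = not operational, Warning = degraded, None = info. */
> >     enum Severity { CRITICAL, WARNING, NONE }
> >
> >     /**
> >      * Catalog of operationally significant events. The numeric IDs are frozen:
> >      * they must never change between releases, so DevOps patterns keep working.
> >      */
> >     public enum OpsEvent {
> >         ZOOKEEPER_CONNECTION_FAILED(10100, Severity.CRITICAL),
> >         NODE_LEFT_CLUSTER(10200, Severity.WARNING);
> >
> >         private final int id;
> >         private final Severity severity;
> >
> >         OpsEvent(int id, Severity severity) {
> >             this.id = id;
> >             this.severity = severity;
> >         }
> >
> >         /**
> >          * Stable, regex-friendly prefix to prepend to the human-readable log
> >          * message, e.g. "OPS-EVENT 10100 CRITICAL".
> >          */
> >         public String logPrefix() {
> >             return "OPS-EVENT " + id + ' ' + severity;
> >         }
> >     }
> >
> > A monitoring tool could then watch the logs with a pattern like
> > "OPS-EVENT (\d+) (CRITICAL|WARNING)" and keep working across releases no matter
> > how the free-text part of the message evolves.
> >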
> > It should be the developer’s responsibility to publish and maintain a table with
> > the following attributes:
> >
> > - Event ID
> > - Severity: Critical (Red) - the system is not operational; Warning (Yellow) -
> >   the system is operational but its health is degraded; None - just an info
> >   message.
> > - Description: concise but enough for DevOps to act without the developer’s help.
> > - Recovery actions: what DevOps shall do to fix the issue without the developer’s
> >   help. DevOps might create automated recovery scripts based on this information.
> >
> > For example:
> >
> > 10100 - Critical - Could not connect to Zookeeper to discover nodes - 1) Open the
> > Ignite configuration and find the Zookeeper connection string. 2) Make sure
> > Zookeeper is running.
> > 10200 - Warning - An Ignite node left the cluster.
> >
> > Back to Ignite: it looks to me like we do not design for operations as described
> > above. We have no event IDs: our logging is subject to change in new versions, so
> > any patterns DevOps might use to detect significant events would stop working
> > after an upgrade.
> >
> > If I am not the only one who has such concerns, then we might open a ticket to
> > address this.
> >
> > Best regards, Alexey

--
Alexey Kuznetsov