DevProjects/BusinessServiceMonitoring

Warning: Page has moved

The initial development for this project has been completed, and this page is kept only as a reference. Please refer to the documentation associated with your release for installation and usage notes.

Business Service Monitoring

Goal

Provide enterprises with the ability to correlate faults to business services.

High-Level Features

  1. Business service level fault management
  2. Service hierarchy visualization
  3. Root cause analysis
  4. Business impact analysis
  5. Enterprise reporting

Terminology

Business Service
A service provided by the business. These may rely on other business services and on the state of managed elements.
Service Hierarchy
A service and all of its related components, down to the alarms and managed elements.
IP-Service
A service associated with an IP interface on a node.
Managed Element
A node, an interface, or a service
Severity
Possible values include Indeterminate (1), Cleared (2), Normal (3), Warning (4), Minor (5), Major (6) and Critical (7).
State of a Business Service
The current severity associated with a Business Service. May also be referred to as the Business Service's Operational Status.
Status of a Business Service
Whether a Business Service is enabled or disabled. May also be referred to as the Business Service's Administrative Status.
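
As an illustration of the Severity definition above, the seven levels could be modelled as an ordered enumeration. This is only a sketch: the class name `Severity` and the Python representation are assumptions made for the example, not part of OpenNMS.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels with the numeric values listed in the terminology."""
    INDETERMINATE = 1
    CLEARED = 2
    NORMAL = 3
    WARNING = 4
    MINOR = 5
    MAJOR = 6
    CRITICAL = 7


# Because the values are ordered, the "worse" of two severities is the maximum.
worst = max(Severity.WARNING, Severity.MAJOR)
```

Using an ordered type keeps comparisons such as "greater than or equal to Major" straightforward in later calculations.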

Agile

Actors

Jeff
An OpenNMS Administrator
Paul
An OpenNMS User

Sprint 1 - Application Visualization

  • Start building support for visualizing service hierarchies.
    Applications, which are composed of one or more IP-Services, are used as a starting point since they use a simple hierarchy and are already modeled in the database.
  • Start developing the database schema, object model and DAOs for persisting the business services.

Sprint 2 - Business Service Model

  • Continue developing the business service model and a UI to configure it
  • Evaluate the current state and architecture of the Topology UI, based on our experiences in the previous sprint

Sprint 3 - Status And State Machine

  • Moved the Admin UI from AngularJS to Vaadin
  • Created the Master Status Page (Vaadin Application)
  • Initial implementation of the event-driven state machine

Sprint 4 - Model and Maintenance

  • Improved state machine and fixed various bugs
  • Start adding support for creating multi-level hierarchies

Sprint 5 - Hierarchy and Topology

  • Completed support for multi-level hierarchies
  • Added support for visualizing hierarchies in the Topology UI

Sprint 6 - Map and Reduce

  • Added support for configuration status propagation logic (using map and reduce functions)
  • Fixed various issues while preparing for upcoming sponsor demo

Sprint 7 - XML Schema Definition

  • Added an XML schema definition for the Business Services REST API
  • Improved state machine and fixed various bugs
  • Fully implemented the Admin UI to configure business services

Sprint 8 - Root Cause and Impact Analysis

  • Added the functionality to do RCA (Root Cause Analysis) and IA (Impact Analysis)
  • Added support to individually configure icons in the Topology UI
  • Fixed various issues while preparing for upcoming sponsor demo

Sprint 9 - Info Window

....

(Evolving) Architecture

The solution's architecture will be developed iteratively as we build out the feature set.

The goal of this section is to give stakeholders a view of the overall solution, and the reasoning behind the various design decisions.

Model

Business services are entities that depend on other business services or on alarms.

Alarms are used to "bridge the gap" between the business services and the existing fault management layer within OpenNMS.

Bpm-model.png

Components

Overview

Bpm-components.png

Topology UI

Used to view the current state of the business service hierarchy.

Configuration UI

Used to model and configure the business services hierarchy.

Enterprise Reporting

Generates historical reports related to business services availability.

ReST API

Enables third-party integrations.

State Machine

Overview

The state machine’s role is to maintain the state of the Business Service Hierarchies and provide hooks for other components to know when the state changes.

We can think of a Business Service Hierarchy as a tree structure with Business Services and Alarms as vertices.

Engine

The state machine engine will run as a daemon inside the OpenNMS JVM.

The engine will periodically poll and listen for alarm-related life-cycle events in order to maintain the state of the Business Services.

The state of the Business Services will be made available via the service layer and events will be generated when this state changes.
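
The "hooks" idea described above can be sketched as a small observer: the engine records the last known state per service and notifies registered listeners only when that state actually changes. The class name `StateChangeNotifier`, the listener signature and the use of plain strings for severities are all assumptions made for this illustration; the real engine would emit OpenNMS events instead.

```python
from typing import Callable, Dict, List

# Hypothetical listener signature: (service name, old severity, new severity).
Listener = Callable[[str, str, str], None]


class StateChangeNotifier:
    """Tracks the last known severity per service and notifies listeners on change."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}
        self._listeners: List[Listener] = []

    def add_listener(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def update(self, service: str, severity: str) -> bool:
        """Store the new severity; return True if it differed from the old one."""
        old = self._state.get(service, "Indeterminate")
        if old == severity:
            return False  # nothing changed, so no notification is sent
        self._state[service] = severity
        for listener in self._listeners:
            listener(service, old, severity)
        return True


changes = []
notifier = StateChangeNotifier()
notifier.add_listener(lambda name, old, new: changes.append((name, old, new)))
notifier.update("Master", "Warning")
notifier.update("Master", "Warning")  # same state again: listeners are not called
```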

Model

For the purpose of the state machine, the edges and vertices of the (Business Service Hierarchy) trees will have the following attributes:

Business Service (Vertex):

  • Administrative Status
    • On or Off
  • Operational Status
    • Severity; defaults to "Indeterminate"
  • Reduce Function
    • Used to calculate the state of this vertex from the (mapped) state of all its children

Alarm (Vertex):

  • Severity
    • Defaults to "Indeterminate" if the alarm does not exist

Dependency (Edge):

  • Map Function
    • Used by the parent to calculate the state of a child
  • Weight
    • Relative weight of this dependency

If IP-Services are used as part of the Business Service Hierarchy, we will use the alarm(s) for that service instead of the outage record.
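
The vertex and edge attributes listed above could be captured in a few data classes. This is a minimal sketch rather than the project's actual object model: the class and field names (`Alarm`, `Dependency`, `BusinessService`, `enabled`, and so on) are hypothetical, and map and reduce functions are represented as plain callables.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, List, Optional, Tuple, Union


class Severity(IntEnum):
    INDETERMINATE = 1
    CLEARED = 2
    NORMAL = 3
    WARNING = 4
    MINOR = 5
    MAJOR = 6
    CRITICAL = 7


@dataclass
class Alarm:
    """Alarm vertex; the severity is Indeterminate when the alarm does not exist."""
    name: str
    severity: Severity = Severity.INDETERMINATE


@dataclass
class Dependency:
    """Edge from a parent business service to one of its children."""
    child: Union["BusinessService", Alarm]
    map_function: Callable[[Severity], Optional[Severity]]
    weight: int = 1  # relative weight of this dependency


@dataclass
class BusinessService:
    """Business service vertex."""
    name: str
    reduce_function: Callable[[List[Tuple[Severity, int]]], Severity]
    enabled: bool = True  # administrative status (On or Off)
    operational_status: Severity = Severity.INDETERMINATE
    dependencies: List[Dependency] = field(default_factory=list)


# A one-level hierarchy: a service that depends on a single alarm.
disk = Alarm("90% Disk Usage Threshold on bamboo", Severity.WARNING)
master = BusinessService(
    name="Master",
    reduce_function=lambda weighted: max(sev for sev, _ in weighted),
    dependencies=[Dependency(child=disk, map_function=lambda sev: sev)],
)
```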

Alarm Life-cycle

The engine needs to be able to keep track of the alarm life-cycle in order to maintain an accurate state.

The following pages provide an overview of alarms and their properties:

 http://www.opennms.org/wiki/Configuring_alarms
 http://www.opennms.org/wiki/Alarms

The life cycle of an alarm is as follows:

Bsm-alarm-lifecycle.png

The arrows indicate state changes and their labels indicate events that should be sent out after the state change is complete.

In addition to the state changes indicated above, it is also possible to acknowledge and un-acknowledge an alarm, but this has no effect on the Business Service state.

The alarm state changes can be performed by several different components:

Alarmd:

  • Creates the alarms
  • Reduces events into existing alarms

Ackd:

  • Acknowledges alarms
  • Un-acknowledges alarms
  • Escalates alarms
  • Clears alarms

Vacuumd:

  • Acknowledges alarms
  • Escalates alarms
  • Clears alarms
  • Unclears alarms

WebUI:

  • Acknowledges alarms
  • Escalates alarms
  • Clears alarms

State Calculation

The severity of a Business Service is calculated by applying the map functions to all its children and then calling the reduce function on the weighted results.

We will provide the following built-in map functions:

  • Identity
    • Uses the severity “as-is”
  • Increase
    • Increases the severity one level
  • Decrease
    • Decreases the severity one level
  • Set To
    • Returns a constant value
  • Ignore
    • Ignores the severity of this vertex
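
The built-in map functions listed above can be sketched as small pure functions. Two details here are assumptions, since the page does not specify them: Increase and Decrease are clamped at the ends of the severity range, and Ignore is represented by returning `None`.

```python
from enum import IntEnum
from typing import Callable, Optional


class Severity(IntEnum):
    INDETERMINATE = 1
    CLEARED = 2
    NORMAL = 3
    WARNING = 4
    MINOR = 5
    MAJOR = 6
    CRITICAL = 7


MapFunction = Callable[[Severity], Optional[Severity]]


def identity(severity: Severity) -> Optional[Severity]:
    """Use the severity as-is."""
    return severity


def increase(severity: Severity) -> Optional[Severity]:
    """Raise the severity one level (assumption: capped at Critical)."""
    return Severity(min(severity + 1, Severity.CRITICAL))


def decrease(severity: Severity) -> Optional[Severity]:
    """Lower the severity one level (assumption: floored at Indeterminate)."""
    return Severity(max(severity - 1, Severity.INDETERMINATE))


def set_to(constant: Severity) -> MapFunction:
    """Build a map function that always returns the given constant."""
    return lambda _severity: constant


def ignore(_severity: Severity) -> Optional[Severity]:
    """Drop this child from the calculation (assumption: modelled as None)."""
    return None
```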

We will provide the following built-in reduce functions:

  • Highest Severity
    • Uses the value of the highest severity
  • Threshold
    • Uses the highest severity found more often than the given threshold
  • Highest Severity Above
    • Propagates the severity only if the highest severity is greater than or equal to the given threshold severity
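
One possible reading of the three reduce functions is sketched below. Each function receives a list of (mapped severity, weight) pairs. The interpretation of Threshold (the highest severity reached or exceeded by at least the given share of the total weight) and the use of `None` for "nothing to propagate" are assumptions, not confirmed behaviour.

```python
from enum import IntEnum
from typing import List, Optional, Tuple


class Severity(IntEnum):
    INDETERMINATE = 1
    CLEARED = 2
    NORMAL = 3
    WARNING = 4
    MINOR = 5
    MAJOR = 6
    CRITICAL = 7


Weighted = List[Tuple[Severity, int]]


def highest_severity(mapped: Weighted) -> Optional[Severity]:
    """Return the highest mapped severity, or None when there are no inputs."""
    return max((severity for severity, _ in mapped), default=None)


def threshold(mapped: Weighted, minimum_share: float) -> Optional[Severity]:
    """Return the highest severity whose weighted share meets the threshold."""
    total = sum(weight for _, weight in mapped)
    if total == 0:
        return None
    for candidate in sorted(Severity, reverse=True):
        # Share of the total weight held by children at or above the candidate.
        share = sum(w for sev, w in mapped if sev >= candidate) / total
        if share >= minimum_share:
            return candidate
    return None


def highest_severity_above(mapped: Weighted, minimum: Severity) -> Optional[Severity]:
    """Propagate the highest severity only if it is at least `minimum`."""
    highest = highest_severity(mapped)
    if highest is not None and highest >= minimum:
        return highest
    return None


mapped = [(Severity.MAJOR, 2), (Severity.CRITICAL, 2), (Severity.WARNING, 1)]
```

With these inputs, `threshold(mapped, 0.75)` settles on Major, because only 40% of the weight is at Critical while 80% is at Major or above.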

When a business service is administratively disabled, any edges pointing to it use the "Ignore" map function.
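
Putting the pieces together, the calculation can be sketched as a recursive walk over the hierarchy: map each child's state, skip ignored or disabled children, then reduce. All names below (`Alarm`, `Edge`, `Service`, `evaluate`) are hypothetical, and falling back to Indeterminate when no child contributes is an assumption based on the default operational status.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, List, Optional, Tuple, Union


class Severity(IntEnum):
    INDETERMINATE = 1
    CLEARED = 2
    NORMAL = 3
    WARNING = 4
    MINOR = 5
    MAJOR = 6
    CRITICAL = 7


def highest_severity(weighted: List[Tuple[Severity, int]]) -> Severity:
    return max(severity for severity, _ in weighted)


@dataclass
class Alarm:
    name: str
    severity: Severity = Severity.INDETERMINATE


@dataclass
class Edge:
    child: Union["Service", Alarm]
    map_function: Callable[[Severity], Optional[Severity]] = lambda sev: sev
    weight: int = 1


@dataclass
class Service:
    name: str
    edges: List[Edge] = field(default_factory=list)
    enabled: bool = True
    reduce_function: Callable[[List[Tuple[Severity, int]]], Severity] = highest_severity


def evaluate(vertex: Union[Service, Alarm]) -> Severity:
    """Compute the severity of a vertex from the mapped state of its children."""
    if isinstance(vertex, Alarm):
        return vertex.severity
    weighted = []
    for edge in vertex.edges:
        child = edge.child
        if isinstance(child, Service) and not child.enabled:
            continue  # a disabled service behaves as if its edge used "Ignore"
        mapped = edge.map_function(evaluate(child))
        if mapped is not None:  # None models the "Ignore" map function
            weighted.append((mapped, edge.weight))
    if not weighted:
        return Severity.INDETERMINATE  # assumption: nothing left to reduce
    return vertex.reduce_function(weighted)


# A disabled child with a Critical alarm does not affect its parent.
disabled = Service("Disabled", [Edge(Alarm("a", Severity.CRITICAL))], enabled=False)
parent = Service("Parent", [Edge(disabled), Edge(Alarm("b", Severity.WARNING))])
state = evaluate(parent)
```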

Example

Let’s assume we have the following Business Service Hierarchy:

Bsm-bamboo-example.png

By fixing severities for the underlying alarms, the leaves of the tree, we can demonstrate how the Business Service states could be calculated.

Calculating the state of the 'Master' service:

Alarm Name                         | Severity      | Mapped Severity
HTTP-8085 on bamboo                | Indeterminate | Cleared
90% Disk Usage Threshold on bamboo | Warning       | Warning

The reduced severity (Most Critical) is Warning

Calculating the state of the 'Agent' service:

Alarm Name               | Mapped Severity | Weight Factor | Critical  | Major     | Minor     | Warning  | Normal
Bamboo-Agent on duke     | Major           | 0.4           | 0         | 0.4       | 0.4       | 0.4      | 0.4
Bamboo-Agent on carolina | Critical        | 0.4           | 0.4       | 0.4       | 0.4       | 0.4      | 0.4
Bamboo-Agent on ncstate  | Warning         | 0.2           | 0         | 0         | 0         | 0.2      | 0.2
Total                    |                 | 1             | 0.4 (40%) | 0.8 (80%) | 0.8 (80%) | 1 (100%) | 1 (100%)

The reduced severity (Threshold 75%) is Major
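
The figures for the 'Agent' service can be reproduced with a short calculation. The relative weights 2, 2 and 1 are an assumption chosen so that they normalise to the weight factors 0.4, 0.4 and 0.2 shown above; exact fractions are used to avoid rounding noise.

```python
from enum import IntEnum
from fractions import Fraction


class Severity(IntEnum):
    INDETERMINATE = 1
    CLEARED = 2
    NORMAL = 3
    WARNING = 4
    MINOR = 5
    MAJOR = 6
    CRITICAL = 7


# (alarm name, mapped severity, relative weight); the weights are assumed.
children = [
    ("Bamboo-Agent on duke", Severity.MAJOR, 2),
    ("Bamboo-Agent on carolina", Severity.CRITICAL, 2),
    ("Bamboo-Agent on ncstate", Severity.WARNING, 1),
]

total = sum(weight for _, _, weight in children)

# Columns of the table, from the highest severity to the lowest.
columns = [Severity.CRITICAL, Severity.MAJOR, Severity.MINOR,
           Severity.WARNING, Severity.NORMAL]

# For each column, the share of weight held by children at or above it.
shares = {
    column: Fraction(sum(w for _, sev, w in children if sev >= column), total)
    for column in columns
}

# The reduced severity is the first (highest) column reaching the 75% threshold.
minimum_share = Fraction(75, 100)
reduced = next(column for column in columns if shares[column] >= minimum_share)

for column in columns:
    print(f"{column.name:<9} {float(shares[column]):.0%}")
print("Reduced severity:", reduced.name)
```

Critical reaches only 40% of the weight, so the first column to meet the 75% threshold is Major at 80%.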

Calculating the state of the 'Major' service:

The reduced severity (Most Critical) is Major

Appendix A - Current State (as of Horizon 17)

OpenNMS models faults as alarms. Alarms can optionally be associated with managed elements such as nodes.

Alarms can be triggered by external events, such as SNMP traps, or by outages on IP-Services.

IP-Services can be grouped together to form an application; however, the use of applications is limited to displaying aggregated availability statistics.

Alarms

  • Alarms are created and updated (with reduced events) by alarmd.
  • Alarms are updated, and deleted once they are cleared, by scripts in vacuumd.
  • No events, other than the ones that triggered or resolved the alarms, are sent when the state of an alarm changes.