DevProjects/Status Box

Motivation

OpenNMS today is a tool used and operated mostly by system and network administrators, whom we call our users. The purpose of the OpenNMS start page is to give a user a status overview of the monitored network, servers, applications and systems. In large environments with thousands of nodes, however, some of its visualization elements do not scale.

Status-box-feature-old-lists.png

The most problematic lists are:

  • Nodes with Pending Problems: This is a list of all nodes with unacknowledged alarms. I personally don't know why the box is not named that way. "Pending Problems" can mean anything; "Unacknowledged Alarms" is much more transparent, tells the user what the list really contains, and implicitly hints that entries disappear once an alarm is acknowledged. The bigger problem is that the list shows just the last 16 entries, a number that can be configured in opennms.properties. In medium-sized networks with 1000 to 5000 nodes, the number of alarms increases and the box quickly becomes useless. Alarms can be anything, from node, interface and service outages to threshold alarms and Syslog or SNMP trap alarms.
  • Nodes with Outages: This is a list of all nodes with current outages. Outages in OpenNMS relate only to nodes, interfaces and services, so most of the time there are fewer outages than alarms. The list shows the node names, ordered only by the time the outage occurred. The severity and impact of the outage are not reflected. The box is only useful in networks with fewer than 500 monitored nodes.
  • Applications: This is a relic from before the Business Service functionality. It allows modeling an application which delivers services across multiple servers and was used to monitor "Applications" from different network perspectives with the Remote Poller. By adding the status to the start page, the concept has been overloaded with a status view for these modeled applications. The Business Service functionality supersedes this function, but it is still available in the product. It should be deprecated and can be replaced by the Business Service feature. The distributed monitoring status for applications can be modeled from Remote Poller alarms as well, but that needs to be addressed in a different article.
  • Business Services: This list box was added with the Business Service Monitoring (BSM) feature.

Suggested Solution

The status-box enhancement addresses this problem by introducing a status overview for Business Services, Applications, Alarms and Outages. It also improves the workflow by allowing the user to filter for applications in a normal or faulty status, which is currently not implemented in the application box.

Besides showing faulty status, the charts give an at-a-glance view of all Business Services, Applications and Nodes in the system. The administrator gets a quick impression of how many are affected relative to the total number in the inventory.

The order from left to right reflects the priority in which an administrator has to deal with issues.

  1. Business Services: Give a high-level status of something which impacts your business and needs to be addressed with the highest priority.
  2. Applications: Critical monitored applications which may impact or degrade services provided in the network.
  3. Alarms: All other outstanding alarms that nobody has taken care of.
  4. Outages: Shows all current outages seen by the network monitoring system.

To address visual scalability and to give an impression of the "Ok-to-Not-Ok" ratio, donut charts are used as shown in the screenshot below.

Status-overview.png

For each model, a status is calculated based on the described entity. The segments of a chart can be clicked and lead to detailed lists, such as Business Services by severity. Specific actions on the detailed list lead to the topology UI, which shows the hierarchy of a Business Service or Application. For Alarms and Outages, the node detail page can be used as an action to give a more detailed view of a certain outage. Additionally, the status box allows removing items from the status calculation by clicking the severity icon in the chart legend.
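
The per-severity counting behind such a donut chart, including the legend toggle that removes a severity from the calculation, could look roughly like the sketch below. This is not the OpenNMS implementation: the function name donut_segments, the record layout and the sample services are hypothetical, and only a subset of the OpenNMS severity names is used.

```python
from collections import Counter

# Subset of OpenNMS severity names, ordered from least to most severe.
# The record layout and the sample data below are hypothetical.
SEVERITIES = ["Normal", "Warning", "Minor", "Major", "Critical"]


def donut_segments(entities, hidden=()):
    """Count entities per severity, skipping severities toggled off in the legend."""
    counts = Counter(
        e["severity"] for e in entities if e["severity"] not in hidden
    )
    # Keep only severities that actually occur, in severity order.
    return [(s, counts[s]) for s in SEVERITIES if counts[s] > 0]


services = [
    {"name": "Webshop", "severity": "Normal"},
    {"name": "Mail", "severity": "Major"},
    {"name": "VPN", "severity": "Normal"},
    {"name": "ERP", "severity": "Critical"},
]

print(donut_segments(services))
print(donut_segments(services, hidden={"Normal"}))
```

Hiding "Normal" leaves only the faulty segments, which is the effect the legend click is meant to have on the chart.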

Improvements

When the feature was discussed on the opennms-discuss list, it became clear that removing the Alarm, Outage, Application and Business Service list boxes means a loss of functionality. With auto-refresh, the front page shows the newest Alarms and Outages without human interaction. This function is lost when the Top N list boxes are replaced with an aggregated view. To address this problem, the default behavior is changed to the three-column view and the "Application" element is removed for space reasons. We decided to remove the Application element because it can be replaced by modeling with Business Services and does not need a dedicated visualization element.

Combined-view.png

If the status box overview does not work for a user, it can easily be removed with the org.opennms.web.console.centerUrl attribute in opennms.properties.

Further Ideas

Suggestions made during the discussion and worth considering:

Combination of both

The status box should have a mouse-over / hover effect on the "rings" which opens a list with the latest outages/events from the hovered section, with links to the affected nodes. These links should not open in the same window but in a popup or a new tab, so you don't lose focus on the dashboard.

Tagged grouping and aggregation

Use surveillance categories as tags to build groups of nodes, and aggregate outages per group to show something like "Top 10 locations with outages":

  • Fulda (2/21 nodes)
  • Stuttgart (1/40 nodes)
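
A minimal sketch of such an aggregation is shown below, assuming each node record carries its surveillance categories and an outage flag. The function name top_locations and the fields categories and has_outage are hypothetical and do not correspond to an OpenNMS API.

```python
from collections import defaultdict


def top_locations(nodes, n=10):
    """Rank categories by number of nodes with an outage and format display lines."""
    total = defaultdict(int)  # nodes per category
    down = defaultdict(int)   # nodes with an outage per category
    for node in nodes:
        for category in node["categories"]:
            total[category] += 1
            if node["has_outage"]:
                down[category] += 1
    # Most affected categories first; ties are broken alphabetically.
    ranked = sorted(down, key=lambda c: (-down[c], c))[:n]
    return [f"{c} ({down[c]}/{total[c]} nodes)" for c in ranked]


# Hypothetical inventory matching the example list: 21 nodes in Fulda, 40 in Stuttgart.
nodes = (
    [{"categories": ["Fulda"], "has_outage": i < 2} for i in range(21)]
    + [{"categories": ["Stuttgart"], "has_outage": i < 1} for i in range(40)]
)

for line in top_locations(nodes):
    print(line)
```

Categories without any outage never enter the ranking, so the list stays short even in a large inventory.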

Branches and Contact