DevProjects/Sextant

Introduction

Project Sextant is a sponsored project that aims to extend OpenNMS with the ability to correlate and group relevant alarms together into higher level objects in order to help operators identify and resolve problems faster.

Architecture

Component Overview

AlarmLifecyleCorrelation v1.png

Alarm Services

AlarmServiceClasses v1a.png

Modeling

Situations

A Situation is a logical specialization of an OnmsAlarm and is used to correlate multiple alarms into a single event. It is simply an OnmsAlarm that has one or more related alarms.

Creating Situations

The main way to create a Situation, and thereby correlate alarms, is to send in an event with one additional parameter for each related alarm, using that alarm's reduction key as the value. Alarms are related by their respective reduction keys. For example, to create a Situation using send-event.pl, provide the reduction keys of the related alarms as parameters:

...

 -p 'related-reductionKey uei.value1' 
 -p 'related-reductionKey uei.value2' 
 -p 'related-reductionKey uei.value3'
# /opt/opennms/bin/send-event.pl --interface 172.16.1.1 uei.opennms.org/alarms/situation -p 'related-reductionKey uei.value1' -p 'related-reductionKey uei.value2' -p 'related-reductionKey uei.value3'

where "uei.value1", "uei.value2" and "uei.value3" are the reduction keys for the existing alarms.
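As a rough sketch of how the command above could be assembled programmatically, the following Python helper builds the send-event.pl argument list from a list of reduction keys. The helper name build_situation_command is hypothetical; the script path, UEI and parameter name are taken from the example above.

```python
import shlex
from typing import List

SEND_EVENT = "/opt/opennms/bin/send-event.pl"       # path from the example above
SITUATION_UEI = "uei.opennms.org/alarms/situation"  # UEI from the example above


def build_situation_command(interface: str, reduction_keys: List[str]) -> List[str]:
    """Build the send-event.pl argument list (hypothetical helper)."""
    cmd = [SEND_EVENT, "--interface", interface, SITUATION_UEI]
    for key in reduction_keys:
        # One -p parameter per related alarm, carrying its reduction key.
        cmd += ["-p", f"related-reductionKey {key}"]
    return cmd


if __name__ == "__main__":
    argv = build_situation_command(
        "172.16.1.1", ["uei.value1", "uei.value2", "uei.value3"]
    )
    # Print a shell-quoted form; pass argv to subprocess.run() to actually send it.
    print(shlex.join(argv))
```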

If any of the related-reductionKey values fail to reduce to an alarm, the Situation will not contain that alarm, i.e. nonexistent alarms are ignored (alarms must therefore exist before they can be correlated).

If a subsequent Situation event is received with a related-reductionKey for an alarm that was not previously correlated to the Situation, it will be correlated, i.e. added to the list of related alarms.

We currently don't have a method for removing alarms from a Situation via a Situation event, but if an alarm is deleted, it is removed from the Situation. Providing removal as a method on the REST endpoint has been discussed, should it end up being required.
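The rules described in this section can be illustrated with a small in-memory model. This is only a toy sketch and not OpenNMS code: the class and method names are hypothetical, and it does not address what happens to a Situation that ends up with no related alarms.

```python
from typing import Dict, Iterable, Set


class SituationModel:
    """Toy in-memory model of the correlation rules (not OpenNMS code)."""

    def __init__(self) -> None:
        self.alarms: Set[str] = set()            # reduction keys of existing alarms
        self.related: Dict[str, Set[str]] = {}   # situation key -> related alarm keys

    def on_situation_event(self, situation_key: str, related_keys: Iterable[str]) -> None:
        members = self.related.setdefault(situation_key, set())
        for key in related_keys:
            if key in self.alarms:   # keys that match no existing alarm are ignored
                members.add(key)     # previously unseen keys are added to the set

    def on_alarm_deleted(self, key: str) -> None:
        self.alarms.discard(key)
        for members in self.related.values():
            members.discard(key)     # a deleted alarm drops out of every Situation


if __name__ == "__main__":
    model = SituationModel()
    model.alarms.update({"uei.value1", "uei.value2"})
    model.on_situation_event("situation", ["uei.value1", "uei.missing"])
    model.on_situation_event("situation", ["uei.value2"])
    model.on_alarm_deleted("uei.value1")
    print(sorted(model.related["situation"]))
```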

Persistence

The related alarms are stored in the alarm_situations table in the database. situation_id identifies the Situation and is a foreign key to the alarms table. related_alarm_id identifies a related alarm and is also a foreign key to the alarms table.

The following is an example where Alarm 13 is a Situation and has correlated 3 related Alarms: 8, 9, and 10:

Situation13.PNG
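To make the table layout concrete, here is a minimal sketch that reproduces the example above in an in-memory SQLite database. SQLite is only a stand-in for the real database, the alarms table is reduced to a single key column, and the key column name alarmid is an assumption; only alarm_situations, situation_id and related_alarm_id come from the description above.

```python
import sqlite3
from typing import List


def build_example_db() -> sqlite3.Connection:
    """Create a stand-in schema and load the Situation 13 example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Simplified alarms table; the column name 'alarmid' is an assumption.
        CREATE TABLE alarms (alarmid INTEGER PRIMARY KEY);
        CREATE TABLE alarm_situations (
            situation_id     INTEGER NOT NULL REFERENCES alarms (alarmid),
            related_alarm_id INTEGER NOT NULL REFERENCES alarms (alarmid)
        );
        """
    )
    conn.executemany("INSERT INTO alarms (alarmid) VALUES (?)", [(8,), (9,), (10,), (13,)])
    conn.executemany(
        "INSERT INTO alarm_situations (situation_id, related_alarm_id) VALUES (?, ?)",
        [(13, 8), (13, 9), (13, 10)],
    )
    conn.commit()
    return conn


def related_alarm_ids(conn: sqlite3.Connection, situation_id: int) -> List[int]:
    """Return the ids of the alarms correlated to the given Situation."""
    rows = conn.execute(
        "SELECT related_alarm_id FROM alarm_situations "
        "WHERE situation_id = ? ORDER BY related_alarm_id",
        (situation_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(related_alarm_ids(build_example_db(), 13))
```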

HELM

Once persisted, Situations are available through HELM and can be filtered on when setting up Alarm Table panels:

Grafana.isSituation.filter.PNG

Additionally, once visible in HELM, by double clicking on the Situation and choosing the Related Alarms tab, the operator can see a summary of related alarms:

Grafana.related.alarms.summary.PNG

Resources

Overview

In order to perform logical grouping of alarms we would like to add some context to the alarm objects to help identify "which physical or logical component the alarm is related to". We are currently calling this physical or logical component "the resource". Alarms are associated with a single resource and resources may be related to each other to form a directed acyclic graph (DAG).

Associating alarms with resources in this fashion allows us to infer how alarms associated with different resources (or even the same resource) relate to one another.

Examples

Here we look at various scenarios observed with real network equipment, in which we receive and trigger a number of alarms that should be associated with the same resource. These should help us design facilities for performing the resource tagging and association.

Cisco NX-OS - Power Supply Failure

Ref #4016363

Overview

Here is a series of events extracted from a single ticket (correlated alarm) from another NMS:

Service checks:

  • time:June 21st 2018, 21:37:48.000 source:service severity:Major description:Power Supply out detailedDescription: PowerSupply Module 2- N2200-PAC-400W Out
  • time:June 21st 2018, 21:38:43.000 source:service severity:Cleared description:Power Supply in detailedDescription: PowerSupply Module 2- N2200-PAC-400W

SNMP Traps:

  • time:June 21st 2018, 21:44:49.000 source:trap severity:Major trapTypeOid:.1.3.6.1.4.1.9.9.117.2.0.2 description:cefc power status down
  • time:June 21st 2018, 21:44:49.000 source:trap severity:Cleared trapTypeOid:.1.3.6.1.4.1.9.9.117.2.0.3 description:cefc FRU inserted
  • time:June 21st 2018, 21:44:49.000 source:trap severity:Cleared trapTypeOid:.1.3.6.1.4.1.9.9.117.2.0.2 description:cefc power status up

Syslog messages:

  • <186>: 2018 Jun 21 21:44:28 CDT: %PFMA-2-FEX_PS_REMOVE: Fex 118 Power Supply 2 removed (Serial number REDACTED)
  • <186>: 2018 Jun 21 21:44:48 CDT: %PFMA-2-FEX_PS_FOUND: Fex 118 Power Supply 2 found (Serial number REDACTED)

In this case there were a total of 4 different alarms from 3 distinct sources, all related to the same power supply: one from the service check, one from the syslog messages, and one for each of the 2 trap types.

Analysis

The method used to perform the service check is not known, but it is likely some form of shell scraping or API polling. Given the 'entPhysicalIndex' of the PSU, it is possible to poll its operational status via the 'CISCO-ENTITY-FRU-CONTROL-MIB'.

According to the MIB for the given SNMP traps: The varbind for this notification indicates the entPhysicalIndex of the FRU, and the new operational-status of the FRU.

From the Syslog messages we can parse both the FEX and slot #.

Given the FEX and slot # we could match this to the corresponding 'entPhysicalAlias' from the ENTITY-MIB and identify the 'entPhysicalIndex' for the element contained in that slot. This would allow us to relate the syslog messages to the same identifier that we have in both the trap and service based checks.

Summary:

  • Service: entPhysicalIndex
  • SNMP Trap: entPhysicalIndex
  • Syslog message: We have the name of the power supply bay 'Fex 118 Power Supply 2'

Use the 'ENTITY-MIB' to find the 'entPhysicalIndex' for the PSU contained in the given bay.
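The lookup described here can be sketched as follows, using values copied from the ENTITY-MIB subset listed under Device Details. In that walk, column .2 is entPhysicalDescr and column .4 is entPhysicalContainedIn. The function names are hypothetical, and the "Fex-N PowerSupplyBay-M" naming pattern is an assumption drawn from this one device. Note that the excerpt only includes containment rows for Fex 101, so the second step is shown with those.

```python
import re
from typing import Dict, List, Optional

# entPhysicalDescr values (column .2) copied from the walk under Device Details.
ENT_PHYSICAL_DESCR: Dict[int, str] = {
    118000278: "Fex-118 PowerSupplyBay-1",
    118000279: "Fex-118 PowerSupplyBay-2",
}

# entPhysicalContainedIn values (column .4) from the same walk: child -> parent.
ENT_PHYSICAL_CONTAINED_IN: Dict[int, int] = {
    101000279: 998101,
    101000471: 101000279,
    101000536: 101000471,
}

# FEX number and power supply slot as they appear in the PFMA syslog messages.
SYSLOG_PSU = re.compile(r"Fex (\d+) Power Supply (\d+)")


def bay_index_from_syslog(message: str, descr: Dict[int, str]) -> Optional[int]:
    """Return the entPhysicalIndex of the bay named in the syslog message."""
    match = SYSLOG_PSU.search(message)
    if match is None:
        return None
    # Assumed naming pattern, based on the sample walk only.
    wanted = f"Fex-{match.group(1)} PowerSupplyBay-{match.group(2)}"
    for index, text in descr.items():
        if text == wanted:
            return index
    return None


def contained_in(parent: int, contained: Dict[int, int]) -> List[int]:
    """Return the indexes of the entities directly contained in 'parent'."""
    return sorted(child for child, p in contained.items() if p == parent)


if __name__ == "__main__":
    msg = "%PFMA-2-FEX_PS_REMOVE: Fex 118 Power Supply 2 removed (Serial number REDACTED)"
    print(bay_index_from_syslog(msg, ENT_PHYSICAL_DESCR))
    print(contained_in(101000279, ENT_PHYSICAL_CONTAINED_IN))
```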

Device Details (SNMP)

Subset of MIB-2 System:

SNMPv2-MIB::sysDescr.0 = STRING: Cisco NX-OS(tm) n6000, Software (n6000-uk9)
SNMPv2-MIB::sysObjectID.0 = OID: SNMPv2-SMI::enterprises.9.12.3.1.3.1410

Subset of ENTITY-MIB:

.1.3.6.1.2.1.47.1.1.1.1.2.998101 = STRING: "Fex-101 Nexus2248 Chassis"
.1.3.6.1.2.1.47.1.1.1.1.2.118000022 = STRING: "Fex-118 Fabric Extender Module: 48x1GE, 4x10GE in FixedModule-1"
.1.3.6.1.2.1.47.1.1.1.1.2.118000214 = STRING: "Fex-118 Module-1"
.1.3.6.1.2.1.47.1.1.1.1.2.118000278 = STRING: "Fex-118 PowerSupplyBay-1"
.1.3.6.1.2.1.47.1.1.1.1.2.118000279 = STRING: "Fex-118 PowerSupplyBay-2"
.1.3.6.1.2.1.47.1.1.1.1.2.101000471 = STRING: "Fex-101 A/C,110/220v 400W"
.1.3.6.1.2.1.47.1.1.1.1.2.101000536 = STRING: "Fex-101 PowerSupply-2 Fan-1"
...
.1.3.6.1.2.1.47.1.1.1.1.4.101000279 = INTEGER: 998101
.1.3.6.1.2.1.47.1.1.1.1.4.101000471 = INTEGER: 101000279
.1.3.6.1.2.1.47.1.1.1.1.4.101000536 = INTEGER: 101000471

Subset of CISCO-ENTITY-FRU-CONTROL-MIB:

# snmpwalk -On -c 'REDACTED' -v 2c 10.0.0.1 .1.3.6.1.4.1.9.9.117.1.1 | grep 101000471
.1.3.6.1.4.1.9.9.117.1.1.1.1.1.101000471 = INTEGER: 2 # (cefcPowerRedundancyMode - redundant(2))
.1.3.6.1.4.1.9.9.117.1.1.1.1.2.101000471 = STRING: "CentiAmps @ 12V" # (cefcPowerUnits)
.1.3.6.1.4.1.9.9.117.1.1.1.1.3.101000471 = INTEGER: 3300 # (cefcTotalAvailableCurrent)
.1.3.6.1.4.1.9.9.117.1.1.1.1.4.101000471 = INTEGER: 360 # (cefcTotalDrawnCurrent)
.1.3.6.1.4.1.9.9.117.1.1.2.1.1.101000471 = INTEGER: 1 # (cefcFRUPowerAdminStatus - on(1): Turn FRU on, off(2): Turn FRU off.)
.1.3.6.1.4.1.9.9.117.1.1.2.1.2.101000471 = INTEGER: 2 # (cefcFRUPowerOperStatus - on(2): FRU is powered on.)
.1.3.6.1.4.1.9.9.117.1.1.2.1.3.101000471 = INTEGER: 360 # (cefcFRUCurrent)

Link Down

Ref #4015990

Overview

Here is a series of events extracted from a single ticket (correlated alarm) from another NMS:

Service checks:

  • time:June 21st 2018, 19:52:32.000 source:service severity:Major description:Port down due to oper detailedDescription:Port Down due to Oper Status down
  • time:June 21st 2018, 20:56:45.000 source:service severity:Cleared description:Port down due to oper - Cleared due to ForceClear detailedDescription:Port down due to oper - Cleared due to ForceClear

SNMP Traps:

  • time:June 21st 2018, 19:51:46.000 source:trap severity:Minor trapTypeOid:.1.3.6.1.6.3.1.1.5.3 description:SNMP Link down
  • time:June 21st 2018, 20:56:45.000 source:trap severity:Cleared trapTypeOid:N/A description:SNMP Link down - Cleared due to ForceClear location

Syslog messages:

  • <189>: 2018 Jun 21 19:51:41 CDT: %FEX-5-FEX_PORT_STATUS_NOTI: Uplink-ID 2 of Fex 109 that is connected with Ethernet1/22 changed its status from Active to Disconnected
  • <189>: 2018 Jun 21 19:51:42 CDT: %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel900: Ethernet1/22 is down
  • <189>: 2018 Jun 21 19:51:42 CDT: %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/22 is down (Link failure)

Analysis

In the case of the service check and SNMP traps, we know the `ifIndex` of the related interface.

For Syslog messages, we can parse the `ifAlias`.

Summary:

  • Service: ifIndex
  • SNMP Trap: ifIndex
  • Syslog: ifAlias

Use the interface tables of the 'IF-MIB' ('ifTable', and 'ifXTable' where 'ifAlias' is defined) to map the 'ifAlias' to the 'ifIndex'.
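A minimal sketch of this mapping is shown below. It extracts the interface name from the syslog messages above and looks it up in a name-to-ifIndex table. The regular expression is an assumption based on the sample messages, the function names are hypothetical, and the ifIndex value in the table is made up, since the source does not list the interface tables for this device.

```python
import re
from typing import Dict, Optional

# Matches interface names such as "Ethernet1/22" (pattern assumed from the samples).
INTERFACE_NAME = re.compile(r"\b(Ethernet\d+(?:/\d+)+)\b")

# Hypothetical lookup table built from the device's interface tables.
# The ifIndex value below is illustrative only.
NAME_TO_IFINDEX: Dict[str, int] = {"Ethernet1/22": 1022}


def interface_from_syslog(message: str) -> Optional[str]:
    """Return the first interface name found in the message, if any."""
    match = INTERFACE_NAME.search(message)
    return match.group(1) if match else None


def if_index_from_syslog(message: str, table: Dict[str, int]) -> Optional[int]:
    """Resolve a syslog message to an ifIndex, or None if it cannot be mapped."""
    name = interface_from_syslog(message)
    return table.get(name) if name is not None else None


if __name__ == "__main__":
    msg = "%ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/22 is down (Link failure)"
    print(interface_from_syslog(msg), if_index_from_syslog(msg, NAME_TO_IFINDEX))
```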

BGP Peer Down

Ref #4018653

Overview

Here is a series of events extracted from a single ticket (correlated alarm) from another NMS:

Service checks:

  • time:June 22nd 2018, 19:01:43.000 source:service severity:Major description:BGP neighbor loss VRF due to oper detailedDescription: BGP Neighbor connection lost between 10.0.0.1 and 10.0.0.2
  • time:June 22nd 2018, 19:02:14.000 source:service severity:Cleared description:BGP neighbor found detailedDescription:BGP Neighbor connection restablished between 10.0.0.1 and 10.0.0.2

SNMP Traps:

  • time:June 22nd 2018, 19:01:22.000 source:trap severity:Major trapTypeOid:.1.3.6.1.2.1.15.7.2 description:BGP down trap
  • time:June 22nd 2018, 19:01:22.000 source:trap severity:Information trapTypeOid:.1.3.6.1.4.1.9.9.187.0.2 description:Cisco BGP backward transition trap
  • time:June 22nd 2018, 19:01:22.000 source:trap severity:Major trapTypeOid:.1.3.6.1.4.1.9.9.187.0.1 description:Cisco BGP down trap
  • time:June 22nd 2018, 19:01:22.000 source:trap severity:Information trapTypeOid:.1.3.6.1.4.1.9.9.187.0.1 description:Cisco BGP FSM state changed trap

Syslog messages:

  • <187>1000865: 1000865: Jun 22 19:01:21.506 CDT: %BGP-3-NOTIFICATION: sent to neighbor 10.0.0.1 4/0 (hold time expired) 0 bytes
  • <189>1000867: 1000867: Jun 22 19:01:21.550 CDT: %BGP-5-ADJCHANGE: neighbor 10.0.0.1 vpn vrf REDACTED Down BGP Notification sent
  • <189>1000869: 1000869: Jun 22 19:02:12.913 CDT: %BGP-5-ADJCHANGE: neighbor 10.0.0.1 vpn vrf REDACTED Up

Analysis

In each of these cases we are given the peer address, and in some cases we also have the VRF.
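As an illustration, the peer address and VRF can be pulled out of the ADJCHANGE syslog messages above with a regular expression. This is a sketch only: the pattern is derived from the two sample messages and the function name is hypothetical.

```python
import re
from typing import NamedTuple, Optional


class BgpPeerEvent(NamedTuple):
    peer: str
    vrf: Optional[str]   # None when the message carries no VRF
    state: str           # "Up" or "Down"


# Pattern derived from the sample %BGP-5-ADJCHANGE messages (an assumption).
ADJCHANGE = re.compile(
    r"%BGP-5-ADJCHANGE: neighbor (?P<peer>\S+)"
    r"(?: vpn vrf (?P<vrf>\S+))?"
    r" (?P<state>Up|Down)\b"
)


def parse_adjchange(message: str) -> Optional[BgpPeerEvent]:
    """Extract peer, VRF and state from an ADJCHANGE message, if it matches."""
    match = ADJCHANGE.search(message)
    if match is None:
        return None
    return BgpPeerEvent(match.group("peer"), match.group("vrf"), match.group("state"))


if __name__ == "__main__":
    msg = ("<189>1000867: 1000867: Jun 22 19:01:21.550 CDT: %BGP-5-ADJCHANGE: "
           "neighbor 10.0.0.1 vpn vrf REDACTED Down BGP Notification sent")
    print(parse_adjchange(msg))
```

The (peer, vrf) pair could then serve as the resource key shared by the service check, trap and syslog alarms.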

Correlation Engines

Correlation engines are currently maintained in a source tree to help accelerate development. See OpenNMS Correlation Engine for details.

The goal is to allow the correlation engines to run either within the same JVM as OpenNMS (for ease of use) or as a standalone application (for larger deployments).

Temporal

This engine groups alarms together based on whether or not they occurred in the same time window.
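One possible reading of this behaviour is sketched below. It is not the engine's actual implementation: the window is assumed to start at the first alarm of each group, and the alarm representation and function name are hypothetical.

```python
from typing import List, Tuple

Alarm = Tuple[str, float]  # (reduction key, first-seen time in seconds)


def group_by_time_window(alarms: List[Alarm], window_s: float) -> List[List[Alarm]]:
    """Group alarms whose times fall within window_s of the group's first alarm."""
    groups: List[List[Alarm]] = []
    for alarm in sorted(alarms, key=lambda a: a[1]):
        # Compare against the first (earliest) alarm of the current group.
        if groups and alarm[1] - groups[-1][0][1] <= window_s:
            groups[-1].append(alarm)
        else:
            groups.append([alarm])
    return groups


if __name__ == "__main__":
    sample = [("a", 0.0), ("c", 100.0), ("b", 30.0), ("d", 110.0)]
    for group in group_by_time_window(sample, window_s=60.0):
        print([key for key, _ in group])
```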

Cluster

This engine associates alarms with vertices on a DAG and clusters them using the DBSCAN algorithm, with a distance function that takes into account notions of both space and time. The spatial component is defined as the number of hops between vertices. The time component is based on the difference between the times at which both alarms were first observed.
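The following sketch shows one way such a distance function could look. How the engine actually combines the two components is not stated here, so the additive form and the time_scale_s parameter are assumptions, as are the function names. Hops are counted with a breadth-first search over an undirected view of the graph.

```python
import math
from collections import deque
from typing import Dict, Hashable, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected adjacency sets


def hops(graph: Graph, src: Hashable, dst: Hashable) -> Optional[int]:
    """Return the number of hops between two vertices, or None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        vertex, dist = queue.popleft()
        for neighbour in graph.get(vertex, ()):
            if neighbour == dst:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def alarm_distance(graph: Graph, v1: Hashable, t1: float,
                   v2: Hashable, t2: float, time_scale_s: float = 60.0) -> float:
    """Combine hop count and time difference (additive form is an assumption)."""
    spatial = hops(graph, v1, v2)
    if spatial is None:
        return math.inf  # alarms on disconnected vertices are never neighbours
    return spatial + abs(t1 - t2) / time_scale_s


if __name__ == "__main__":
    g: Graph = {"chassis": {"bay"}, "bay": {"chassis", "psu"}, "psu": {"bay"}}
    print(alarm_distance(g, "chassis", 0.0, "psu", 120.0))
```

Pairwise distances computed this way could be fed to any DBSCAN implementation that accepts a precomputed distance matrix.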

Setup

Single Instance

  • Install OpenNMS Horizon 23 or later.
  • Download and compile (mvn install) a copy of https://github.com/OpenNMS/oce on the same system as the OpenNMS instance.
  • From the Karaf shell run:
    • feature:repo-add mvn:org.opennms.oce/oce-karaf-features/1.0.0-SNAPSHOT/xml
    • feature:install opennms-oce-plugin

Running the alarm audit

To make sure that the system is configured properly and can accurately handle a known set of alarms, we provide an audit tool that simulates payloads for a number of SNMP trap and syslog message definitions. This tool currently makes some assumptions about the system.

Before running the tool:

  • Enable syslogd
  • Update syslogd to use the org.opennms.netmgt.syslogd.RadixTreeSyslogParser parser
  • Provision a node:
$OPENNMS_HOME/bin/provision.pl requisition add NODES
$OPENNMS_HOME/bin/provision.pl node add NODES localhost localhost
$OPENNMS_HOME/bin/provision.pl interface add NODES localhost 127.0.0.1
$OPENNMS_HOME/bin/provision.pl requisition import NODES