Path Outage How-To


Introduction

This feature addresses the need to suppress notifications for nodes that appear to be down to the OpenNMS system due to a failure in the network path between the nodes and OpenNMS. For example, if a WAN link fails, all nodes at the remote site served by the WAN link will appear to be down. Since we will get a notification that the router on the far end of the WAN link is not responding, we don't need notifications for all the devices that sit behind that router.

Continuing with the above example, if we test the IP address of the remote router's WAN interface when a node at the remote site fails to respond, then we know whether or not the path to the remote site is up, and we can determine whether or not to send a notification for the node that is not responding. The IP address that we will test is called the Critical Path IP Address for the node. All nodes that sit behind the remote site router can be configured with this critical path IP address to prevent a storm of notifications when the critical path IP address is not responding.

This feature is available beginning with release 1.3.2.

Configuring Path Outage

For an Individual Node

From the OpenNMS Node page, select the Admin link, then Configure Notifications, and then the Configure Path Outage link. Enter the critical path IP address in the box provided, and click the Submit button. This sets the critical path, overriding any critical path that may have been set previously for this node. To delete the critical path for this node, click the Delete button. It is not necessary to fill in the IP address for the delete operation.

Rule-Based Configuration

For a group of nodes that can be defined by a rule, it is more convenient to use rule-based configuration. From the main navigation bar, select the Admin link, then Configure Notifications, then Configure Path Outages. Enter the critical path IP address in the box provided. In the Current Rule: box, enter the rule that defines the group of nodes for this critical path IP. Typically this would be an IPADDR IPLIKE rule defining a set of IP addresses, but any valid rule will work. For example, nodelabel LIKE 'foobar%' will match all nodes with node labels beginning with foobar.

Click the Validate Rule link at the bottom of the page to test the rule. If you checked the Show matching node list box, a list of nodes matching the rule will also be shown. If the rule is invalid you will be returned to the page to correct the rule and try again. If you are satisfied with the results of the rule validation, click the Finish link. Otherwise click the Rebuild link to modify your rule.

To delete the critical path for a group of nodes, leave the critical path IP address blank. This will clear any critical path IP address that may have been previously set for nodes matching the rule.

Note 1: I'm not entirely sure about this, but I think one should make sure that the rule doesn't include the node to which the critical path IP address belongs. Otherwise, that node would end up in the same situation as the one explained for the defaultCriticalPathIp node in Configuration Tips below.

Note 2: This method is more efficient but has a major shortcoming: it does not remember the rule. Future nodes added to OpenNMS will not have the critical path automatically assigned to them. You must record your rules outside OpenNMS and routinely copy and paste them into the interface to keep critical paths up to date for new nodes. Also, if nodes are moved between networks their critical path may be incorrect until they’re manually updated. See also http://opennms.dougbakewell.ca/posts/set-critical-paths

Global Configuration

There are two things to do here. First, if you want nodeDown notifications suppressed when they are caused by a path outage, you must enable this behaviour by adding the following attribute to the poller-configuration tag in etc/poller-configuration.xml:

    pathOutageEnabled="true"
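
As a rough sketch of where the attribute goes, the opening tag might end up looking like the following. The other attributes and their values are shown only for context and are assumptions; keep whatever your installed file already contains and change only pathOutageEnabled.

    <!-- etc/poller-configuration.xml (sketch; the other attribute values are illustrative) -->
    <poller-configuration threads="30"
                          serviceUnresponsiveEnabled="false"
                          pathOutageEnabled="true">
        <!-- existing node-outage, package, and monitor definitions stay unchanged -->
    </poller-configuration>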

You may also set some optional global parameters in etc/opennms-server.xml, like so:

   <local-server server-name="nms1"
       defaultCriticalPathIp="192.168.0.1"
       defaultCriticalPathService="ICMP"
       defaultCriticalPathTimeout="1000"
       defaultCriticalPathRetries="1"
       verify-server="false">
   </local-server>

If the defaultCriticalPathIp is present, it is used as the critical path for all nodes that have not had a critical path specifically set. Typically you would set this to the IP address of your OpenNMS system's default gateway. Presently ICMP is the only service that can be used to test critical paths. This may change as this feature is enhanced. The defaultCriticalPathTimeout and defaultCriticalPathRetries determine how long we wait for a response when testing critical paths, and how many times we retry.

Alternatively, you can assign OpenNMS's default gateway as the initial critical path for all nodes. If you do, configure this path first, because it will overwrite any critical path entries set previously. When entering critical path information, start with the broadest path and then work toward more granular ones. Otherwise OpenNMS will overwrite the granular entries with the broader ones.

Configuration Tips

1. A unique situation arises when the node with the defaultCriticalPathIp goes down. The defaultCriticalPathIp will be tested and will not respond, so a nodeDown notification for this node will not be sent. To avoid this situation, manually configure a critical path IP address on this node, and set it to an address that will always respond to a ping. Your server's loopback address (usually 127.0.0.1) is a good choice.

2. Note that nodeUp notifications are NOT suppressed when the path outage situation is resolved. To receive "up" notifications for nodes that were truly down while avoiding a flood of unwanted nodeUp notifications due to path outages, don't use nodeUp notifications. Instead set autoNotify to "on" in your destination paths for the nodeDown notifications. See AutoNotify How-To for more information.
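
For tip 2, a destination path with autoNotify turned on might look roughly like the sketch below, assuming the setting lives in etc/destinationPaths.xml as an autoNotify element inside each target. The path name, target name, and command are hypothetical placeholders; check the AutoNotify How-To and the file shipped with your release for the exact layout.

    <!-- etc/destinationPaths.xml (sketch; the names and the command are placeholders) -->
    <path name="Email-Admin">
        <target>
            <name>Admin</name>
            <!-- send the "up" notice only to those who received the nodeDown notice -->
            <autoNotify>on</autoNotify>
            <command>javaEmail</command>
        </target>
    </path>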

Path Outage in provisioning requisitions

The methods described above do not work for a requisitioned (and regularly re-imported) node. If you configure the critical path that way, the path outage configuration is erased every time provisiond imports the group. Instead, put the definition into the imported XML by setting parent-foreign-id or parent-node-label on the node. See the example below:

    <node node-label="node.one" foreign-id="11111111">
        <interface status="1" snmp-primary="P" managed="true" ip-addr="1.1.1.1">
            <monitored-service service-name="ICMP"/>
        </interface>
    </node>
    <node node-label="node.two" foreign-id="22222222" parent-foreign-id="11111111">
        <interface status="1" snmp-primary="P" managed="true" ip-addr="2.2.2.2">
            <monitored-service service-name="ICMP"/>
        </interface>
    </node>
    <node node-label="node.three" foreign-id="33333333" parent-node-label="node.two">
        <interface status="1" snmp-primary="P" managed="true" ip-addr="3.3.3.3">
            <monitored-service service-name="ICMP"/>
        </interface>
    </node>

This will produce path outage entries like this:

    opennms=# select * from pathoutage ;
     nodeid | criticalpathip | criticalpathservicename
    --------+----------------+-------------------------
      11669 | 1.1.1.1        | ICMP
      11670 | 2.2.2.2        | ICMP
    (2 rows)

Viewing Path Outage Status

From the main navigation bar select Path Outages. You will see a table with each row showing a Critical Path Node, Critical Path IP, Critical Path Service (always ICMP in phase I), and the Number of Nodes dependent on this critical path. The background color for the Critical Path Service indicates the status of the service.

To view a list of nodes dependent on a given path, click on the number in the right-hand column. You will see a table of nodes and their status. The background color in the status column for each node indicates the status of the managed services on the node.

Some Additional Details (For Geeks Only)

NodeDown events are generated for all node down situations, whether due to a path outage or not. If a nodeDown event occurs because of a path outage, nodeDown notifications are suppressed for this node and two additional things happen. First, three more parameters are added to the nodeDown event:

   eventReason=pathOutage
   criticalPathIp=<Critical Path IP Address>
   criticalPathServiceName=ICMP

Second, an additional event with the UEI uei.opennms.org/nodes/pathOutage is created. This event has four parameters:

   nodelabel=<node label>
   criticalPathIp=<Critical Path IP Address>
   criticalPathServiceName=ICMP
   noticeSupressed=<true|false>

The noticeSupressed parameter indicates whether or not the nodeDown event matched any notices in the notifications.xml file. The pathOutage event could be used to create some alternate notification, say e-mail instead of page, and the parameters can be tested in the notification for further fine tuning. (See Using A Parameter for Notifications in the release notes.)
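
As an illustration of such an alternate notification, a notifications.xml entry along the following lines could send an e-mail only when a nodeDown notice was actually suppressed. This is a sketch under assumptions rather than a tested configuration: the notification name, the Email-Admin destination path, and the message text are hypothetical, and the varbind filter should be checked against the notifications.xml schema of your release.

    <!-- etc/notifications.xml (sketch; the name, destination path, and messages are placeholders) -->
    <notification name="pathOutageEmail" status="on">
        <uei>uei.opennms.org/nodes/pathOutage</uei>
        <rule>IPADDR != '0.0.0.0'</rule>
        <destinationPath>Email-Admin</destinationPath>
        <text-message>Node %parm[nodelabel]% is unreachable because critical path %parm[criticalPathIp]% is not responding.</text-message>
        <subject>Path outage: %parm[nodelabel]%</subject>
        <!-- match only events whose noticeSupressed parameter is true -->
        <varbind>
            <vbname>noticeSupressed</vbname>
            <vbvalue>true</vbvalue>
        </varbind>
    </notification>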