Making OpenNMS highly available

This page describes how to use OpenNMS in an HA cluster consisting of pacemaker and corosync. The example will also use DRBD, but that is an optional component used because the author does not have access to a shared storage device.

Motivation

Why would you want to do this? Mainly because you do not want your customers or your boss calling you about something being down just because some component of OpenNMS (be it hardware, software or database related) failed, did not recover automatically, and therefore never notified you about the problem.

The underlying need is to monitor the monitoring system. With this How-To, you can go one step further and make it highly available instead of just monitoring it.

Abstract and terms

An HA cluster is a group of computers (nodes) providing a service. A service is typically formed by several resources. A resource is managed by a resource agent, which in most cases is basically a shell script. The resource agent is the interface the cluster resource manager uses to manage the resource.

The most widely used cluster configuration is the 2-node-cluster, which is - well - a cluster formed by 2 nodes. And with 2-node-clusters, the most common setup is to have one active node that runs your service and one passive node that will jump into place if the first one should fail.

Pacemaker abstract

In pacemaker, you configure resources and rules (constraints). You basically say "Okay, so I have a webserver, a database and an IP address. I want the webserver and the IP address to run on node cliff, the database to run on node jason. If not possible otherwise, they may also run on the same node. Oh, oh, and if the network connection of a node should fail, please move everything away from that node to one with a working connection."

What pacemaker basically does is to execute shell scripts for your resources and evaluate the return code. If the return code does not match the expected return code, pacemaker computes a path of actions that have to be taken in order to meet the policy you configured. Let's take OpenNMS and its underlying PostgreSQL database as an example: If the database should die, it is quite likely that OpenNMS also needs a restart after PostgreSQL has been restarted. So the path would be:

  1. stop opennms
  2. stop postgresql (in order to make sure it is shut down safely)
  3. start postgres
  4. start opennms

In order to start a resource, it will execute the configured resource agent with the parameter "start". It will stop a resource by calling the script with the "stop" parameter and in order to monitor it - well - it will call it with the "monitor" parameter.
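
As an illustration, the skeleton of such a script looks roughly like this (a simplified sketch; real OCF resource agents also implement actions like meta-data and validate-all and use well-defined exit codes):

 #!/bin/sh
 # simplified sketch of a resource agent / init script, as the cluster calls it
 case "$1" in
   start)
     # start the service; exit 0 on success
     ;;
   stop)
     # stop the service; exit 0 on success (also when it was already stopped)
     ;;
   monitor|status)
     # exit 0 if the service is running, non-zero if it is not
     ;;
   *)
     echo "usage: $0 {start|stop|monitor|status}"
     exit 1
     ;;
 esac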

There are two cases in which pacemaker needs the help of a hardware device:

  1. A node does not respond
  2. A stop operation failed

In these situations, pacemaker has no way of knowing the current state of the node. Therefore it powercycles the node and tries to start the resources elsewhere. If this sounds a bit harsh to you, read #stonith.

Prerequisites

In order to run OpenNMS in an HA cluster, you need to meet certain prerequisites.

Hardware Requirements
  1. You need at least 2 nodes
  2. Each node needs at least 2 NICs
  3. You need some sort of shared storage for the configuration and data collection files (if you do not have a shared storage device, you can use #DRBD)
  4. It is strongly encouraged to run the cluster nodes on manageable power supplies (see #stonith)
Software Requirements
  1. Corosync or heartbeat (corosync in this How-To)
  2. pacemaker
  3. Make sure OpenNMS' init script is LSB compliant (this issue was resolved by NMS-3503); a quick way to check is sketched below

Corosync and heartbeat (you need to choose one) will deal with node-level failure and establish the communication channels between the nodes; pacemaker will deal with resource-level failure and resource management.
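
Regarding the LSB requirement, a quick sanity check of the init script's exit codes might look like this (a partial check only; see the LSB specification and the clusterlabs documentation for the complete list of required behaviours):

 /etc/init.d/opennms start; echo $?    # expect 0
 /etc/init.d/opennms status; echo $?   # expect 0 while the service is running
 /etc/init.d/opennms stop; echo $?     # expect 0
 /etc/init.d/opennms status; echo $?   # expect 3 while the service is stopped
 /etc/init.d/opennms stop; echo $?     # stopping an already stopped service must still return 0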

DRBD

DRBD is an open source project primarily run by Linbit (Vienna/Austria).

If you do not have a shared storage device, you can use DRBD in order to have your configuration and data collection files accessible on both nodes. DRBD is basically a RAID-1 over Ethernet. It syncs everything you write on your active node over to your passive node in real time and ensures at all times that the data on the passive node is identical to the data on the active node.

It is completely transparent to applications, which makes it an affordable way to have basically "everything" highly available. It sits between the application and the lower-level storage device.

A DRBD resource is always in one of two roles: primary or secondary. Only a resource in the primary role is writable; a resource in the secondary role is not even readable.

DRBD has a couple of really nice features, such as online data verification, with which you can make sure the data on both nodes is identical without taking it offline.

Another great feature is the data integrity (digest) algorithm. It makes sure that the data handed over from memory is actually the same as the data that arrived over the wire. With bad NICs or drivers it may happen that bits flip between your memory and the wire, which would corrupt your replicated data. DRBD can detect and recover from this.
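
For illustration, enabling both features might look like this in drbd.conf (a sketch for the DRBD 8.3 series used here; the hash algorithm is just an example):

 resource opennms {
         net {
                 verify-alg sha1;           # checksum algorithm used by "drbdadm verify"
                 data-integrity-alg sha1;   # checksum every block sent over the replication link
         }
         # ... device/disk/meta-disk/on sections as configured later in this How-To ...
 }

After applying the change with "drbdadm adjust opennms", an online verification run is started with "drbdadm verify opennms"; out-of-sync blocks are reported in the kernel log and get resynchronized by disconnecting and reconnecting the resource.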

Additional information on DRBD is available in #Documentation

IP addresses

One thing you have to think about when it comes to clustering OpenNMS is the nodes' IP addresses. Each node will obviously have its own "physical" IP address. If you just started OpenNMS on another node without also moving the IP address, you would likely run into problems, most of them SNMP-related.

SNMP agents are typically configured to send traps to one destination, and they are also very often set up to allow access from only one source address. So if you just moved OpenNMS to another machine (with another IP address), it is very likely that you would neither receive traps nor be able to perform SNMP data collection from that node.

Therefore, the cluster needs to be configured with a "floating" (some say "virtual") IP address. Moving this address over to the other node alongside OpenNMS makes sure you will receive traps on that node. But it does not necessarily also allow you to query SNMP: by default, additional IP addresses are not used for outgoing connections, so the source IP of outgoing connections is usually the "physical" (or "primary") IP address. The cluster therefore also needs to be configured with a rule that makes outgoing connections use the floating IP.
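
For illustration, this is how you could check by hand which source address outgoing connections will use once the floating address and the source-route rule are in place (the same commands the cluster will later run through its resource agents; 8.8.8.8 is just an arbitrary off-subnet destination):

 ip addr add 10.2.50.106/24 dev br0
 ip route replace default via 10.2.50.11 src 10.2.50.106
 ip route get 8.8.8.8    # the "src" field shows the address used for outgoing traffic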

Install software

The software ships with most Linux distributions; current versions of OpenSuSE, RedHat and Ubuntu have it. If you are having trouble finding packages, have a look at the clusterlabs install page.
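
Package names vary between distributions; on a Debian/Ubuntu system the installation might look roughly like this (the package names are an assumption, check your distribution's repositories):

 apt-get install pacemaker corosync drbd8-utils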

Actual configuration

Starting situation

We have a node "cliff", which is our current OpenNMS host. Its OpenNMS is living in /opt/opennms, its PostgreSQL is living in /opt/postgres, it has /dev/sda3 mounted to /opt with ~80 GB of space and it is configured with IP address 10.2.50.106. SNMP-agents allow connections from this IP and send traps to this IP only.

We do not have a shared storage device.

[Figure: Opennmsstart.png - the starting situation]

Target situation

[Figure: Opennmstarget.png - the target situation]

We have a new node "jason". The data is being replicated using DRBD. If OpenNMS cannot run on cliff, it shall be started on jason. Traps and SNMP GET requests must also work on jason.

First steps

First of all, you need to configure a new primary IP address for cliff (its old primary address, 10.2.50.106, will become the floating cluster address). How that is done depends on your distribution. On my OpenSuSE Linux, I needed to edit /etc/sysconfig/network/ifcfg-br0 (which is cliff's default network connection) and then run "rcnetwork restart".
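
For illustration, the relevant part of ifcfg-br0 on OpenSuSE might then look like this (10.2.50.104 is a hypothetical example for the new primary address; pick any free address in your network):

 BOOTPROTO='static'
 STARTMODE='auto'
 IPADDR='10.2.50.104'
 NETMASK='255.255.255.0'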

The next step is to set up the second machine. I'd recommend using the same hardware platform, the same operating system and version and the same partitioning layout.

Prepare lower level storage for DRBD

/dev/sda3 has 80633 1M blocks.

  # df -m
  Filesystem           1M-blocks   Used Available Use% Mounted on
  /dev/sda1                40313  33239      5026 87% /
  /dev/sda3                80633  41412     35125 55% /opt

In order for DRBD to write its meta-data on that device, we need to shrink the file system. As of this writing, DRBD will not need more than 128 MB of space for its meta-data, so this example actually is a bit too generous.

  # fsck -f /dev/sda3
  # resize2fs -p /dev/sda3 80000M

Configure DRBD

vi /etc/drbd.conf
resource opennms {
        device          /dev/drbd0;
        disk            /dev/sda3;
        meta-disk       internal;
        on cliff {
                address 10.250.250.104:7788;
        }
        on jason {
                address 10.250.250.105:7788;
        }
}

You configure a name for the resource (opennms), the name of the DRBD device (drbd0), the device file of the lower-level storage (/dev/sda3), where the meta-data should live (internal), and the address and port each node uses for the DRBD replication link.

Put this file on both machines, then

modprobe drbd

The next step is to create the meta-data on the device. On cliff, drbd will warn you that it found an existing file system and that your operation might destroy data.

drbdadm create-md opennms
warning:
Found ext3 filesystem which uses 81920000 kB
current configuration leaves usable 83880800 kB
 ==> This might destroy existing data! <==
Do you want to proceed?
[need to type 'yes' to confirm] yes

As long as the number after "current configuration leaves usable" is larger than the one above, you should be fine and type "yes" at the prompt. Otherwise you should not continue here since that would destroy your filesystem. Look into #Prepare lower level storage for DRBD in order to sort this out or use external meta-data.

After you have created the meta-data (on both machines!), you can bring up the resource. Make sure the filesystem on /dev/sda3 is no longer mounted at this point, because DRBD needs exclusive access to its backing device. Then bring up the resource using the command

drbdadm up opennms

This also needs to be done on each node.

You can then look at DRBD's proc interface

cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: 61b7f4c2fc34fe3d2acf7be6bcc1fc2684708a7d build by root@cliff, 2010-04-26
08:46:57
 0: cs:Connected st:Secondary/Secondary ds:Inconsistent/Inconsistent C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:83880800

In cs you see the connection state. st tells you about the roles of the drbd peers. In ds you will find the current data status. In this situation, each peer is secondary and the data is inconsistent. This is expected as you have not yet told DRBD which side has the "good" data.

Doing this is the next step. Make absolutely sure you do this on the correct machine (cliff in this case); otherwise you will overwrite your good data.

drbdadm -- --overwrite-data-of-peer primary opennms

If you then have a look at the proc interface again, you will see something like this:

cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: 61b7f4c2fc34fe3d2acf7be6bcc1fc2684708a7d build by root@cliff, 2010-04-26
08:46:57
 0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:36864 nr:0 dw:0 dr:37128 al:0 bm:2 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:83843936
[>....................] sync'ed: 0.1% (81878/81914)M
finish: 0:37:33 speed: 36,864 (36,864) K/sec

The local side is now the SyncSource in the Primary role and its data is UpToDate. The peer is in Secondary role and its data is still Inconsistent, since the replication has not finished yet.

On cliff, you can now continue to build your cluster; you don't have to wait for DRBD to finish the initial replication. Mount the filesystem from the DRBD device and bring everything up:

mount /dev/drbd0 /opt
ip addr add 10.2.50.106/24 dev br0
ip route replace default via 10.2.50.11 src 10.2.50.106
/etc/init.d/postgres start
/etc/init.d/opennms start

This mounts the replicated filesystem on /opt, configures the formerly primary IP address of cliff as a secondary address and makes it the source address for outgoing traffic, too. Then PostgreSQL and OpenNMS are started.

Now you need to make sure things still work. Send some traps, see if data is collected, etc. Once you are confident things still work, shut down OpenNMS and PostgreSQL, unmount the file system, bring down the secondary IP and demote the DRBD device to the secondary role (these teardown commands are sketched a bit further down). Then, repeat the following steps on jason and make sure things also work there:

drbdadm primary opennms
mount /dev/drbd0 /opt
ip addr add 10.2.50.106/24 dev br0
ip route replace default via 10.2.50.11 src 10.2.50.106
/etc/init.d/postgres start
/etc/init.d/opennms start

At this point, you can already do manual failover. So in case the cliff node should die, you could manually start OpenNMS just as shown above.
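
For completeness, handing the service back by hand works the same way in reverse. On the node that currently runs everything, the teardown might look like this sketch (restoring the default route depends on your routing setup):

 /etc/init.d/opennms stop
 /etc/init.d/postgres stop
 umount /opt
 ip addr del 10.2.50.106/24 dev br0
 ip route replace default via 10.2.50.11
 drbdadm secondary opennms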

Configure corosync

You need to generate an authentication key for corosync. That's easily done with the command

corosync-keygen
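
corosync-keygen writes the key to /etc/corosync/authkey. The key has to be identical on both nodes, so copy it over while preserving its restrictive permissions, for example:

 scp -p /etc/corosync/authkey root@jason:/etc/corosync/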

Then you need to edit /etc/corosync/corosync.conf. It comes with sane default values; you just need to adapt it to your network configuration:

      interface {
              ringnumber: 0
              bindnetaddr: 10.250.250.0
              mcastaddr: 226.94.1.1
              mcastport: 5405
      }

and at the end of the file, tell corosync to spawn pacemaker.

 service {
 	# Load the Pacemaker Cluster Resource Manager
 	name: pacemaker
 	ver:  0
 }

Apply these changes on both nodes.

For a complete sample configuration file, look at the clusterlabs site

You should now be able to start corosync using its init-script. Remember to do this on both nodes.

/etc/init.d/corosync start
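
Before moving on, you can verify on each node that corosync has formed its ring without faults (the output shows the ring status and the local node address):

 corosync-cfgtool -s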

Configuring pacemaker

In this situation, the output of the cluster resource manager monitor crm_mon should look something like this:

# crm_mon -rf1
============
Last updated: Mon Mar 8 13:48:53 2010
Stack: openais
Current DC: jason - partition with quorum
Version: 1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ cliff jason ]
Full list of resources:
Migration summary:
* Node cliff:
* Node jason:

We have two online nodes and - as of now - zero resources.

pacemaker is configured using the crm shell. You enter the configure mode with the command

crm configure

and after you have added or modified the configuration, you need to issue the commit command in order to save the changes.

The first thing to do in a 2-node-cluster is to configure pacemaker to ignore quorum. Quorum is the idea of having only a defined set of nodes provide the service in case of a cluster split. Say there is a problem in your network and your 5-node cluster is divided into partitions of 2 and 3 nodes. It would be bad if both partitions now started the service. Quorum is something a cluster partition has if it contains more than 50% of the cluster's nodes. If a partition has quorum, it may run resources. Otherwise its nodes have to shut down any resources they might be running at that time. This is another safety mechanism.

In a 2-node-cluster, however, there is no such thing as >50% of the cluster once the nodes cannot see each other. Each node is exactly 50% of the cluster and can never be more than that. So in order to allow a node that does not see the other node (which is exactly what happens in case of a failure) to keep running resources, you need to configure

property no-quorum-policy="ignore"

in crm configure mode.
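
In a crm shell session this looks something like the following sketch (the prompt is what the crm shell prints; don't forget the commit):

 # crm configure
 crm(live)configure# property no-quorum-policy="ignore"
 crm(live)configure# commit
 crm(live)configure# quit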

Then you can configure DRBD as a resource for pacemaker:

#  Configure the DRBD resource
 primitive drbd-opennms ocf:linbit:drbd \
  params drbd_resource="opennms" \
  op monitor interval="15s"

#  Then make it a multistate resource that runs as a master on one, as a slave on the other node
 ms ms-opennms drbd-opennms \
  meta master-max="1" master-node-max="1" \
  clone-max="2" clone-node-max="1" \
  notify="true" globally-unique="false"

The cluster resource manager monitor will now look something like this (I cut the output header here)

# crm_mon -rf1
Online: [ cliff jason ]
Full list of resources:
 Master/Slave Set: ms-opennms
     Masters: [ cliff ]
     Slaves: [ jason ]
Migration summary:
* Node cliff:
* Node jason:

The next step is to configure the dependencies for OpenNMS:

# IP address
 primitive ipaddress ocf:heartbeat:IPaddr2 \
  params ip=10.2.50.106 cidr_netmask=24 nic=br0 \
  op monitor interval="60s" timeout="40s"

# Filesystem
 primitive fs-opt ocf:heartbeat:Filesystem \
  params device="/dev/drbd0" directory="/opt" fstype="ext3" \
  op monitor interval="60s" timeout="40s"

# Postgres DB
 primitive postgres ocf:heartbeat:pgsql \
  params pgctl="/usr/bin/pg_ctl" pgdata="/opt/postgres/data" \
  logfile="/opt/postgres/data/logfile" \
  op monitor interval="60s" timeout="40s"

# Source IP rule
 primitive srcipaddress ocf:heartbeat:IPsrcaddr \
  params ipaddress="10.2.50.106" \
  op monitor interval="60s" timeout="40s"

# Group the dependencies
 group dependencies ipaddress fs-opt postgres srcipaddress

# Run group where the DRBD Master runs
 colocation dependencies-on-ms-opennms-master inf: dependencies ms-opennms:Master

# Start group after you promoted DRBD to Master mode
 order ms-opennms-promote-before-dependencies-start inf: ms-opennms:promote \
  dependencies:start

The monitor should now look something like this:

# crm_mon -rf1
Online: [ cliff jason ]
Full list of resources:
 Master/Slave Set: ms-opennms
     Masters: [ cliff ]
     Slaves: [ jason ]
 Resource Group: dependencies
     ipaddress (ocf::heartbeat:IPaddr2):        Started cliff
     fs-opt     (ocf::heartbeat:Filesystem):    Started cliff
     postgres (ocf::heartbeat:pgsql): Started cliff
     srcipaddress       (ocf::heartbeat:IPsrcaddr):     Started cliff
Migration summary:
* Node cliff:
* Node jason:

So finally, you can configure OpenNMS as a resource in pacemaker:

# OpenNMS
 primitive opennms lsb:opennms \
  op start timeout=300s \
  op stop timeout=120s \
  op monitor interval=60s timeout=40s

# Run OpenNMS on the Master
  colocation opennms-on-ms-opennms-master inf: opennms ms-opennms:Master

# Start it after the dependencies
 order dependencies-start-before-opennms-start inf: dependencies:start \
  opennms:start

The monitor will reflect the changes like this:

# crm_mon -rf1
Online: [ cliff jason ]
Full list of resources:
 Master/Slave Set: ms-opennms
     Masters: [ cliff ]
     Slaves: [ jason ]
 Resource Group: dependencies
     ipaddress (ocf::heartbeat:IPaddr2):         Started cliff
     fs-opt      (ocf::heartbeat:Filesystem):    Started cliff
     postgres (ocf::heartbeat:pgsql): Started cliff
     srcipaddress        (ocf::heartbeat:IPsrcaddr):     Started cliff
opennms     (lsb:opennms): Started cliff
Migration summary:
* Node cliff:
* Node jason:

You're almost there. What's left is to configure network connection checking. This is done by a cluster resource, the ping daemon (pingd). It continuously sends ICMP echo requests to a configurable list of hosts and, based on the number of hosts it receives a response from, sets a node attribute. You can then define rules that look at the value of this attribute in order to allow or forbid placement of resources on a node.

# Run a “ping” process to a couple of IP addresses in order to tell whether the network connection is working
 primitive pingd ocf:pacemaker:pingd \
  params host_list="10.2.50.12 10.2.50.11 10.2.50.19 10.2.50.40" \
  op monitor interval=60s timeout=40s

# Run this ping process on all nodes (clone it)
  clone cl-pingd pingd

# Locate the Master role of DRBD to a node where the ping attribute is >0
 location ms-opennms-master-connected ms-opennms \
  rule $id="ms-opennms-master-connected-rule-1" \
  $role="Master" -inf: not_defined pingd or pingd lte 0

Since I configured 4 ping nodes, the attribute "pingd" has a value of 4 on both nodes now (see the last 2 lines; the stonith resources shown in this output are configured in the next step):

# crm_mon -rf1
Online: [ cliff jason ]
Full list of resources:
 Master/Slave Set: ms-opennms
     Masters: [ cliff ]
     Slaves: [ jason ]
 Resource Group: dependencies
     ipaddress (ocf::heartbeat:IPaddr2):         Started cliff
     fs-opt      (ocf::heartbeat:Filesystem):    Started cliff
     postgres (ocf::heartbeat:pgsql): Started cliff
     srcipaddress        (ocf::heartbeat:IPsrcaddr):     Started cliff
opennms     (lsb:opennms): Started cliff
 Clone Set: cl-pingd
     Started: [ cliff jason ]
apcstonith-cliff (stonith:apcmastersnmp): Started jason
apcstonith-jason (stonith:apcmastersnmp): Started cliff
Migration summary:
* Node cliff: pingd=4
* Node jason: pingd=4

The last thing you need to do is to configure your stonith devices. Since this is hardware dependent, my example will likely not match your environment, but I'll post it for completeness:

# Configure the stonith devices (depends on your hardware)
 primitive apcstonith-cliff stonith:apcmastersnmp \
  params community="testcom" port="161" ipaddr="10.2.50.154" \
  op monitor interval="3600" timeout="120" \
  op start requires="nothing"
 primitive apcstonith-jason stonith:apcmastersnmp \
  params community="testcom" port="161" ipaddr="10.2.50.155" \
  op monitor interval="3600" timeout="120" \
  op start requires="nothing"
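
It also makes sense to keep each stonith resource away from the node it is supposed to fence, which is what the crm_mon output above already reflects. A sketch using location constraints:

 location l-apcstonith-cliff apcstonith-cliff -inf: cliff
 location l-apcstonith-jason apcstonith-jason -inf: jason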

Start testing

Now you can start testing your cluster.

Pull network cables, pull power plugs, kill processes, shut down switches. The more you test, the better you will understand the cluster and what it will do in which situation.
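
Besides pulling cables, a few software-level tests you can run while watching crm_mon -rf in a second terminal (a sketch; adapt node and resource names to your setup):

 # put a node into standby (all resources move away) and bring it back
 crm node standby cliff
 crm node online cliff
 # stop OpenNMS behind the cluster's back and watch the monitor operation recover it
 /etc/init.d/opennms stop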

Documentation

http://www.clusterlabs.org

http://www.drbd.org

Support

Commercial support for pacemaker, corosync and drbd is available from Linbit, Novell and RedHat (this list is alphabetically sorted and incomplete!).

stonith

A question people tend to ask when they hear about pacemaker powercycling nodes in case of certain failures is "is this really necessary?". And I'd just like to ask a question in return: how do you know what is going on on a node that you cannot log in to? Say the node is not responding to ping and is not accepting SSH connections.

The only thing you can do here is to assume what is going on there.

  • Maybe it is still running "something"
  • Maybe it is still using the shared data
  • Maybe it really does not do anything

But there is no way to know what is really happening there.

The same applies to a resource that has just failed to stop. What is the cluster supposed to do about this? Just pretend it worked and possibly damage your data by starting the resource elsewhere, resulting in concurrent access? Try "stop" again?

In my opinion - and in that of many cluster developers - there is no way of knowing what is going on there. If you know a way, let us know!

So the cluster needs a mechanism that makes sure this node does not access the shared data before the cluster starts the service anywhere else. Concurrent access to shared data (an ext3 filesystem, for example) might otherwise blow things to bits within seconds. This mechanism is called fencing. Some clusters implement it by managing (turning off) switch or SAN ports; pacemaker implements it by powercycling a node.

If you pull the power plug, the machine definitely does not do anything anymore and you can safely start using the data on another node. It turns the assumption "the node is dead" into a fact.

The cluster component doing this is called "STONITH", which is short for "shoot the other node in the head".

This will protect your data. You can disable it, but you are strongly encouraged not to.