Linux mdadm monitoring through SNMP


Introduction

We were looking for a decent solution to monitor our software RAID (mdadm) servers running Linux (Debian and CentOS, to be specific). Many different implementations were available; we wanted to pick one that respects the spirit of OpenNMS by avoiding a script being executed on every check. SNMP polling is very performant and scalable in OpenNMS, so it made sense to us to find an SNMP-based solution. We also wanted something easy to implement, and we found this excellent project: https://github.com/stefansaraev/snmp-swraid (which no longer exists).

This implementation works as a net-snmp module. Basically, we compiled it once on a Debian server with the correct dependencies, then copied the resulting .so file to each node running a software RAID array.
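
For reference, the data the module exposes corresponds to what the kernel reports in /proc/mdstat (presumably its source; the block counts below are illustrative). A healthy pair of two-disk RAID1 arrays, matching the snmpwalk example later on this page, looks like this; [2/2] [UU] means two of two devices are active, while a degraded array would show [2/1] [U_]:

cat /proc/mdstat
 Personalities : [raid1]
 md1 : active raid1 sdb3[0] sda3[1]
       487253952 blocks super 1.2 [2/2] [UU]
 md0 : active raid1 sdb1[0] sda1[1]
       975296 blocks super 1.2 [2/2] [UU]
 unused devices: <none>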

Compilation

  • Dependencies on Debian: libsnmp-dev
apt-get install libsnmp-dev
  • Check it out using Git:
git clone https://github.com/opennms-config-modules/snmp-swraid.git
cd snmp-swraid
ln -s /usr/include/net-snmp .
make

You'll get: swRaidPlugin.so
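
Before copying the library to your nodes, you can sanity-check that the build produced a loadable shared object (the output shown is illustrative and depends on your build host's architecture):

file swRaidPlugin.so
 swRaidPlugin.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked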

Setting up the servers to monitor (mdadm/net-snmp)

  • Copy MIB and library:
cp -ar SWRAID-MIB.txt /usr/share/mibs/netsnmp/
cp -ar swRaidPlugin.so /usr/lib/
  • Edit the /etc/snmp/snmpd.conf file and add:
  dlmod swRaidMIB /usr/lib/swRaidPlugin.so
  • Restart the snmpd service:
systemctl restart snmpd

Then you can query the OID:

snmpwalk -v2c -c public your-mdadm.server.tld .1.3.6.1.4.1.2021.13.18
 iso.3.6.1.4.1.2021.13.18.1.1.1.1 = INTEGER: 1
 iso.3.6.1.4.1.2021.13.18.1.1.1.2 = INTEGER: 2
 iso.3.6.1.4.1.2021.13.18.1.1.2.1 = STRING: "md1"
 iso.3.6.1.4.1.2021.13.18.1.1.2.2 = STRING: "md0"
 iso.3.6.1.4.1.2021.13.18.1.1.3.1 = STRING: "raid1"
 iso.3.6.1.4.1.2021.13.18.1.1.3.2 = STRING: "raid1"
 iso.3.6.1.4.1.2021.13.18.1.1.4.1 = STRING: "sdb3[0] sda3[1]"
 iso.3.6.1.4.1.2021.13.18.1.1.4.2 = STRING: "sdb1[0] sda1[1]"
 iso.3.6.1.4.1.2021.13.18.1.1.5.1 = INTEGER: 2
 iso.3.6.1.4.1.2021.13.18.1.1.5.2 = INTEGER: 2
 iso.3.6.1.4.1.2021.13.18.1.1.6.1 = INTEGER: 2
 iso.3.6.1.4.1.2021.13.18.1.1.6.2 = INTEGER: 2
 iso.3.6.1.4.1.2021.13.18.100.0 = INTEGER: 0
 iso.3.6.1.4.1.2021.13.18.101.0 = ""
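
Reading the table: based on the output above, column .1.1.2 appears to be the array name, .1.1.3 the RAID level, .1.1.4 the member devices, and .1.1.5/.1.1.6 the total and active device counts; .100.0 and .101.0 look like a global error flag and error message. This mapping is inferred from the sample output, so verify it against SWRAID-MIB.txt. Since the MIB was copied to /usr/share/mibs/netsnmp/, you can also ask net-snmp to translate the numeric OIDs into symbolic names, provided the MIB file is also available on the polling host:

snmpwalk -v2c -c public -m +SWRAID-MIB your-mdadm.server.tld .1.3.6.1.4.1.2021.13.18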

Integration with OpenNMS

Depending on your environment, you can use the generic poller configuration, which tells you whether any mdadm array has issues, or the more detailed one, which shows exactly which mdadm array is down.


Note: OpenNMS needs a restart for the poller configuration changes to take effect.
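
On most systemd-based installs that means (the unit name can differ depending on packaging):

systemctl restart opennms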


Detailed Version

Modify /etc/opennms/poller-configuration.xml: the <service> definitions below go inside an existing <package> element, and the matching <monitor> lines go at the bottom of the file alongside the other monitor declarations.

        <service name="Mdadm_Array_1" interval="300000"
            user-defined="false" status="on">
            <parameter key="retry" value="1"/>
            <parameter key="timeout" value="3000"/>
            <parameter key="port" value="161"/>
            <parameter key="oid" value=".1.3.6.1.4.1.2021.13.18.1.1.6.1"/>
            <parameter key="operator" value="&lt;="/>
            <parameter key="operand" value="2"/>
        </service>
        <service name="Mdadm_Array_2" interval="300000"
            user-defined="false" status="on">
            <parameter key="retry" value="1"/>
            <parameter key="timeout" value="3000"/>
            <parameter key="port" value="161"/>
            <parameter key="oid" value=".1.3.6.1.4.1.2021.13.18.1.1.6.2"/>
            <parameter key="operator" value="&lt;="/>
            <parameter key="operand" value="2"/>
        </service>
        <service name="Mdadm_Array_3" interval="300000"
            user-defined="false" status="on">
            <parameter key="retry" value="1"/>
            <parameter key="timeout" value="3000"/>
            <parameter key="port" value="161"/>
            <parameter key="oid" value=".1.3.6.1.4.1.2021.13.18.1.1.6.3"/>
            <parameter key="operator" value="&lt;="/>
            <parameter key="operand" value="2"/>
        </service>
        <service name="Mdadm_Array_4" interval="300000"
            user-defined="false" status="on">
            <parameter key="retry" value="1"/>
            <parameter key="timeout" value="3000"/>
            <parameter key="port" value="161"/>
            <parameter key="oid" value=".1.3.6.1.4.1.2021.13.18.1.1.6.4"/>
            <parameter key="operator" value="&lt;="/>
            <parameter key="operand" value="2"/>
        </service>

    <monitor service="Mdadm_Array_1" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>
    <monitor service="Mdadm_Array_2" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>
    <monitor service="Mdadm_Array_3" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>
    <monitor service="Mdadm_Array_4" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

If you have more than four arrays per node, add further service and monitor definitions to the poller configuration, incrementing the final index of the OID for each additional array, as in the example below.
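
For instance, a fifth array would follow the same pattern (the service name is ours; only the last OID index changes):

        <service name="Mdadm_Array_5" interval="300000"
            user-defined="false" status="on">
            <parameter key="retry" value="1"/>
            <parameter key="timeout" value="3000"/>
            <parameter key="port" value="161"/>
            <parameter key="oid" value=".1.3.6.1.4.1.2021.13.18.1.1.6.5"/>
            <parameter key="operator" value="&lt;="/>
            <parameter key="operand" value="2"/>
        </service>

    <monitor service="Mdadm_Array_5" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>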

Generic Version

Add the following to /etc/opennms/poller-configuration.xml, in the same places as above:

        <service name="Mdadm_Health" interval="300000"
            user-defined="false" status="on">
            <parameter key="retry" value="1"/>
            <parameter key="timeout" value="3000"/>
            <parameter key="port" value="161"/>
            <parameter key="oid" value=".1.3.6.1.4.1.2021.13.18.1.1.6"/>
            <parameter key="operator" value="&lt;="/>
            <parameter key="operand" value="2"/>
            <parameter key="walk" value="true"/>
        </service>
    <monitor service="Mdadm_Health" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>
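
With walk set to true, the SnmpMonitor walks every instance under .1.1.6, so a single service covers all arrays on the node; the trade-off is that an outage notification will not tell you which array is affected. Depending on your OpenNMS version, the SnmpMonitor should also accept a match-all parameter to control whether all walked instances or just one must satisfy the comparison; check the SnmpMonitor documentation for your release.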