SNMP Issues


This page tracks the desired behavior of data collection in general, and SNMP in particular, for 1.4 and 2.0.

There are several requirements for choosing which version of SNMP should be used for data collection. As an enterprise tool, OpenNMS needs to determine intelligently and automatically which version to use.

SNMP Version Choice

In 2.0, we should have at least four options for SNMP version: SNMPv1, SNMPv2c with GETBULK, SNMPv2c without GETBULK, and SNMPv3. The two SNMPv2c variants exist because some agents that "support" v2c fall over or stop responding when they receive a GETBULK. While OpenNMS could just default to v1, that would rule out collecting 64-bit counters, which are increasingly common and important.

Assuming that nothing is in snmp-config.xml but defaults (i.e. no <definition> tags), OpenNMS should test and use in this order:

  • SNMPv3
  • SNMPv2c (w/ optimized GETBULK)
  • SNMPv2c
  • SNMPv1

It would also be nice if this order could be configured.
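
The probe order above could be sketched roughly as follows. This is only an illustration of the idea: choose_version, DEFAULT_ORDER, the version labels, and the probe callable are hypothetical names rather than OpenNMS APIs, and a real probe would send an actual SNMP request instead of consulting a set.

```python
from typing import Callable, Optional, Sequence

# Preference order from the list above. The labels are hypothetical,
# not OpenNMS constants.
DEFAULT_ORDER = ("v3", "v2c-bulk", "v2c", "v1")


def choose_version(
    probe: Callable[[str], bool],
    order: Sequence[str] = DEFAULT_ORDER,
) -> Optional[str]:
    """Return the first version in `order` for which `probe` succeeds.

    `probe` stands in for a real SNMP test request (an assumption of this
    sketch); it returns True when the agent answers for that version.
    """
    for version in order:
        if probe(version):
            return version
    return None  # no version answered


# Example: an agent without v3 that handles v2c but not GETBULK.
supported = {"v2c", "v1"}
chosen = choose_version(lambda v: v in supported)
print(chosen)  # v2c
```

Passing a different order argument is one way the configurable ordering mentioned above might be exposed.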

There also needs to be a way to store the version chosen for a given interface in the database so that it doesn't need to be tested each time OpenNMS starts up. If something happens to cause that service to fail, a rescan should correct to the next best version.

For example, suppose a device that supports SNMPv3, SNMPv2c, and SNMPv1 is added to OpenNMS. Per the above list, the application should mark the device to use SNMPv3.

Then something happens, and SNMPv3 stops working. A "data collection failed" event will be generated. The user should then investigate the problem and correct it. If, however, it is not possible to correct it, they should be able to rescan the device. During the rescan the SNMPv3 test will fail but the SNMPv2c test will still succeed, so SNMPv2c will be chosen.
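
A minimal sketch of this store-then-rescan behavior, assuming an in-memory dictionary in place of the database and a hypothetical probe callable in place of a real SNMP test. VersionStore and the version labels are invented for illustration and do not reflect the OpenNMS schema.

```python
from typing import Callable, Dict, Optional, Sequence

ORDER = ("v3", "v2c-bulk", "v2c", "v1")  # hypothetical labels


class VersionStore:
    """In-memory stand-in for the database field holding the chosen version."""

    def __init__(self) -> None:
        self._by_interface: Dict[str, str] = {}

    def get(self, interface: str) -> Optional[str]:
        """Return the stored version without re-testing the agent."""
        return self._by_interface.get(interface)

    def rescan(
        self,
        interface: str,
        probe: Callable[[str], bool],
        order: Sequence[str] = ORDER,
    ) -> Optional[str]:
        """Re-test each version in order and store the best one that works."""
        for version in order:
            if probe(version):
                self._by_interface[interface] = version
                return version
        self._by_interface.pop(interface, None)  # nothing works any more
        return None


store = VersionStore()
# The agent in this sketch answers v3, v2c and v1 but not GETBULK.
working = {"v3", "v2c", "v1"}
first = store.rescan("192.0.2.10", lambda v: v in working)   # initial scan
working.discard("v3")                                        # SNMPv3 stops working
second = store.rescan("192.0.2.10", lambda v: v in working)  # user-triggered rescan
print(first, second, store.get("192.0.2.10"))
```

Between rescans, only get is called, so nothing is re-tested at startup.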

It may be possible to fail over automatically without the rescan, but it would probably be unwise. The additional overhead of such code on large networks (e.g. 50K interfaces) would be detrimental, and it would be hard to tell a timeout caused by the agent being completely down from one caused by a service that no longer works.

Display Version in the webUI

Whatever version is being used for data collection should be displayed on at least the node's page.

Multiple Ports

Sometimes a node has multiple agents that listen on different ports of the node's interfaces. For instance, there are a number of HP-UX servers running the default HP SNMP agent on port 161. However, that agent doesn't provide some features or information that the Net-SNMP agent can, so the user decides to run the Net-SNMP agent on port 1161.

It is possible to collect from both:

<package name="example1">
  <filter>IPADDR IPLIKE *.*.*.*</filter>
  <include-range begin="" end=""/>
  <service name="SNMP" interval="300000" user-defined="false" status="on">
    <parameter key="collection" value="default"/>
    <parameter key="port" value="161"/>
    <parameter key="retry" value="3"/>
    <parameter key="timeout" value="3000"/>
  </service>
  <outage-calendar>zzz from poll-outages.xml zzz</outage-calendar>
</package>

<package name="net-snmp">
  <filter>pollerCategory == "net-snmp"</filter>
  <include-range begin="" end=""/>
  <service name="SNMP" interval="300000" user-defined="false" status="on">
    <parameter key="collection" value="net-snmp"/>
    <parameter key="port" value="1161"/>
    <parameter key="retry" value="3"/>
    <parameter key="timeout" value="3000"/>
  </service>
  <outage-calendar>zzz from poll-outages.xml zzz</outage-calendar>
</package>

Capsd scans the HP-UX agent and adds all of the info to the database, such as systemOID. Then collectd is configured to look at a different port and a different collection schema (net-snmp) which just includes the info provided by the Net-SNMP agent. You wouldn't want to double collect interface information, for example.
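
To make the port and collection pairing concrete, the following sketch parses a trimmed copy of the two packages and lists which port and collection each one uses. The wrapper element and the snmp_targets helper are assumptions added so the snippet parses as a single document; this is not how collectd itself reads its configuration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two packages above. The root element is added here
# only so the text parses as one XML document (an assumption of this sketch).
CONFIG = """
<collectd-configuration>
  <package name="example1">
    <service name="SNMP" interval="300000" status="on">
      <parameter key="collection" value="default"/>
      <parameter key="port" value="161"/>
    </service>
  </package>
  <package name="net-snmp">
    <service name="SNMP" interval="300000" status="on">
      <parameter key="collection" value="net-snmp"/>
      <parameter key="port" value="1161"/>
    </service>
  </package>
</collectd-configuration>
"""


def snmp_targets(xml_text):
    """Return (package, port, collection) for every SNMP service found."""
    root = ET.fromstring(xml_text)
    targets = []
    for package in root.findall("package"):
        for service in package.findall("service"):
            if service.get("name") != "SNMP":
                continue
            # Collect the key/value parameters of this service into a dict.
            params = {
                p.get("key"): p.get("value")
                for p in service.findall("parameter")
            }
            targets.append(
                (package.get("name"), int(params["port"]), params["collection"])
            )
    return targets


targets = snmp_targets(CONFIG)
for name, port, collection in targets:
    print(name, port, collection)
```

Each package contributes one (port, collection) pair, so the same node can be collected from twice without the two collections overlapping.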

Improve Collectors in capsd

In the beginning, data collection was an extension of the SNMP monitor service, and there was no Collectd. Then the poller was basically (and some might say, poorly) cloned into Collectd. In 1.3 and 1.7 Collectd was extensively refactored, but it is probably time to draw even more lines between services used in collection and services used in monitoring.

For example, by default SNMP is used in collection, but not monitoring. In Bug 1443 there is a problem where inactive services cause dead nodes not to be deleted, since there are still one or more services on the node. But suppose SNMP were that "inactive" service? Sure, it's not being monitored, but it is being used for collection.

The time has come to split those "services" used for monitoring from those used for collection. So, there could be an "SNMP" service that covered all versions of SNMP, a "JBoss" service that covered all versions of JBoss, etc.
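
One way to picture the split is to record the monitoring role and the collection role separately for each service, so that an unmonitored SNMP service that is still collected is not mistaken for an unused one. The sketch below is purely illustrative: NodeService, in_use, and only_unused_services are hypothetical names, not the OpenNMS data model, and what should happen to a node with only unused services is a policy decision the sketch leaves open.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeService:
    name: str
    monitored: bool  # polled for availability
    collected: bool  # used as a data collection source


def in_use(service: NodeService) -> bool:
    """A service counts as in use if either role is active."""
    return service.monitored or service.collected


def only_unused_services(services: Iterable[NodeService]) -> bool:
    """True when nothing on the node is monitored or collected."""
    return not any(in_use(s) for s in services)


snmp = NodeService("SNMP", monitored=False, collected=True)
stale_http = NodeService("HTTP", monitored=False, collected=False)

print(only_unused_services([stale_http]))        # True
print(only_unused_services([stale_http, snmp]))  # False
```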

Multiple Community Strings

There are a couple of open bugs on multiple community string support, such as Bug 1041. Several people have asked for the ability to have OpenNMS try a list of community strings against a device. While the standard answer has been "manage your comm strings better", perhaps this is a chance for us to add this via configuration if it isn't too much trouble.
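
Trying a list of community strings against a device could look roughly like this. first_working_community and the probe callable are hypothetical; a real probe would issue an SNMP GET with the given community string and report whether the agent answered, and the example strings are made up.

```python
from typing import Callable, Iterable, Optional


def first_working_community(
    candidates: Iterable[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first community string the agent accepts, or None."""
    for community in candidates:
        if probe(community):
            return community
    return None


# Fake agent that only accepts "s3cret" (a made-up value for this example).
tried = []


def fake_probe(community: str) -> bool:
    tried.append(community)  # record each attempt to show the early exit
    return community == "s3cret"


found = first_working_community(["public", "s3cret", "private"], fake_probe)
print(found, tried)
```

The loop stops at the first match, which keeps the number of failed requests sent to each device as low as the list order allows.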

For quick and dirty multiple community string support see Script: Multiple SNMP Community String