OpenNMS is the focus of an article published on Linux Planet. It's a case study of how OpenNMS was deployed at one company to consolidate and improve on their existing network management solution. In case the link fails, the article is included here.
Why Do We Need a Systems Management Tool?
Current trends in the IT world continue to accelerate the rate of change in every area. Applications, server platforms and networks are no longer the slow-moving entities they once were. They are subject to change on an almost daily basis. In this environment, it becomes more and more important for the IT Operations team to quickly detect and respond to changes or anomalous events.
My employer is a relatively new business. The original plan was that applications would be customized packages and that a large part of its core IT systems would be outsourced. Slowly, those package-based solutions morphed into custom applications and, for a variety of reasons, the outsourced systems were brought back in-house a couple of years ago. This presented those of us in the IT department with some interesting challenges. One of those challenges was how to go about managing our newly re-acquired IT infrastructure and applications.
When we first decided to move our core systems from an outsourced to an in-house IT Operations Department, our requirements were limited. Checking the availability of some services and the load on the network and key servers was about as much as we thought we needed.
It became obvious over time that this was rather optimistic. Each new service added seemed to result in a new management tool being installed on a System Administrator's workstation. At one point we had three separate network monitoring systems, three separate performance management tools and a plethora of different scripts, web pages and command line tools. The DBA team had one tool, the Network Admins another, the Unix and Windows teams yet another. We sent out critical alerts by email, pager, and SMS, often to completely inappropriate people.
The company was growing, and it looked like it was beginning to need a grown-up systems management tool, but which one?
What Do We Expect from a Systems Management Application?
There is definitely a "sweet spot" for systems management applications. Some are suited to smaller environments; others are most definitely suited to enterprise-scale environments with more demanding requirements. Unsurprisingly, the enterprise-scale products often come with enterprise-scale price tags and learning curves.
We had a few key requirements:
- Platform independence: Our network management system would have to run on available hardware (at the time, SPARC/Solaris).
- Performance: Any solution would need to scale from a few hundred nodes to a few thousand nodes.
- Enterprise level features: We required at least SNMP trap management, configurable alert escalation, and availability and performance reports for the management team.
- Rationalize support roles: We needed to be able to take individuals out of the process. That meant an end to emails sent by systems to developers in the middle of the night. Our operations team needed to be the first contact for every event.
- Reduce tasks: It would need to lighten the burden on the Operations Team, not increase it.
- Extensibility: Previous experience indicated that there was no such thing as a complete solution.
- Low cost of entry: It needed to replace a portfolio of Open Source products.
- Longevity: Some Open Source products seem to wither on the vine with no apparent cause, or fragment through disagreements between developers. Commercial products too are subject to the vagaries of the market.
OpenNMS checked a lot of these boxes. It was (mostly) Java, so we could run it on our Sun hardware. OpenNMS was already running in environments an order of magnitude larger than ours. It had a lot of the enterprise level features absent from other Open Source products. There were documents available on the Internet that pointed to its extensibility. It was based on a lot of familiar components (Tomcat, PostgreSQL, RRDtool). Finally, in Open Source terms, it was a relatively mature product.
We took a cautious approach to deploying OpenNMS.
Simplest to replace, and therefore first to go, were the existing network monitoring products. Only after a month of parallel running with OpenNMS did we decommission our existing solutions.
Second to go was the diverse collection of emails sent by applications or batch jobs. We replaced the destination email addresses with some mailboxes that delivered the notifications directly into OpenNMS. This turned out to be a bigger win than we'd expected. By having a central point where application alerts could be received and processed, we revealed hidden issues with applications that had existed for weeks or months.
This was painful at first. The respective teams were often uncomfortable having their problems aired to the world. Once we started to address these problems, however, and the frequency of the alerts started to fall, we began to see real benefits. The operations team had a single console to monitor applications, and we could reduce the number of application support staff on call.
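The mailbox-based alert path described above can be illustrated with a minimal sketch. It is not how OpenNMS ingests mail, and the article does not describe the actual mechanism; the function name `email_to_event`, the keyword table, and the severity labels are all assumptions made for illustration. It only shows the general idea of turning a free-form application email into a uniform event record that a central console could display.

```python
from email import message_from_string

# Hypothetical keyword-to-severity mapping; not taken from OpenNMS.
SEVERITY_KEYWORDS = {
    "critical": "Critical",
    "error": "Major",
    "failed": "Major",
    "warning": "Warning",
}


def email_to_event(raw_message: str) -> dict:
    """Turn a raw email into a flat event record (illustrative only)."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    sender = msg.get("From", "unknown")

    # Use the first keyword (in table order) found in the subject line;
    # fall back to "Normal" when nothing matches.
    severity = "Normal"
    lowered = subject.lower()
    for keyword, level in SEVERITY_KEYWORDS.items():
        if keyword in lowered:
            severity = level
            break

    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart bodies are ignored in this sketch

    return {
        "source": sender,
        "summary": subject.strip(),
        "severity": severity,
        "detail": body.strip(),
    }


# Made-up sample message from a batch job.
SAMPLE = (
    "From: batch@example.com\n"
    "To: alerts@example.com\n"
    "Subject: Nightly billing job FAILED\n"
    "\n"
    "Exit code 2 on step 4.\n"
)

event = email_to_event(SAMPLE)
print(event["severity"], "-", event["summary"])
```

Normalizing every message into the same few fields is what makes recurring problems visible: once alerts share a format, they can be counted and sorted in one place instead of sitting unread in individual inboxes.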
The next target was system performance data collected by our existing tools. That which could be readily moved into OpenNMS went quickly. Platform specific data collectors (such as those which collected from Microsoft hosts using WMI) had any important alerts channeled into OpenNMS.
Our current focus, now that we believe our OpenNMS installation is mature, is back in application space. We are extending the end-to-end monitoring capabilities of OpenNMS to our web services providers. We are also starting to use it to retrieve instrumentation data directly from applications themselves, as well as their hosts.
Did We Meet Our Requirements?
Here's how things shook out:
- Platform independence: Yes. OpenNMS can run on spare hardware, but it's not a good idea. A year after our first rollout of OpenNMS, we moved from a shared Sun UltraSPARC 2 machine to a dedicated dual Xeon machine running Red Hat Advanced Server.
- Performance: Yes. We are comfortable that there will always be users pushing the scalability of OpenNMS much harder than we are.
- Enterprise Level Features: A cautious yes. OpenNMS met our initial requirements, but also quickly highlighted new ones. Some customers are never satisfied.
- Rationalize Support Roles: Yes. OpenNMS is now the single point for the distribution of all actionable network, server and application events. This does need to be constantly policed, to ensure that non-standard notification paths do not creep in again.
- Reduce Tasks: A cautious yes. In general, the operators' load has lessened, if only because OpenNMS has reduced the number of open windows on their desktops.
- Extensibility: Yes. OpenNMS has proved to be highly extensible.
- Low cost of entry: We deployed OpenNMS with minimal capital outlay. We believe that the subsequent people-based operational costs have been roughly equivalent to those of a commercial solution.
- Longevity: We seem to have backed a product with "legs." The mailing lists are as busy as ever and new features are being added to OpenNMS faster than we can make use of them.
The "sweet spot" for OpenNMS seems to be about as wide as that of any Open Source solution and is getting bigger by the month. We look forward to enhancements in the web user interface, a new JMX-based data collector and support for event correlation in the near future.
Conclusions and Lessons Learned
What are the key lessons we have learned? In no particular order:
- Address the provision of systems management in prioritized, manageable units of work.
- Don't try to do everything at once.
- Manage the rollout of your systems management application just like any other implementation. After all, availability and integrity of your management application should ideally exceed that of the most critical component that it manages.
- Get buy-in from all the stakeholders in the process, from Management to Shift Operators.
- Think about who will use the tool. You probably don't want to send alerts regarding printer failures to your DBA team. Similarly it's probably not a good idea to send arcane messages about network topology changes to application support teams. Prioritize system alerts and trim out noise.
- Don't use it as a stick to beat developers, network admin or systems admin staff. If your network management tool highlights a problem, use that information as a justification to provide the resources to fix it.
- Do your research. Adopting an Open Source solution requires just as much rigor in the selection and evaluation process as a proprietary solution. By all means download your candidate solution and try it out, but don't allow a machine under a Systems Administrator's desk to become a mission critical component.
- It's not been a totally smooth ride. We had repeated problems with memory leaks within the Java virtual machine (admittedly, not OpenNMS's fault). We also had a few nasty problems with corruption of OpenNMS's back-end database, which are now fixed. There were also a lot of "d'oh!" moments along the way, as we got up to speed with what is a pretty complex application. None of these problems ever seemed like show stoppers at the time. This had much to do with help extended by the development team and user community, to whom we extend our thanks.
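The lesson above about thinking about who will use the tool, prioritizing alerts and trimming noise can be sketched as a small routing function. This is an illustration under stated assumptions, not OpenNMS configuration: the `ROUTES` table, the team names, the severity labels and the `route` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing table: alert category -> team to notify.
ROUTES = {
    "database": "dba",
    "network": "network-admins",
    "printer": "desktop-support",
    "application": "app-support",
}

# Numeric ranks so severities can be compared; the labels are illustrative.
SEVERITY_RANK = {"info": 0, "warning": 1, "major": 2, "critical": 3}


@dataclass(frozen=True)
class Alert:
    category: str
    severity: str
    message: str


def route(alert: Alert, minimum: str = "warning") -> Optional[str]:
    """Return the team to notify, or None if the alert is below the threshold."""
    if SEVERITY_RANK[alert.severity] < SEVERITY_RANK[minimum]:
        return None  # treated as noise and not sent to anyone
    # Categories without an explicit route fall back to the operations team.
    return ROUTES.get(alert.category, "operations")


# Made-up alerts for demonstration.
alerts = [
    Alert("printer", "major", "Paper tray jammed"),
    Alert("network", "info", "Topology change detected"),
    Alert("database", "critical", "Tablespace full"),
    Alert("storage", "warning", "Disk 85% full"),
]

for alert in alerts:
    print(alert.message, "->", route(alert))
```

Keeping the routing rules in one table, rather than scattered across scripts and mail aliases, makes it easier to audit who receives what and to spot non-standard notification paths when they appear.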