JRobin Spike Hunter

Motivation

Occasionally, due to counter wraps or just plain bad data, huge spikes appear in the data collected by OpenNMS. It's relatively easy to use the JRobin RrdInspector to track down one or two of these spikes and replace them with NaN or another suitable value, but fixing them up in bulk is very tedious. Trying to do these fix-ups on a running system is nearly impossible since the updates must be done within a single collection cycle in order to minimize data loss. This tool is designed to run directly against a JRobin RRD archive.

Operation

usage: usage: spike-hunter [options]
 -p,--dump-contents           Just dump the DSes and RRAs in the JRobin
                              disk file.
 -n,--dry-run                 Just report spikes, do not make any changes
                              to the JRobin disk file.
 -o,--operands                Operands (numeric, comma-separated) for the
                              selected analysis strategy. Defaults to 95,5.
 -d,--ds-name                 Data source names on which to operate,
                              comma-separated. If unspecified, operate on all DSes.
 -a,--analysis-strategy       Data analysis strategy. Defaults to
                              percentile.
 -r,--replacement-strategy    Strategy for replacing spike samples, one of
                              nan|previous|next, defaults to nan
 -f,--file                    JRobin disk file on which to operate
 -h,--help                    This help text
 -q,--quiet                   Do not print any informational output
 -v,--verbose                 Print plenty of informational output

Analysis Strategies

Spike Hunter is designed so that it can use different analysis strategies. The initial version supports only a percentile analysis strategy.

Percentile Analysis Strategy

The percentile analysis strategy selects violating values as those that exceed a value equal to an Nth percentile (calculated across all values in a given archive) times a multiplier M. It takes two operands:

Percentile
an integer N that sets a percentile basis for the set of values being analyzed. For a set { 1, 2, 3, 4, 5, 6, 7, 8, 90, 100 } where N=80, the percentile basis would be 8.
Multiplier
an integer M by which the percentile basis (the value at the Nth percentile, not N itself) is multiplied to form an absolute ceiling. All values above this ceiling are replaced according to the specified replacement strategy. For the set above where N=80 and M=5, the ceiling is 8 × 5 = 40, so all values above 40 would be considered in violation.
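
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Spike Hunter's own code: the function names are hypothetical, and the nearest-rank percentile method is an assumption, chosen because it reproduces the basis of 8 in the example. Ignoring NaN samples is likewise an assumption.

```python
import math

def percentile_basis(values, n):
    """Return the Nth percentile of values using the nearest-rank method.

    Assumption: nearest-rank is used and NaN samples are ignored; the
    tool's actual percentile calculation may differ.
    """
    ordered = sorted(v for v in values if not math.isnan(v))
    if not ordered:
        raise ValueError("no non-NaN samples to analyze")
    # Integer ceiling of n% of the sample count, clamped to at least 1.
    rank = max(1, (n * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def find_violations(values, n=95, m=5):
    """Return (ceiling, indexes of the samples that exceed the ceiling)."""
    ceiling = percentile_basis(values, n) * m
    return ceiling, [i for i, v in enumerate(values) if v > ceiling]

samples = [1, 2, 3, 4, 5, 6, 7, 8, 90, 100]
ceiling, violations = find_violations(samples, n=80, m=5)
print(ceiling)                           # 40
print([samples[i] for i in violations])  # [90, 100]
```

The defaults n=95 and m=5 mirror the documented default operands 95,5.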

Replacement Strategies

Like the analysis strategies, the methods for choosing a replacement value for samples found in violation are designed to be pluggable. In the initial version, three replacement strategies are available.

NaN Replacement Strategy

Statically replaces all values with NaN (not a number), effectively removing the bad sample without trying to replace it with sensible data. A small hole in the data set results, assuming that spikes are isolated and spaced fairly far apart. This is the only replacement strategy that is tested and expected to work in the initial version.

Previous Replacement Strategy

Dynamically walks backward from the index of the sample found in violation, replacing the violating value with the value of the nearest lower-indexed sample that was not also found to be in violation. For the set { 1, 2, 3, 4, 5, 2, 6, 1 } where the absolute ceiling calculated by the chosen analysis strategy is 3, the output would be { 1, 2, 3, 3, 3, 2, 2, 1 }. This strategy is quite likely not working at present.
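
A minimal Python sketch of the backward walk described above, reproducing the example output. The function name is hypothetical and this is not the tool's code. What happens when no earlier non-violating sample exists is not stated here, so the fallback to NaN is an assumption.

```python
import math

def replace_with_previous(values, ceiling):
    """Replace each violating sample with the nearest earlier good sample.

    Assumption: when no earlier non-violating sample exists, fall back to
    NaN; the behavior of the real tool in that case is not documented.
    """
    result = list(values)
    for i, value in enumerate(values):
        if value > ceiling:
            j = i - 1
            # Walk backward past any neighbors that are also in violation.
            while j >= 0 and values[j] > ceiling:
                j -= 1
            result[i] = values[j] if j >= 0 else math.nan
    return result

print(replace_with_previous([1, 2, 3, 4, 5, 2, 6, 1], ceiling=3))
# [1, 2, 3, 3, 3, 2, 2, 1]
```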

Next Replacement Strategy

Operates as the exact reverse of the previous replacement strategy: it walks forward from the violating sample and uses the value of the nearest higher-indexed sample that is not also in violation. For the same set { 1, 2, 3, 4, 5, 2, 6, 1 } where the absolute ceiling calculated by the chosen analysis strategy is 3, the output would be { 1, 2, 3, 2, 2, 2, 1, 1 }. This strategy is quite likely not working at present.
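
The forward walk can be sketched the same way in Python. As before, the function name is hypothetical, and the fallback to NaN when no later non-violating sample exists is an assumption rather than documented behavior.

```python
import math

def replace_with_next(values, ceiling):
    """Replace each violating sample with the nearest later good sample.

    Assumption: when no later non-violating sample exists, fall back to NaN.
    """
    result = list(values)
    for i, value in enumerate(values):
        if value > ceiling:
            j = i + 1
            # Walk forward past any neighbors that are also in violation.
            while j < len(values) and values[j] > ceiling:
                j += 1
            result[i] = values[j] if j < len(values) else math.nan
    return result

print(replace_with_next([1, 2, 3, 4, 5, 2, 6, 1], ceiling=3))
# [1, 2, 3, 2, 2, 2, 1, 1]
```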

Nearest Replacement Strategy

Not yet implemented, but expected to choose between the previous and next strategies based on the shorter distance traveled to find a replacement value.
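
Because this strategy does not exist yet, the following Python sketch is speculative: it shows one way the description could be realized. The function name is hypothetical. Breaking ties in favor of the earlier sample and falling back to NaN when no non-violating sample exists are both assumptions.

```python
import math

def replace_with_nearest(values, ceiling):
    """Replace each violating sample with the closest good sample.

    Assumptions: on a tie the earlier (previous) sample wins, and NaN is
    used when every sample in the set is in violation.
    """
    def is_good(k):
        # In range and not above the ceiling.
        return 0 <= k < len(values) and not values[k] > ceiling

    result = list(values)
    for i, value in enumerate(values):
        if value > ceiling:
            result[i] = math.nan
            # Widen the search one step at a time in both directions.
            for distance in range(1, len(values)):
                if is_good(i - distance):
                    result[i] = values[i - distance]
                    break
                if is_good(i + distance):
                    result[i] = values[i + distance]
                    break
    return result

print(replace_with_nearest([1, 2, 3, 4, 5, 2, 6, 1], ceiling=3))
# [1, 2, 3, 3, 2, 2, 2, 1]
```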

Dry Run Mode

In dry-run mode, Spike Hunter does not actually modify the RRD file. It performs analysis, chooses replacement values, and prints information about what it would do if you were not doing a dry run.

Dump Contents

In dump-contents mode, no data analysis is performed. Spike Hunter dumps the list of archive definitions and data source names in the RRD file and then exits.

Specifying Data Source Names

You may specify one or more data source names as a comma-separated list (e.g. ifInOctets,ifOutOctets). This is intended to allow the Spike Hunter to target just one data source when used against an RRD file generated by a system that is using storeByGroup. Note that no testing has been done on a storeByGroup system as of the initial version. If no data source names are specified, Spike Hunter operates on all data sources in the file.

Usage Example

$ java -jar spike-hunter-1.3.12-SNAPSHOT-jar-with-dependencies.jar -f ifInOctets.rrd -o 95,5
Operating on archive with CF AVERAGE, 1 steps
 Operating on DS ifInOctets
   Sample with timestamp Fri Jan 25 18:35:00 EST 2008 and value 9.504276672152653E7 replaced by value NaN
   Sample with timestamp Fri Jan 25 18:40:00 EST 2008 and value 1965154.7625812415 replaced by value NaN
   Sample with timestamp Fri Jan 25 23:35:00 EST 2008 and value 1.0853515656474456E7 replaced by value NaN
   Sample with timestamp Fri Jan 25 23:40:00 EST 2008 and value 1201576.814450727 replaced by value NaN
   Sample with timestamp Sat Jan 26 01:55:00 EST 2008 and value 1.4315325393066667E7 replaced by value NaN
   Sample with timestamp Sun Jan 27 15:30:00 EST 2008 and value 1.4283405304E7 replaced by value NaN
   Sample with timestamp Mon Jan 28 15:30:00 EST 2008 and value 1.3785921477866666E7 replaced by value NaN
   Sample with timestamp Mon Jan 28 15:35:00 EST 2008 and value 539742.7210666667 replaced by value NaN
   Sample with timestamp Mon Jan 28 19:55:00 EST 2008 and value 1.3536148165333332E7 replaced by value NaN
   Sample with timestamp Mon Jan 28 20:00:00 EST 2008 and value 781271.6285767443 replaced by value NaN
   Sample with timestamp Mon Jan 28 23:30:00 EST 2008 and value 1.34497775504E7 replaced by value NaN
   Sample with timestamp Mon Jan 28 23:35:00 EST 2008 and value 867316.7518936878 replaced by value NaN
   Sample with timestamp Tue Jan 29 05:40:00 EST 2008 and value 1.42601103696E7 replaced by value NaN
   Sample with timestamp Tue Jan 29 09:55:00 EST 2008 and value 1.127393244675083E7 replaced by value NaN
   Sample with timestamp Tue Jan 29 10:00:00 EST 2008 and value 2040171.2444491696 replaced by value NaN
   Sample with timestamp Tue Jan 29 21:20:00 EST 2008 and value 83407.7342192691 replaced by value NaN
   Sample with timestamp Tue Jan 29 22:05:00 EST 2008 and value 1.4070540779199999E7 replaced by value NaN
   Sample with timestamp Tue Jan 29 22:10:00 EST 2008 and value 616173.12 replaced by value NaN
   Sample with timestamp Tue Jan 29 22:15:00 EST 2008 and value 616173.12 replaced by value NaN
   Sample with timestamp Wed Jan 30 09:45:00 EST 2008 and value 4836218.265559248 replaced by value NaN
   Sample with timestamp Wed Jan 30 09:50:00 EST 2008 and value 8760581.822015505 replaced by value NaN
   Sample with timestamp Thu Jan 31 07:00:00 EST 2008 and value 1.276050119839114E7 replaced by value NaN
   Sample with timestamp Thu Jan 31 07:05:00 EST 2008 and value 669807.4533421927 replaced by value NaN
   Sample with timestamp Thu Jan 31 17:25:00 EST 2008 and value 89813.4314524917 replaced by value NaN
   Sample with timestamp Thu Jan 31 17:30:00 EST 2008 and value 90969.76426666667 replaced by value NaN
   Sample with timestamp Thu Jan 31 17:35:00 EST 2008 and value 83706.69038763797 replaced by value NaN
   Sample with timestamp Fri Feb 01 01:25:00 EST 2008 and value 1.3384521442133335E7 replaced by value NaN
   Sample with timestamp Fri Feb 01 01:30:00 EST 2008 and value 932733.2679760798 replaced by value NaN
   Sample with timestamp Fri Feb 01 06:30:00 EST 2008 and value 1.3152065184533333E7 replaced by value NaN
   Sample with timestamp Fri Feb 01 06:35:00 EST 2008 and value 1165396.5142183832 replaced by value NaN
   Sample with timestamp Sun Feb 03 01:20:00 EST 2008 and value 1.2818183628266666E7 replaced by value NaN
   Sample with timestamp Sun Feb 03 01:25:00 EST 2008 and value 1498661.8576 replaced by value NaN
   Sample with timestamp Mon Feb 04 16:10:00 EST 2008 and value 1.3993884670741972E7 replaced by value NaN
   Sample with timestamp Mon Feb 04 16:15:00 EST 2008 and value 339664.85779136216 replaced by value NaN
   Sample with timestamp Wed Feb 06 17:50:00 EST 2008 and value 9383699.483853823 replaced by value NaN
   Sample with timestamp Wed Feb 06 17:55:00 EST 2008 and value 2542138.8275526026 replaced by value NaN
   Sample with timestamp Wed Feb 06 22:45:00 EST 2008 and value 9137035.608990477 replaced by value NaN
   Sample with timestamp Wed Feb 06 22:50:00 EST 2008 and value 1.4276544853333334E7 replaced by value NaN
   Sample with timestamp Wed Feb 06 22:55:00 EST 2008 and value 1.4276544853333334E7 replaced by value NaN
   Sample with timestamp Thu Feb 07 22:10:00 EST 2008 and value 1.4096723475733334E7 replaced by value NaN
   Sample with timestamp Thu Feb 07 22:15:00 EST 2008 and value 220239.23718803987 replaced by value NaN
   Sample with timestamp Fri Feb 08 12:25:00 EST 2008 and value 1.3281148310400002E7 replaced by value NaN
   Sample with timestamp Fri Feb 08 12:30:00 EST 2008 and value 1043229.0424770764 replaced by value NaN
Operating on archive with CF AVERAGE, 12 steps
 Operating on DS ifInOctets
   Sample with timestamp Fri Jan 25 19:00:00 EST 2008 and value 1.6168240461376904E7 replaced by value NaN
Operating on archive with CF MIN, 12 steps
 Operating on DS ifInOctets
   Sample with timestamp Thu Jan 31 16:00:00 EST 2008 and value 20158.028520044292 replaced by value NaN
   Sample with timestamp Thu Jan 31 18:00:00 EST 2008 and value 21584.768951495018 replaced by value NaN
   Sample with timestamp Thu Feb 07 17:00:00 EST 2008 and value 25042.24 replaced by value NaN
   Sample with timestamp Thu Feb 07 18:00:00 EST 2008 and value 20290.784857585826 replaced by value NaN
Operating on archive with CF MAX, 12 steps
 Operating on DS ifInOctets
   Sample with timestamp Fri Jan 25 19:00:00 EST 2008 and value 9.504276672152653E7 replaced by value NaN

Building It

As of now (13 February 2008) the only way to get the Spike Hunter is to build it from the OpenNMS source. For information on checking out and building the source, start at the Development page.

In the checked-out source tree, change into the opennms-tools/jrobin-spike-hunter directory. Invoke the Maven build script from this directory:

$ cd ~/git/opennms-trunk/opennms-tools/jrobin-spike-hunter
$ ../../compile.pl install

This will create an executable JAR file in the target subdirectory of the current directory, with a name like spike-hunter-1.3.12-SNAPSHOT-jar-with-dependencies.jar (the version will vary according to your position along the time dimension).