Instrumentation Log Reader Specification

From OpenNMS
Jump to navigation Jump to search

Instrumentation Log Reader

The purpose of the instrumentation log reader is to be able to determine how long data collection is taking for each node in the system and which nodes are taking the most time.

Input

The instrumentation log reader will be a java program (or a script that calls a java program) that takes the path to an instrumentation log file as produced the a running opennms system. The instrumentation log must have been produced with DEBUG enabled in order to properly produce parse-able data.

The program will read each line of the log file to determine the time at which various events took place during data collection and this data will be used to compute the information necessary.

There are many different lines generated in the instrumentation log. There are initialization lines and operational lines. Many of the lines come in pairs representing timestamps for the beginning and ending of an operation.

General Format

An instrumentation log entry has the following general format:

 <timestamp> DEBUG [<thread-name>] <operation type> <event-type>: <optional service identifier> <optional error-info>
timestamp 
the time at which the entry occurred is in the format yyyy-MM-dd HH:mm:ss,S (as defined in http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html)
thread-name 
a unique name representing the thread that is doing this processing
operation-type 
indentifies what Collectd was doing when this log entry was posted.
event-type 
either begin, end or error and indicated which part of the operation the entry represents
service identifier 
if this entry is associated with a particular service then the identifier for this service is listed here in <nodeid>/<ip-address>/<service-name> format.
error-info 
if this entry represents an error entry then the unformatted details of the error are here until the end of the line


Collection Lines

These line represent the beginning and end of collecting the data from the agent (this excludes persisting the data)

 2010-05-26 12:12:40,883 DEBUG [CollectdScheduler-50 Pool-fiber11] Collectd: collector.collect: collectData: begin: 24/216.216.217.254/SNMP
 2010-05-26 12:12:48,027 DEBUG [CollectdScheduler-50 Pool-fiber11] Collectd: collector.collect: collectData: end: 24/216.216.217.254/SNMP

An error line for data collection looks like the following:

2010-05-26 12:12:47,776 DEBUG [CollectdScheduler-50 Pool-fiber4] Collectd: collector.collect: error: 7/172.20.1.12/SNMP: org.opennms.netmgt.collectd.CollectionTimedOut: Timeout retrieving SnmpCollectors for 172.20.1.12 for kenny.internal.opennms.com/172.20.1.12: SnmpCollectors for 172.20.1.12: snmpTimeoutError for: kenny.internal.opennms.com/172.20.1.12

When an error occurs the end time is usually still produced but the begin and end data should be excluded from timing collections because of the error.

Persistence Lines

These lines represent the beginning and ending timestamp of handing the persistence data off to queued for storage.

2010-05-26 12:12:48,027 DEBUG [CollectdScheduler-50 Pool-fiber11] Collectd: collector.collect: persistDataQueueing: begin: 24/216.216.217.254/SNMP
2010-05-26 12:12:51,745 DEBUG [CollectdScheduler-50 Pool-fiber11] Collectd: collector.collect: persistDataQueueing: end: 24/216.216.217.254/SNMP

Output

For the first implementation we will simply output the data to the console (ie println).

The data will be output as a list of 'global' stats followed by a table of stats for each services reference in the log.

Global Stats

These are the global that we will compute from the log and produce at the beginning of the output.

  • Start time of first collection (C)
  • End time of last collection (C)
  • Collection Duration - time different between the two above (C)
  • Total number of services collected (C)
  • Number of unique threads used (B)
  • Average Time it takes to collect the data for all the services (A) (RoundTrip time)

Per Node/Service Stats

These are the 'per service' stats that will be output in a table with a row for each service.

  • Average Time per collection of service (C)
  • Average Time to persist data for service (B)
  • Average Time between collections for service (A)
  • Number of times data collected for a service (C)
  • Total Time Collection data for a service (C)
  • Total Time Persisting data for a service (B)


These services will be ordered in descending out by 'average collection time'. That is the service whose collection time took the longest would be list first.

Sample Output


Start Time: 2010-05-26 12:12:40,846
End Time: 2010-05-26 13:12:40,846
Duration: 1h5m32.789s
Total Services: 123
Theads Used: 27
Roundtrip Time: 12m


Service               Avg Collect Time  Avg Persist Time  Avg Time between Collects # Collections Total Collection Time Total Persist Time
19/172.10.1.21/SNMP       13.458s             .002s              5m27s                    3                 45.98s           .010s
....

TODOs

Implementation

Strategy

  • Read in each line of the log file
  • Parse the read log line and extract data
  • Track/Organize parsed data into manageable chunks
  • Show parsed/important/relevant data in a graphical view (UI)
  • Store data in a text file for future reference

Strategy Explained (The useful bits)

An example instrumentation.log file line

2010-06-01 13:53:41,062 DEBUG [CollectdScheduler-50 Pool-fiber0] Collectd: scheduleExistingInterfaces: begin

In the example line above the important data is as follows. The first part of the line is the timestamp.

2010-06-01 13:53:41,062

The second piece is the type of log message, in this case its DEBUG.

Thirdly we have the thread that this log message came from.

[CollectdScheduler-50 Pool-fiber0]

And lastly we have the log message.

Collectd: scheduleExistingInterfaces: begin

The Blueprints (The How-To Guide)

This project will be written following the Test Driven Development paradigm where we write tests for all of the pieces then put the application together using the tested code.

  1. Write a test the reads in a line from the instrumentation.log file
    1. Verify that the line matches what you expect
  2. Write a test that reads in a line from the log file and check that the timestamp is what you expected
  3. Write a test that correctly identifies each segment of the log file line, and verify that what you get back is what you expected
  4. Write a test that creates an object that stores the data from the line and verify that the data in the object is what you expected
  5. Write a test that displays an object as a graphic and verify that the data the graphic displays matches the date in the object
  6. Write a test that stores the date in a text file for archiving and verify that the text file receives the correct data

Acceptance Criteria

The following steps will walk you through running the application. If you complete these steps and see everything described as it should be you know the application is working.

Step 1

Open the Instrumentation log file reader and pass in the instrumentation.log file

Step 2

Watch the progress of the application parse out the useful bits of the log file and build out graphs/charts

Step 3

If you see the useful information then you know its working.