Instrumentation Log Reader Specification
- 1 Instrumentation Log Reader
- 2 TODOs
- 3 Implementation
Instrumentation Log Reader
The purpose of the instrumentation log reader is to be able to determine how long data collection is taking for each node in the system and which nodes are taking the most time.
The instrumentation log reader will be a java program (or a script that calls a java program) that takes the path to an instrumentation log file as produced the a running opennms system. The instrumentation log must have been produced with DEBUG enabled in order to properly produce parse-able data.
The program will read each line of the log file to determine the time at which various events took place during data collection and this data will be used to compute the information necessary.
There are many different lines generated in the instrumentation log. There are initialization lines and operational lines. Many of the lines come in pairs representing timestamps for the beginning and ending of an operation.
An instrumentation log entry has the following general format:
<timestamp> DEBUG [<thread-name>] <operation type> <event-type>: <optional service identifier> <optional error-info>
- the time at which the entry occurred is in the format yyyy-MM-dd HH:mm:ss,S (as defined in http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html)
- a unique name representing the thread that is doing this processing
- indentifies what Collectd was doing when this log entry was posted.
- either begin, end or error and indicated which part of the operation the entry represents
- service identifier
- if this entry is associated with a particular service then the identifier for this service is listed here in <nodeid>/<ip-address>/<service-name> format.
- if this entry represents an error entry then the unformatted details of the error are here until the end of the line
These line represent the beginning and end of collecting the data from the agent (this excludes persisting the data)
2010-05-26 12:12:40,883 DEBUG [CollectdScheduler-50 Pool-fiber11] Collectd: collector.collect: collectData: begin: 24/22.214.171.124/SNMP 2010-05-26 12:12:48,027 DEBUG [CollectdScheduler-50 Pool-fiber11] Collectd: collector.collect: collectData: end: 24/126.96.36.199/SNMP
An error line for data collection looks like the following:
2010-05-26 12:12:47,776 DEBUG [CollectdScheduler-50 Pool-fiber4] Collectd: collector.collect: error: 7/172.20.1.12/SNMP: org.opennms.netmgt.collectd.CollectionTimedOut: Timeout retrieving SnmpCollectors for 172.20.1.12 for kenny.internal.opennms.com/172.20.1.12: SnmpCollectors for 172.20.1.12: snmpTimeoutError for: kenny.internal.opennms.com/172.20.1.12
When an error occurs the end time is usually still produced but the begin and end data should be excluded from timing collections because of the error.
These lines represent the beginning and ending timestamp of handing the persistence data off to queued for storage.
2010-05-26 12:12:48,027 DEBUG [CollectdScheduler-50 Pool-fiber11] Collectd: collector.collect: persistDataQueueing: begin: 24/188.8.131.52/SNMP 2010-05-26 12:12:51,745 DEBUG [CollectdScheduler-50 Pool-fiber11] Collectd: collector.collect: persistDataQueueing: end: 24/184.108.40.206/SNMP
For the first implementation we will simply output the data to the console (ie println).
The data will be output as a list of 'global' stats followed by a table of stats for each services reference in the log.
These are the global that we will compute from the log and produce at the beginning of the output.
- Start time of first collection (C)
- End time of last collection (C)
- Collection Duration - time different between the two above (C)
- Total number of services collected (C)
- Number of unique threads used (B)
- Average Time it takes to collect the data for all the services (A) (RoundTrip time)
Per Node/Service Stats
These are the 'per service' stats that will be output in a table with a row for each service.
- Average Time per collection of service (C)
- Average Time to persist data for service (B)
- Average Time between collections for service (A)
- Number of times data collected for a service (C)
- Total Time Collection data for a service (C)
- Total Time Persisting data for a service (B)
These services will be ordered in descending out by 'average collection time'. That is the service whose collection time took the longest would be list first.
Start Time: 2010-05-26 12:12:40,846 End Time: 2010-05-26 13:12:40,846 Duration: 1h5m32.789s Total Services: 123 Theads Used: 27 Roundtrip Time: 12m Service Avg Collect Time Avg Persist Time Avg Time between Collects # Collections Total Collection Time Total Persist Time 19/220.127.116.11/SNMP 13.458s .002s 5m27s 3 45.98s .010s ....
- Read in each line of the log file
- Parse the read log line and extract data
- Track/Organize parsed data into manageable chunks
- Show parsed/important/relevant data in a graphical view (UI)
- Store data in a text file for future reference
Strategy Explained (The useful bits)
An example instrumentation.log file line
2010-06-01 13:53:41,062 DEBUG [CollectdScheduler-50 Pool-fiber0] Collectd: scheduleExistingInterfaces: begin
In the example line above the important data is as follows. The first part of the line is the timestamp.
The second piece is the type of log message, in this case its DEBUG.
Thirdly we have the thread that this log message came from.
And lastly we have the log message.
Collectd: scheduleExistingInterfaces: begin
The Blueprints (The How-To Guide)
This project will be written following the Test Driven Development paradigm where we write tests for all of the pieces then put the application together using the tested code.
- Write a test the reads in a line from the instrumentation.log file
- Verify that the line matches what you expect
- Write a test that reads in a line from the log file and check that the timestamp is what you expected
- Write a test that correctly identifies each segment of the log file line, and verify that what you get back is what you expected
- Write a test that creates an object that stores the data from the line and verify that the data in the object is what you expected
- Write a test that displays an object as a graphic and verify that the data the graphic displays matches the date in the object
- Write a test that stores the date in a text file for archiving and verify that the text file receives the correct data
The following steps will walk you through running the application. If you complete these steps and see everything described as it should be you know the application is working.
Open the Instrumentation log file reader and pass in the instrumentation.log file
Watch the progress of the application parse out the useful bits of the log file and build out graphs/charts
If you see the useful information then you know its working.