Measuring the performance of your IT systems (Part I)

By Andre Griffith

We promised to look at an interesting case of IT governance in action in one of our public agencies this week.  The intended subject was the Georgetown Municipality; however, I have since learnt that an investigation is imminent and therefore prefer not to offer public comment at this time.  In this column we will instead return to one of the “six IT decisions” enumerated by Jeanne Ross and Peter Weill which, they advised, “your IT people should not make”.  This week we take a look at the decision relating to the question of how good the IT systems need to be.

In order to answer the question, it is first necessary to understand that it relates to how good your systems are operationally as opposed to functionally.  The functional perspective has to be addressed during the development or implementation of any particular system and speaks to whether the system does what it is supposed to do.  As examples of these functional considerations you might ask, “Does your invoicing system correctly generate invoices for your customers?  Does your fixed asset system correctly classify assets, work in progress and so on, and does it apply the correct rules for depreciation?  Does an Air Traffic Control system correctly implement assignments for vertical separation of aircraft?”  These are all (deceptively simple) expressions of functional questions, and ensuring that functional requirements are met is a project management issue addressed during the development and implementation of the various software systems.

The question of how good your system needs to be in the operational sense assumes that the functional realm was adequately addressed before the system was put into operation.  The importance of this assumption can readily be inferred by considering the consequences of error in the Air Traffic Control example.  In the operational realm there are a number of metrics that can be chosen to evaluate performance; however, three basic metrics should be the starting point for any initiative to measure the operational performance of an IT system.  These metrics are availability, reliability and response time.

Assuming that we have at our disposal a computer system that adequately implements some set of useful functions (e.g. invoicing our customers), that system is available if users are able to access it and use all of the capabilities that it offers.  Conversely, a system is unavailable if users are unable to access it and use its capabilities.  We usually measure availability as a percentage, and here some interpretations become interesting.  Consider a business that operates between the hours of 7 a.m. and 5 p.m., that is, for a total of ten (10) hours each day.  If its point of sale system becomes unavailable at 3 p.m. on Thursday and is restored at 9 a.m. on Friday, how long has it been unavailable?  A simple but extreme approach is to count elapsed time from the instant the system fails to the time it is restored, and this would give us 18 hours of unavailability.  Measured over two days (48 hours), this unavailability is 37.5%.  At the other extreme, we measure only the unavailability during normal hours of business, when the system is expected to be in use.  In this case there are only four unavailable hours, that is, two hours on Thursday afternoon and two hours on Friday morning.  We need also to take into account that the business day in this example is only 10 hours long, so over the two days we have four unavailable hours out of 20, that is, 20%.  You can immediately see which measure would be more attractive to the IT manager, and in actual fact the latter measure is probably the more reasonable, since it more closely approximates business impact in most cases.  In the former case you would say that your system exhibited availability of 62.5%, while in the latter case the same period of outage would be interpreted as a system availability of 80%.  As a general caution, both values represent poor performance in most cases, especially where the system is critical to your operations.  Returning to an Air Traffic Control system: although our local facilities would be able to manage without an automated system at all, any moderately busy “small airport” would not be able to function without an automated Air Traffic Control system.
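
For readers who like to see the arithmetic spelled out, the short sketch below (in Python, using the illustrative figures from the example above) computes both interpretations of availability.  The function and variable names are my own and are chosen purely for illustration.

```python
# A minimal sketch of the two availability calculations discussed above.
# All names and figures are illustrative examples, not a prescribed method.

BUSINESS_HOURS_PER_DAY = 10   # the shop is open 7 a.m. to 5 p.m.

def availability(unavailable_hours, total_hours):
    """Availability expressed as a percentage of the measurement window."""
    return 100.0 * (1 - unavailable_hours / total_hours)

# Interpretation 1: count every elapsed hour of the outage.
# 3 p.m. Thursday to 9 a.m. Friday is 18 elapsed hours out of a 48-hour window.
elapsed_view = availability(unavailable_hours=18, total_hours=48)

# Interpretation 2: count only hours when the business expected to use the system.
# Two hours Thursday afternoon plus two hours Friday morning, out of 2 x 10 business hours.
business_view = availability(unavailable_hours=4, total_hours=2 * BUSINESS_HOURS_PER_DAY)

print(f"Elapsed-time availability:   {elapsed_view:.1f}%")   # 62.5%
print(f"Business-hours availability: {business_view:.1f}%")  # 80.0%
```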

The second basic measurement of operational performance is reliability, and this goes hand in hand with availability.  Consider again an operation with a ten-hour working day.  Let us say, for argument's sake, that the owners of the business are satisfied with a minimum availability of 90% and that their IT department achieves this figure, which corresponds to one hour's breakdown in a day.  Now that hour could have been the first hour of the day, from 7 a.m. to 8 a.m., which may not have such a high impact.  It could, however, have been between 4 and 5 p.m., when more customers are likely to have been present; in fact it could have been any one-hour period.  The worst case, however, is that the system breaks down unpredictably, in a random manner, and becomes unavailable a number of times all totaling one hour.  It can fail six times during the day for ten minutes each time, or it can fail ten times for six minutes each time.  It is readily seen that there is literally an infinite number of combinations of failure occurrence and failure duration totaling that hour.  This is the mark of an unreliable system.  So while your IT manager might be very happy to report that he or she has met your 90% availability target, you are still unhappy.  You therefore need to say that, in addition to meeting your requirement on availability, he or she is required to ensure that there are no more than a specified number of failures in any hour, day, month or year.  In the above example, you could have limited the failure occurrences to no more than, say, three in any month.  Thus availability has its twin in reliability, and both must be managed.
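
By way of illustration, the sketch below (again in Python, with made-up outage records) shows how availability and reliability are separate tests: the same one hour of total downtime can meet a 90% availability target while failing a cap on the number of failures in the period.  The figures and the cap of three failures are invented for the example.

```python
# Illustrative sketch: the same total downtime can pass an availability target
# yet fail a reliability target.  All figures here are invented examples.

from datetime import timedelta

# Six random outages of ten minutes each -- one hour of downtime in total.
outages = [timedelta(minutes=10)] * 6

business_hours_in_day = 10
downtime_hours = sum(o.total_seconds() for o in outages) / 3600

availability = 100.0 * (1 - downtime_hours / business_hours_in_day)
meets_availability = availability >= 90.0   # target: at least 90% availability
meets_reliability = len(outages) <= 3       # target: no more than three failures in the period

print(f"Availability: {availability:.1f}%  (target met: {meets_availability})")
print(f"Failures: {len(outages)}  (target met: {meets_reliability})")
# Availability comes out at 90.0%, yet six separate failures breach the reliability limit.
```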

The last basic metric that is oftentimes useful is “response time”.  This metric addresses the common complaint of users that “the system is slow”.  When a user moves a mouse or types a set of characters corresponding to a request or instruction, that request has to be sent over a network to a server and the result transmitted back to the user.  Response time is the measure of the time that a user has to wait for evidence that her request or instruction has been executed.  So, for example, if you send a request to print an invoice and that invoice is printed three seconds later, the response time is three seconds.

Users of information technology are famous for refusing to commit to specific indicators by which performance can be measured, and response time is one of the hardest to pin down.

At its simplest, one can have a blanket statement that says the maximum response time for any event should not exceed, say, one second.  However, there are many instances where one second is unacceptably long (echoing characters to the screen as you type), just as there are instances where one second would be unreasonably short, for example printing a receipt.
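
To make the idea concrete, a sketch along the following lines shows how a blanket one-second rule gives way to a limit per type of operation.  The operation names and thresholds are invented purely for illustration, not recommended values.

```python
# Illustrative sketch: response-time targets set per type of operation rather
# than as a single blanket figure.  All thresholds below are invented examples.

import time

# Maximum acceptable response time, in seconds, for each kind of request.
RESPONSE_TARGETS = {
    "echo_keystroke": 0.1,   # typing must feel instantaneous
    "screen_query":   1.0,   # an on-screen lookup
    "print_receipt":  5.0,   # a printer is allowed a few seconds
}

def check_response(operation, start, end):
    """Report whether the measured response time meets the target for this operation."""
    elapsed = end - start
    target = RESPONSE_TARGETS[operation]
    verdict = "OK" if elapsed <= target else "TOO SLOW"
    print(f"{operation}: {elapsed:.2f}s (target {target}s) -> {verdict}")
    return elapsed <= target

# Example: timing an invoice print that takes about three seconds.
start = time.monotonic()
time.sleep(3)                      # stands in for the real work of printing
check_response("print_receipt", start, time.monotonic())
```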

In availability, reliability and response time, executives are equipped with basic tools by which they can specify performance parameters for their IT systems and, by extension, for their IT personnel.  At the end of the planning process, management would have selected a set of specific performance targets for the IT department to meet.
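
Pulling the three metrics together, such a set of targets might look something like the brief sketch below.  Every figure is an invented example and would of course be set to suit the particular business.

```python
# Illustrative sketch of a "performance target sheet" handed to the IT department.
# All figures are invented examples, not benchmarks.

performance_targets = {
    "availability_percent":   99.0,   # measured over business hours only
    "max_failures_per_month": 3,      # the reliability limit
    "response_time_seconds": {        # limits per type of operation
        "echo_keystroke": 0.1,
        "screen_query":   1.0,
        "print_receipt":  5.0,
    },
}
```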

In the continuation of this theme we shall examine some reasonable performance benchmarks, some of the considerations in achieving them, and the costs that follow from selecting various performance targets, all of which can serve to guide decision-making.