Friday, April 20, 2007

Service Level Automation Deconstructed: Analyzing Service Levels

This is the second post in my series providing a brief overview of the three critical assumptions of a Service Level Automation environment. Today I want to focus on how the metrics gathered by the "measure" capabilities of an SLA environment are evaluated to determine whether action is needed and, if so, what action the "response" capabilities should take.

Let me first acknowledge that my discussion of the measure capabilities included some analysis of simple metrics to create complex metrics. This is one piece of the analysis puzzle, and is a critical one to acknowledge. Ideally, all software and hardware systems would be designed to intelligently communicate the metrics that matter most to determine service levels. Where this consolidation occurs depends on the requirements of the environment:
  • Centralized approach: Gather fundamental data from target systems, send it to a central metrics processor, and consolidate metrics there. The advantage here is having one place to maintain consolidation rules. The disadvantage is increased network traffic.
  • Decentralized approach: Gather fundamental data on the target systems themselves and do whatever analysis is needed there to consolidate it into a simplified composite metric. Send only the composite metric to the core rules engine (which I will discuss next).
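
As a rough illustration of the decentralized approach, here is a minimal Python sketch in which a target system folds a few raw readings into one composite metric before sending it on. The field names, the weights and the 0-to-1 scale are all my own assumptions for illustration, not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class RawReadings:
    """Fundamental data gathered on the target system (hypothetical fields)."""
    cpu_utilization: float     # fraction of CPU in use, 0.0 to 1.0
    memory_utilization: float  # fraction of memory in use, 0.0 to 1.0
    queue_depth: int           # requests currently waiting to be served
    max_queue_depth: int       # depth treated as fully saturated (must be > 0)


def composite_load(readings: RawReadings) -> float:
    """Consolidate raw readings into a single load score between 0.0 and 1.0.

    The weights are illustrative assumptions; a real environment would tune them.
    """
    # Cap the queue ratio so an overflowing queue cannot push the score past 1.0.
    queue_ratio = min(readings.queue_depth / readings.max_queue_depth, 1.0)
    score = (0.5 * readings.cpu_utilization
             + 0.25 * readings.memory_utilization
             + 0.25 * queue_ratio)
    return round(score, 3)


# Only this one number would travel to the core rules engine.
load = composite_load(RawReadings(0.8, 0.4, 50, 100))
```

In the centralized approach the same function would simply run on the central metrics processor instead, at the cost of shipping every raw reading across the network.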

Metrics consolidation is not really the core analytics function of a Service Level Automation architecture, however. The key functions are actually the following:

  • Are metrics being received as expected? (A negative response would likely indicate a failure in the target component or in the communication chain with that component.)
  • Is each metric within the business and IT service level goals set for it?
  • If metrics are outside of established service level goals, what response should be taken by the SLA environment?
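
The three questions above can be sketched as a single evaluation function. This is only a sketch under my own assumptions: the Goal structure, the silence threshold and the response names ("investigate_component", "add_capacity", "remove_capacity") are hypothetical placeholders, not the vocabulary of a real rules engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Goal:
    """Service level goal for one metric (hypothetical structure)."""
    low: float                  # lowest acceptable value
    high: float                 # highest acceptable value
    max_silence_seconds: float  # how long we tolerate hearing nothing


def analyze(value: Optional[float], seconds_since_last: float, goal: Goal) -> str:
    """Return the name of the response to request, or "none"."""
    # 1. Are metrics being received as expected?
    if value is None or seconds_since_last > goal.max_silence_seconds:
        return "investigate_component"
    # 2. Is the metric within the service level goal?
    if goal.low <= value <= goal.high:
        return "none"
    # 3. Outside the goal: which response should be taken?
    return "add_capacity" if value > goal.high else "remove_capacity"


cpu_goal = Goal(low=0.2, high=0.8, max_silence_seconds=60)
decision = analyze(0.93, seconds_since_last=5, goal=cpu_goal)
```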

Given my recent reading into complex event processing (CEP), this looks to me like a specialized form of event processing, at the very least. The analysis capabilities of an SLA environment must constantly monitor the incoming metrics data stream, look for patterns of interest (namely goal violations, but who knows...) and trigger a response when conditions dictate.
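
To make the "pattern of interest" idea a little more concrete, here is a small Python sketch that watches a stream of readings and triggers only when a goal violation is sustained across several consecutive readings. The window size and the 0.9 threshold are assumptions I picked for the example; a real CEP engine would express this kind of rule in its own language.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, Tuple


def detect_sustained_violations(
    stream: Iterable[float],
    is_violation: Callable[[float], bool],
    window: int = 3,
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) each time the last `window` readings all violate the goal."""
    recent: Deque[bool] = deque(maxlen=window)
    for index, value in enumerate(stream):
        recent.append(is_violation(value))
        if len(recent) == window and all(recent):
            yield index, value
            recent.clear()  # start over so one incident triggers one response


readings = [0.5, 0.95, 0.97, 0.6, 0.92, 0.93, 0.99, 0.98]
triggers = list(detect_sustained_violations(readings, lambda v: v > 0.9))
```

Requiring a run of violations rather than a single one is a design choice on my part: it keeps a momentary spike from setting off a response.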

The great thing about this particular EP problem is that well designed solutions can be replicated to all data centers using similar metrics and response mechanisms (e.g. power controllers, OSes, switch interfaces, etc.). Since there are actually relatively few components in the data center stack to be managed (servers [physical, virtual, application, etc.], network and storage), the rule set required to provide basic SLA capabilities is replicable across a wide variety of customer environments.

(That's not to say the rule set is simple... it's actually quite complex, and can be affected by new types of measurement and new technologies to be managed. Buy is definitely preferred over build in this space, but some customizability is always necessary.)

Finally, I'd also like to point out that there is a similar analysis function at the response end as at the measure end. Namely, it is often desirable for the response mechanism to take a composite action request and break it into discrete execution steps. The best example I can think of for this is a "power down" action sent from the SLA analysis environment to a server. A typical power controller will take such a request and signal to the OS that a shutdown is imminent, whereupon the OS will execute any required scripts and actions before signaling that its shutdown is complete. At that time, the power controller turns off power to the server.
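
The "power down" sequence just described might be sketched like this in Python. The two callables are hypothetical stand-ins for the OS and the power hardware, and leaving the power on when the OS never confirms is my own assumption about a sensible fallback, not something every power controller does.

```python
from typing import Callable, List


def power_down(server: str,
               run_shutdown_scripts: Callable[[str], bool],
               cut_power: Callable[[str], None]) -> List[str]:
    """Break a composite "power down" request into discrete steps.

    Returns a log of the steps that were carried out.
    """
    log = [f"notify OS on {server} that shutdown is imminent"]
    # The OS runs its own scripts and reports whether it finished cleanly.
    if not run_shutdown_scripts(server):
        log.append("OS did not confirm shutdown; power left on")
        return log
    log.append("OS confirmed shutdown complete")
    # Only now does the controller actually remove power.
    cut_power(server)
    log.append(f"power removed from {server}")
    return log


powered_off: List[str] = []
steps = power_down("web-01", lambda name: True, powered_off.append)
```

The analysis environment only ever asks for "power down"; the ordering of the individual steps stays inside the response mechanism.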

As with measure, I will use the label "analyze" to mark future posts that expand on the analysis concept. As always, I welcome your feedback and hope you will join the SLA conversation.
