Beyond WebSphere Family Monitoring

The integrated application processes that drive real-time business must flow smoothly, efficiently and without interruption. These processes are comprised of a series of interrelated complex events which must be tightly managed to ensure the health of your business processes. To achieve this, deep visibility into the core metrics of business processes is crucial.. But this level of insight is impossible when you’re limited to static events. This paper explores the next level in dealing with the increasing complexity of inter-application communication in the real-time enterprise with the ability to dynamically create events based on actual conditions and related data elements across the enterprise.

Stanford University Emeritus professor David Luckham has spent decades studying event processing. In his book, The Power of Events he says that we need to

“View enterprise system based on how we use them – not in terms of how we build them.”  This is an important paradigm shift which can move us towards the goal of business activity monitoring

Currently we have Enterprise systems management (ESM) with a foundation of event based monitoring (EBM). It provides availability with automation using event based instrumentation, with threshold based performance and alerts and notifications. This is good but not good enough for the evolving needs of real time business. Gartner says that event driven applications are the next big thing.

Typically in event based monitoring events come from different middlewares – web servers, databases, applications, network devices, mobile devices, through the IT infrastructure of middleware, application servers, applications and the network. The events can be seen by the data centre. The idea is that the ESM system will monitor, detect, notify and take corrective action either automatically or with manual intervention. But there are different business users with their own perpectives.

Event based ESMs can’t really take good corrective action as they can’t correlate the event with the effect.

Here’s a typical example of a business activity. The head office requests a price change in all the stores. It updates the price in the master database and checks the inventory levels and then transmits the change to all the stores. But what happens if something goes wrong.

You/the customer will be asking yourself lots of questions. 1. From an IT perspective you’ll want to know where the problem/slowdown is i.e. which queue or channel has the problem that caused the change not to happen. But there are business level questions too. 2. You want to know why the change didn’t take place at all the stores. 3. You need to ask yourself from a business perspective what the impact of this is. 4. You may know that a channel’s gone down or you may know that a price change hasn’t happened but do you know what else has been affected?

These are problems that you don’t really have time to address. EBMs don’t know what to do about this out of the box as they are designed at a technology level and to configure them to understand the business is too hard to set up. With constantly changing business and application needs you can’t adapt your monitoring and automation fast enough.

Here is another real life example.  A stock trade or share purchase. The customer says they want to buy something. You check their account, then the stock availability and price and then they agree they want to buy at that price. Then you process and complete the trade and update the stock price.

This is straight through processing. The transaction has to be done atomically as one unit of work within a certain time. These are serially dependent transactions. But what happens if they don’t complete in time. Again you will ask yourself the questions. From an IT perspective you will want to know what the cause of the problem was. But also the business units who were affected will want to know about it. You will want to know which business units and transactions were affected. The correct, and only the correct, business guys will want to know. You will want to know the business impact of this – to see the problem and impact from a business perspective.

So what is the business impact? At a high level its loss of money. You’ll know that the transaction didn’t take place but you won’t know the real root cause so everyone will be blaming everyone else, wasting time and damaging morale and relationships. During this time you are not delivering the business service. You have damaged your relationship with the customer as you haven’t delivered what they needed and you can’t even explain why. So you’re going to lose your competitive advantage.

To give this root cause analysis and business impact analysis customers normally have to put a lot of resource into customising an event based solution or just developing their own monitoring solution but this is not flexible enough. It is not feasible in an increasingly complex environment of technology. So we have to ask ourselves how can we have a monitoring solution that is flexible enough to keep our business systems productive, rapidly and constantly adapting to incorporate a changing IT and business environment.

So in summary, the big questions are Why is it so hard to detect and prevent these situations? How can we make the transition to real-time on-demand monitoring. How can we align our IT environment with the business priorities to achieve the business goals?

These problems arise because we’re using event based monitoring. Monitoring at an IT or technology level is preventing us from achieving business activity monitoring.

David Luckham refines this more to talk about Business Impact Analysis – overcoming IT blindness. We should be looking at the complex events. Correlating or aggregating the various events and metrics to see the business impact. He talks about the event hierarchy and the processing of complex events. MQ has about 40 static events like queue full, channel stopped etc. But there are events from WAS, DB2 etc, and there are metrics like channel throughput, cpu usage and TIME. There are also home grown apps which need monitoring and there are business events and metrics. All these need to be taken into account to give a higher level complex event. For example if a queue is filling up at a certain rate you can calculate that in a certain amount of time you will receive a static simple queue high event. But by that time it will be too late. You need to aggregate the metrics queue fill rate, queue depth, maximum depth and time to generate a complex event.

So the problems with the current state of event based monitoring are:

Event causality – there’s not enough information to identify the root cause of the problem. The price update didn’t’ happen, but why? MQ failed but why? Maybe it was a disk space problem. Maybe it was caused by something in a different part of the environment – a different computer or application.

Interpretation – looking at it at a simple technology level we don’t have enough information to see the effect of this simple problem on the different parts of the enterprise – to see the effect from the perspectives of the different users, and to notify them and resolve the problems it causes for them.

Agility – Out of the box ESM or EBM solutions cannot possibly know the business requirements. They require a lot of customisation when you initially set them up to be able to understand the effects of different problems on the different users and then constant customisation as the technological and business environment constantly changes. They are constantly playing a game of catch up that they can never win.

Awareness – Because they are only looking at individual points of technology they have a blindness to the end to end transaction. They cannot know how a simple technology problem affects the rest of the technologies or businesses.

Another shortcoming of the current generation of system management is false positives. This is a big problem with simple event based monitoring. You have a storm of alerts. The data centre sees an event saying the queue is full. They call the MQ team who say not to worry about it; it’s just a test queue, or a queue for an application that hasn’t gone live yet. After the first 24 times that this happens the data centre stops paying attention to queue full events. Then the 25th one happens which is for an important transaction which needs to be dealt with immediately and they just ignore it. The company loses business etc and it’s as if they didn’t have a monitoring solution at all. So what we need is a high level of granularity on the queue monitoring based not just on whether a queue is full but what queue it is, who the application owner is, what time of day it is, what applications are reading from the queue etc.

It’s not enough to provide monitoring data, it has to be information. It has to be interpreted in a way that is useful. What we need is dynamic metric based monitoring. The difference between events and metrics or data and information. You need metric based monitoring to create complex events – in context, user specific events that are pre-emptive before a real business problem happens which can be actionable. The problem isn’t getting events, its event correlation with rules etc. You need to watch more than the vendor gives. It can’t be enough.

There is something called the ‘aha phenomenon’. When a problem occurs you spend ages trying to identify the cause, looking at all the queues and middlewares and applications. All the time you’re looking the technology’s not running and the business is losing money. Eventually you find it and say ‘aha!’ Then what happens? Can you easily adapt your monitoring environment to make sure it doesn’t happen again or that you at least don’t have to search again when it does happen. In other words you need dynamic monitoring – where the monitoring environment of event correlation, metric selection and rules application can be constantly updated.

So let’s expand the vision of what we need. We need a unified approach like the service oriented architectures that are so popular for applications i.e. a reusable monitoring architecture. We don’t need a silo or isolated tool – the antithesis of SOAs. It needs to be a business oriented on demand solution. It needs to be modular, extensible, adaptable, scalable and reusable. We need instrumentation for all the different applications and middlewares. And the environment status needs to be shown to all the different stakeholders from their own perspective for their own roles and responsibilities.

By applying the service oriented architecture principles we can achieve the Business Activity Monitoring and business agility that we really need. A business centric solution aligning IT to the business processes so the business can actually benefit from the technology rather than being constrained by it. Using this you can see the impact of a problem from all perspectives and you can rapidly adapt to the changing business and technological environment learning from mistakes. Currently 80% of IT resource is consumed by maintaining the technology. Using this architecture we can free the resources to other products, develop the business and make more money.

In summary, this unified model gives business and technology continuity and automatic recovery. It gives very granular complex events allowing root cause analysis and business impact analysis by being aware of the business processes affected by the technology and displaying the information in a business context giving an improved quality of service.

Of course there are pros and cons to being standards based. Some Service Oriented Architectures such as .Net and WebServices are still in flux. We need unified SOA security across all platforms. To be proactive in the way that is needed will require polling which needs to be configured to avoid performance problems.

But anyway, what I’ve proposed here is a unified model, a base for business activity monitoring. As David Luckham says “The challenge is not to restrict communication flexibility, but to develop new technologies to understand it”. So I propose that the key to dealing with complexity and delivering true business activity monitoring solutions is a unified model based on a service oriented architecture. This doesn’t happen out of the box as no vendor or developer can know all your requirements but it is a framework which is modular, extensible, adaptable, scalable and resuable enough to facilitate what we need.

© Sam Garforth   2005

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s