Are Prometheus & Grafana Sufficient To Support Modern IT?

Originally published here on October 21, 2022.

The Prometheus and Grafana combination is rapidly becoming ubiquitous in the world of IT monitoring. There are many good reasons for this. They are free, open-source toolkits, so they are easy to get hold of and try out, and there is plenty of crowd-sourced help available online for getting started, including documentation from the developers of middleware such as IBM MQ and RabbitMQ.

At Nastel, where we focus on supporting integration infrastructure for leading-edge enterprise customers, we are finding that many of our customers have made policy decisions to use Prometheus and Grafana across the board for their monitoring. However, they are finding that this is not sufficient in all situations.

Business Agility and Improving Time to Market

Speed is King. The business is constantly requesting new and updated applications and architecture, driven by changes in customer needs, competition, and innovation. Application developers must not be the blocker to business. We need business changes at the speed of life, not at the speed of software development.

For Black Friday, a large ecommerce site that typically has 3,000 concurrent visitors suddenly had to handle 1 million visitors in a day. How can they handle over 300 times as many visitors? If they can't cope, Black Friday could turn into a very red one, with a very public outage and serious reputational damage.

Evolving IT

IT is constantly evolving. Most companies moved their IT to the agile development methodology and then they added DevOps with automation, continuous integration, continuous deployment, and constant validation against the ever-changing requirements.

With agile, companies reduced application development time from two years to six months. With DevOps it went down to a month, and now, adding in cloud, companies like Netflix and Eli Lilly can get from requirements, to code, to test, to production in an hour. They've evolved their architectures from monolithic applications to service-oriented architectures to multi-cloud, containers, and microservices. Microservices can quickly be pushed out by DevOps and moved between clouds; they can be ephemeral, serverless, and containerized. They use hyperscale clouds, so named because they have the elasticity to grow and shrink based on these dynamic needs. They have stretch clusters and cloud bursting, so that burstable applications can extend from on-premises into a large temporary cloud environment with the necessary resources for Black Friday and then scale down again.

[Image: multi-cloud transaction tracking]

So now a single business transaction can flow across many different services, hosted on different servers in different countries, sometimes managed by different companies. The microservices are continuously getting updated, augmented, and migrated in real time. The topology of the application can change significantly on a daily basis.

Supporting Agile IT

So how is all this IT supported, and how do we know if it is working? How do we get alerted if there is a breakdown in the transaction flow or if a message gets lost? How can you monitor an application that was only spun up, or moved, for such a short period of time?

By building Grafana dashboards on top of the Prometheus platform, a monitoring, visualization and alerting solution can be constructed which provides visibility of the environment. The question is whether this can be built and adjusted fast enough to keep up with the ever-changing demands of the business and IT.
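To make the mechanics concrete: Prometheus works by periodically scraping a plain-text metrics endpoint, and Grafana charts the resulting time series. Below is a minimal, illustrative sketch of rendering metrics in the Prometheus text exposition format using only standard Python; the metric name and queue labels are invented for the example. In practice you would use the official prometheus_client package rather than hand-rolling this.

```python
# Minimal sketch of the Prometheus text exposition format.
# The metric name and labels below are illustrative, not from any real system.

def render_metrics(metrics):
    """Render {(name, labels): value} entries as Prometheus text-format lines.

    `labels` is a tuple of (key, value) pairs so it can be used as a dict key.
    """
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Example: a counter of processed messages, labelled by queue.
metrics = {
    ("mq_messages_processed_total", (("queue", "ORDERS"),)): 1042,
    ("mq_messages_processed_total", (("queue", "PAYMENTS"),)): 87,
}
print(render_metrics(metrics), end="")
```

In a real deployment the prometheus_client library's `start_http_server` would serve output like this on a `/metrics` endpoint for Prometheus to scrape, and a Grafana panel would query the stored series.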

Nastel’s Integration Infrastructure Platform

Nastel’s Integration Infrastructure Management platform addresses this. Its XRay component dynamically builds a topology view based on the real data that flows through the distributed application. It gives end-to-end observability of the true behavior of your application in a constantly changing agile world. It highlights problem transactions based on anomalies, and enables you to drill down to those transactions, carry out root-cause analysis, and perform remediation. The Nastel solution receives data from across the IT estate, including from Prometheus, and allows rapid creation and modification of visualizations and alerts in a single tool.

Furthermore, the Nastel technology adds almost 30 years’ experience of supporting large production middleware environments. It has deep, granular knowledge of the technologies and issues, and uses learned and derived data to provide AIOps and observability, in addition to traditional event-based monitoring.

Enhancing Your Existing Infrastructure

Companies use Nastel’s Integration Infrastructure Management (i2M) solution to enhance their existing tools. We take a proactive SRE approach, based on an in-depth understanding of the middleware and monitoring history, to prevent outages altogether by monitoring key derived indicators such as:

  • Latency – time to service a request
  • Traffic – demand placed on the system
  • Error Rate – rate of failed requests
  • Saturation – which resources are most constrained
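These four signals can be derived from raw request data. The sketch below shows one minimal way to do so; the record fields, time window, and capacity figure are assumptions for the example, not part of any particular monitoring product.

```python
# Illustrative sketch: deriving the four golden signals from a window of
# request records. Field names and the capacity input are assumptions for
# this example, not any specific product's API.

def golden_signals(requests, window_seconds, capacity_rps):
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if r["failed"])
    traffic = len(requests) / window_seconds           # requests per second
    p95 = durations[int(0.95 * (len(durations) - 1))]  # latency: p95 in ms
    error_rate = errors / len(requests)                # fraction of failed requests
    saturation = traffic / capacity_rps                # demand vs. capacity
    return {
        "latency_p95_ms": p95,
        "traffic_rps": traffic,
        "error_rate": error_rate,
        "saturation": saturation,
    }

# Synthetic data: 100 requests over a 50-second window, 1 in 10 failing.
reqs = [{"duration_ms": 20 + i, "failed": i % 10 == 0} for i in range(100)]
signals = golden_signals(reqs, window_seconds=50, capacity_rps=4)
print(signals)
```

A monitoring pipeline would evaluate expressions like these continuously over a sliding window and alert when, say, saturation approaches 1.0 or the error rate exceeds a threshold.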

Prometheus & Grafana can give high-level monitoring of a static environment, but they are two separate products, and time is money. Being able to fix issues quickly, without war rooms and handovers from support back to development, is time-critical. Do you have enough diagnostic data, skills, or tooling? Is your monitoring provider able to support you on the phone with middleware expertise throughout the outage? Just how important is your integration infrastructure? The Nastel Platform includes Nastel Navigator to quickly fix the problem too.

As business requirements change, it is crucial to be able to change the dashboards and alerts in line with this. Nastel XRay is built with this as a core focus. With its time series database, the dashboards and alerting are all integrated together. The dashboards are dynamically created and change automatically as the infrastructure and message flows change. Set-up takes minimal time, so monitoring is not a blocker to the business. Rather than asking for a screenshot of a monitoring environment in use, ask for a demo of how long it takes to build a dashboard.

[Animation: XRay transaction tracking]

Nastel is the leader in Integration Infrastructure Management (i2M). You can read how a leading retailer used Nastel software to manage their transactions here and you can hear their service automation leader discussing how he used Nastel software to manage changing thresholds for peak periods here.

Is your IT environment evolving like this? What are your experiences with Prometheus and Grafana? Please leave a comment below or contact me directly and let’s discuss it.

Sam’s Views on Cloud for Government Policy Makers

I was honoured to be asked to present yesterday on “Cloud Skills, Flexibility and Strategy” at the Westminster eForum Keynote Seminar: Next steps for cloud computing.

English: The Palace of Westminster from Whitehall. (Photo credit: Wikipedia)

As explained on its website, Westminster Forum Projects enjoys substantial support and involvement from key policymakers within the UK and devolved legislatures, governments and regulatory bodies, and from stakeholders in professional bodies, businesses and their advisors, consumer organisations, local government representatives, and other interested groups. The forum is structured to facilitate the formulation of ‘best’ public policy by providing policymakers and implementers with a sense of the way different stakeholder perspectives interrelate, with the aim of providing policymakers with context for arriving at whatever decisions they see fit.

The abstract to the session asked about the extent to which Government departments are embracing the cloud, what progress is being made in achieving the UK’s Data Capability Strategy on skills and infrastructure development, whether organisations are doing enough to address the emerging shortfall in skills, and about the contradiction between mobile device power and cloud.

I was part of a panel and the following was my five minute introduction.

In my five minutes I’d like to talk about the power of cloud and within that to address three areas raised in the abstract to this session – shared services and shared data; mobile; and skills.

We see cloud as being used in three different ways – optimisation, innovation and disruption. Most of what I’ve seen so far in cloud adoption is about optimisation or cost saving. How to use standardisation, automation, virtualisation and self service to do the same things cheaper and faster.

What’s more interesting is the new things that can be achieved with the innovation and disruption that this can provide.

I’ve been working with various groups – local authorities, police forces, and universities, discussing consolidating their data centres. Instead of each one managing their own IT environment, they can share it in a cloud. They justify this with the cost saving argument but the important thing is, firstly, that they can stop worrying about IT and focus on what their real role is, and secondly that by putting their data together in a shared environment they can achieve things that they’ve never done before.

English: The road to Welton, East Riding of Yorkshire, just south of Riplingham. Taken on the Riplingham to Welton road at MR: SE96293086 looking due south. This is typical south Yorkshire Wolds country. (Photo credit: Wikipedia)

For example, Ian Huntley would never have been hired as a caretaker, and so the Soham murders would have been less likely to happen, if the police force had had access to the data showing that he was known to a different force.

And we wouldn’t have issues with burglars crossing the border between West and North Yorkshire to avoid detection if data was shared.

In Sunderland we predict £1.4m per year in cost savings by optimising their IT environment but what’s more important is that this has helped to create a shared environment for start up companies to get up and running quickly so it’s stimulating economic growth in the area.

Another example is Madeleine McCann. After her disappearance it was important to collect holiday photos from members of the public as quickly as possible. Creating a website for this before cloud would have taken far too long. Nowadays it can be spun up very quickly. This isn’t about cost saving and optimisation, it’s about achieving things that could never have been done before.

This brings me to the question in the abstract about mobile: “As device processing power increases, yet cloud solutions rely less and less on that power, is there a disconnect between hardware manufacturers and app and software developers”. I think this is missing the point. Cloud isn’t about shifting the processing power from one place to another, it’s about doing the right processing in the right place.

English: GPS navigation solution running on a smartphone (iPhone) mounted to a road bike. GPS is gaining wide usage with the integration of GPS sensors in many mobile phones. (Photo credit: Wikipedia)

In IBM we talk about CAMS – the nexus of forces of Cloud, Analytics, Mobile and Social, and we split the IT into Systems of Record and Systems of Engagement. The Systems of Record are the traditional IT – the databases that we’re talking about moving from the legacy data centres to the cloud. And, as we’ve discussed, putting it into the cloud means that a lot of new analytics can happen here. With mobile and social we now have Systems of Engagement. The devices that interact with people and the world. The devices that, because of their fantastic processing power, can gather data that we’ve never had access to before. These devices mean that it’s really easy to take a photo of graffiti or a hole in the road and send it to the local council through FixMyStreet and have it fixed. It’s not just the processing power, it’s the instrumentation that this brings. We now have a GPS location so the council know exactly where the hole is. And of course this makes it a lot easier to send photos and even videos of Madeleine McCann to a photo analytics site.

We’re also working with Westminster council to optimise their parking. The instrumentation and communication from phones helps us do things we’ve never done before, but then we move onto the Internet of Things and putting connected sensors in parking spaces.

With connected cars we have even more instrumentation and possibilities. We have millions of cars with thermometers, rain detection, GPS and connectivity that can tell the Met Office exactly what the weather is with incredible granularity, as well as the more obvious solutions like traffic optimisation.

Moving on to skills. IBM has an Academic Initiative where we give free software to universities, work with them on the curriculum, and even act as guest lecturers. With Imperial College we’re providing cloud-based marketing analytics software as well as data sets and skills, so that they can focus on teaching the subject rather than worrying about the IT. With computer science in school curriculums changing to be more about programming skills, we can offer cloud-based development environments like IBM Bluemix. We’re working with the Oxford and Cambridge examination board on their modules for cloud, big data, and security.

Classroom 010 (Photo credit: Wikipedia)

To be honest, it’s still hard. Universities are a competitive environment and they have to offer courses that students are interested in rather than ones that industry and the country need. IT is changing so fast that we can’t keep up. Lecturers will teach subjects that they’re comfortable with and students will apply for courses that they understand or that their parents are familiar with. A university recently offered a course on social media analytics, which you’d think would be quite trendy and attractive but they only had two attendees. It used to be that universities would teach theory and the ability to learn and then industry would hire them and give them the skills, but now things are moving so fast that industry doesn’t have the skills and is looking for the graduates to bring them.

Looking at the strategy of moving to the cloud, and the changing role of the IT department, we’re finding that by outsourcing the day to day running of the technology there is a change in skills needed. It’s less about hands on IT and more about architecture, governance, and managing relationships with third party providers. A lot of this is typically offered by the business faculty of a university, rather than the computing part. We need these groups to work closer together.

To a certain extent we’re addressing this with apprenticeships. IBM has been running an apprenticeship scheme for the last four years. This on-the-job training means that industry can give hands-on training with the best blend of up-to-the-minute technical, business, and personal skills, and it has been very effective, with IBM winning Best Apprenticeship Scheme awards from Target National Recruitment Awards, National Apprenticeship Services, and Everywoman in Technology.

In summary, we need to be looking at the new things that can be achieved by moving to cloud and shared services; exploiting mobile and the internet of things; and training for the most appropriate skills in the most appropriate way.

What is Cloud Computing? Is everything cloud?

Cloud is a consumption model. It’s the idea of taking away all the IT skills and effort required by a user and letting them focus on their actual functional requirements. All the IT detail is hidden from them in The Cloud. Smartphones and tablets have really helped consumers understand this concept. They’ve become liberated. Knowing very little about IT, they have become empowered with self-service IT to access functionality on demand. Within seconds they can decide that they want a business application; they can find it on an app store, buy and install it themselves, and be up and running using it. When they’ve finished, they can delete it.

CIOs are asking themselves why it can still take IT many months to get their business project up and running when in their personal lives they can have what they want when they want it.

The Cloud doesn’t take away the need for IT – for hardware, software, and systems management. It just encapsulates it. It puts it in the hands of the specialists working inside the cloud, and by centralising the IT and the skills, costs and risk can be reduced, and businesses can focus on their core skills with improved time to market and business agility.

It is confusing to talk about cloud without explaining whose point of view you’re looking at it from. Different people want different levels of complexity outsourced to the cloud.

Many users see cloud as a way of outsourcing all their IT. Some go even further and outsource the whole business process. I think the jury is out on whether cloud has to involve IT at all. Business Process as a Service (BPaaS) is talked about as one of the cloud offerings. I think the important thing is to let the customer get on with their core business and take away any activity that is not a differentiator for them.

Software as a Service (SaaS) is the area that most people think about first when they hear the word cloud. People have been using web based email for over 10 years. They don’t need to worry about maintaining a high spec PC and all the associated software. As long as they have a web browser they’re up and running. There is a move and a demand to make many, if not all, computer software applications available on the cloud, via simple consoles. Not unlike the idea of thin clients 15 years ago or mainframe terminals 40 years ago.

Moving down the stack a little further we come to a different group of users; the application developers. The people who want to be involved in IT, who want to create the business applications that run on the cloud. They still want to focus on business value though. They still want someone to take away the effort of writing the middleware. The code that is the same in 90% of all applications. The communication systems, the database, the interaction with the user. They want Platform as a Service (PaaS). An environment that’s just there, up and running, as and when they need it.

Finally we come to Infrastructure as a Service (IaaS). This is for real programmers or system administrators. For people who just want the base operating system to install or write the applications on, like they did in the old days. These people like the paradigm, of having a computer that’s all theirs. In the old days when their CIO wanted an environment for a new project they would request that someone find some data centre space, buy a PC, install it in the data centre with power and cooling etc, install the operating system, and then 6 months later hand it over to them to start the project development. Now they don’t need to worry about the physical world. They can just request the infrastructure as a service i.e. access to a brand new operating system install, and they’ll be up and running in minutes.

Of course these things can all run on top of one another. The business process can run on the software, which runs on the platform, which runs on the infrastructure, all provided as a service. But they don’t have to. The whole point is that the user doesn’t need to worry about what’s happening inside their cloud. There could just be an infinite number of monkeys with an infinite number of typewriters working away inside the cloud. As long as the user is getting the service that they’re looking for, they don’t care.

Which brings us to the other side of the picture. The cloud service providers. These can be traditional Managed Service Providers (MSPs), System Integrators, or the in house data centre offering the IT service to the lines of business. These guys are already taking the IT effort away from businesses, they’re already encapsulating and obfuscating the details of IT. But they’re in a competitive market, driven by the new expectations of the consumer and so they need to work smarter. They need to adopt some of these new architectures to be able to pass on the cost benefits and speed of delivery that their customer expects.

This is where some of the other terms associated with cloud computing come in – virtualisation, automation, standardisation. They’re not essential for cloud computing. The monkeys could do the job. But they really make it a lot easier. To make a step change improvement in delivery speed, IT departments need to share environments on the same computer. Instead of having hundreds of servers running at 50% capacity, they can have one bigger one and schedule who’s using the capacity when. Instead of manually installing the application and all its dependencies, and bug fixing and testing each part individually and together, they can standardise and automate and use virtual appliances to remove room for error. By introducing virtualisation, automation, and self-service, a private data centre moves towards enabling cloud. Similarly, pay-as-you-go and sharing services between companies are not cloud per se, but they are drivers towards, and benefits of, cloud.

So it can be confusing. People are talking about the same thing, but from different points of view. When people talk about cloud they might be talking about the hardware and automation in the data centre, or they might be talking about the complete absence of hardware by using business process outsourcing, they might be talking about handing all their data over to another company or they might be talking about making their private data more accessible to their own users.

So cloud covers a lot, but not everything.