Are Prometheus & Grafana Sufficient To Support Modern IT?

Originally published here on October 21, 2022.

The Prometheus and Grafana combination is rapidly becoming ubiquitous in the world of IT monitoring. There are many good reasons for this. They are free, open source toolkits, so they are easy to get hold of and try out, and there is a lot of crowdsourced help available online for getting started. This even includes documentation from the developers of middleware such as IBM MQ and RabbitMQ.

At Nastel, where we focus on supporting integration infrastructure for leading-edge enterprise customers, we are finding that many of our customers have made a policy decision to use Prometheus and Grafana across the board for their monitoring. However, they are finding that it’s not sufficient in all situations.

Business Agility and Improving Time to Market

Speed is King. The business is constantly requesting new and updated applications and architecture, driven by changes in customer needs, competition, and innovation. Application developers must not be the blocker to business. We need business changes at the speed of life, not at the speed of software development.

For Black Friday, a large ecommerce site that typically has 3,000 concurrent visitors suddenly had to handle 1 million in a day! How can they handle 300 times as many visitors? If they can’t cope, then Black Friday could turn into a very red one, with a very public outage: a high-profile loss with serious reputational damage.

Evolving IT

IT is constantly evolving. Most companies moved their IT to the agile development methodology and then they added DevOps with automation, continuous integration, continuous deployment, and constant validation against the ever-changing requirements.

With agile, companies reduced application development time from two years to six months. With DevOps it went down to a month, and now, adding in cloud, companies like Netflix and Eli Lilly can get from requirements, to code, to test, to production in an hour. They’ve evolved their architectures from monolithic applications to service-oriented architectures to multi-cloud, containers and microservices. Microservices can quickly get pushed out by DevOps and moved between clouds, and they can be ephemeral, serverless and containerized. They use hyperscale clouds, so named because they have the elasticity to grow and shrink based on these dynamic needs. They have stretch clusters and cloud bursting so that burstable applications can extend from on-premises into a large temporary cloud environment with the necessary resources for Black Friday and then scale down again.

[Image: multi-cloud transaction tracking]

So now a single business transaction can flow across many different services, hosted on different servers in different countries, sometimes managed by different companies. The microservices are continuously getting updated, augmented, and migrated in real time. The topology of the application can change significantly on a daily basis.

Supporting Agile IT

So how is all this IT supported, and how do we know if it is working? How do we get alerted if there is a breakdown in the transaction flow or if a message gets lost? How can we monitor an application that was spun up or moved for only a short period of time?

By building Grafana dashboards on top of the Prometheus platform, a monitoring, visualization and alerting solution can be constructed which provides visibility of the environment. The question is whether this can be built and adjusted fast enough to keep up with the ever-changing demands of the business and IT.

Nastel’s Integration Infrastructure Platform

Nastel’s Integration Infrastructure Management platform addresses this. Its XRay component dynamically builds a topology view based on the real data that flows through the distributed application. It gives end-to-end observability of the true behavior of your application in a constantly changing agile world. It highlights problem transactions based on anomalies, and enables you to drill down to those transactions, carry out root cause analysis and perform remediation. The Nastel solution receives data from across the IT estate, including from Prometheus, and allows rapid creation and modification of visualizations and alerts in a single tool.
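To make the idea of deriving a topology from observed traffic more concrete, here is a minimal Python sketch. It is not a description of how XRay works internally; it only assumes that each transaction can be recorded as an ordered list of the services it passed through. The function name build_topology and the service names are hypothetical.

```python
from collections import Counter


def build_topology(transactions):
    """Count the service-to-service hops observed across all transactions."""
    edges = Counter()
    for hops in transactions:
        # Pair each service with the next one the transaction reached.
        for source, target in zip(hops, hops[1:]):
            edges[(source, target)] += 1
    return edges


# Hypothetical observed transactions, each listed in the order it was handled.
observed = [
    ["web", "orders", "payments"],
    ["web", "orders", "inventory"],
]
topology = build_topology(observed)
for (source, target), count in sorted(topology.items()):
    print(f"{source} -> {target}: {count} transaction(s)")
```

Because the edges come from the traffic itself, a service that appears or disappears shows up in the next rebuild without anyone editing a diagram by hand.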

Furthermore, the Nastel technology adds in almost 30 years’ experience of supporting large production middleware environments. It has deep granular knowledge of the technologies and issues, and uses learned and derived data to provide AIOps and Observability, in addition to traditional event-based monitoring.

Enhancing your existing infrastructure

Companies use Nastel’s Integration Infrastructure Management (i2M) solution to enhance their existing tools. We take a proactive SRE approach, based on an in-depth understanding of the middleware and monitoring history, to prevent the outage altogether by monitoring key derived indicators such as:

  • Latency – time to service a request
  • Traffic – demand placed on the system
  • Error Rate – rate of failed requests
  • Saturation – which resources are most constrained
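As a rough illustration, the four indicators above could be derived from raw request records along these lines. This is a sketch under stated assumptions rather than how any particular product computes them: the Request type, the golden_signals function, the choice of a 95th percentile for latency, and the use of busy workers as the saturation measure are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to service the request
    ok: bool            # False if the request failed


def golden_signals(requests, window_seconds, busy_workers, total_workers):
    """Summarize one observation window into the four indicators."""
    count = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile; 0.0 when the window is empty.
    p95 = durations[math.ceil(0.95 * count) - 1] if count else 0.0
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": count / window_seconds,
        "error_rate": failed / count if count else 0.0,
        "saturation": busy_workers / total_workers,
    }


# Hypothetical window: ten requests in five seconds, one failure,
# and three of four workers busy.
window = [Request(duration_ms=10.0 * i, ok=(i != 7)) for i in range(1, 11)]
signals = golden_signals(window, window_seconds=5, busy_workers=3, total_workers=4)
print(signals)
```

Alerting on values like these, rather than on individual events, is what allows a problem to be spotted while it is still building.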

Prometheus & Grafana can give high-level monitoring of a static environment, but they are two separate products, and time is money. Being able to fix issues without war rooms and handovers from support back to development is time critical. Do you have enough diagnostic data, skills and tooling? Is your monitoring provider able to support you on the phone with middleware expertise throughout the outage? Just how important is your integration infrastructure? The Nastel Platform also includes Nastel Navigator to quickly fix the problem.

As business requirements change, it is crucial to be able to change the dashboards and alerts to match. Nastel XRay is built with this as a core focus. Its time series database, dashboards and alerting are all integrated together. The dashboards are dynamically created and change automatically as the infrastructure and message flows change. This requires minimal time for set-up, so monitoring is not a blocker to business. Rather than asking for a screenshot of a monitoring environment in use, ask for a demo of how long it takes to build a dashboard.

[Animation: XRay transaction tracking]

Nastel is the leader in Integration Infrastructure Management (i2M). You can read how a leading retailer used Nastel software to manage their transactions here and you can hear their service automation leader discussing how he used Nastel software to manage changing thresholds for peak periods here.

Is your IT environment evolving like this? What are your experiences with Prometheus and Grafana? Please leave a comment below or contact me directly and let’s discuss it.

Why You Need APM—and How it Works

Originally published here on March 8, 2022.

There’s a lot to consider when engineering and implementing software, whether as an update patch or a newly introduced product. End users have certain expectations when introduced to new or updated software. At the top of the list are aesthetics, ease of use, stability, and response time, the last two of which can be significantly improved when you employ application performance management, or APM.

What Does APM Do?

The primary purpose of APM is to improve response time between a source application, the final destination, and, in many cases, back to the source. The source and destination may vary. In one situation, the source and the destination may be a mobile game app, such as Jem Junkies, a fictional puzzle game.

In another situation, this time with fictional online auction site BiddingBuddies.com, the initial source is a web page. The initial destination is the server on which all Bidding Buddies data resides. In this case, the source and destination change, depending on which entity is sending and receiving the data.

Both situations are dramatically different, but their ultimate technical objectives are very similar: to offer end users the best experience they can provide so they’ll come back and play again.

A key component for maintaining the end user’s satisfaction is response time. If an app is slow, or if an auction site continually hoses the bidder out of a final bid due to the poor reflexes of the site, end users will move on. Nobody wants that. APM is put into place at each site to inject new agility and quicker response times into their systems.

APMs can employ either passive or active monitoring, and the monitoring method used depends on the needs of the software.

APM Passive Monitoring Watches — But Doesn’t Fix

You’re playing Jem Junkies, your favorite puzzle game, on your phone. You’ve reached level 36, and you go to click on a gem—and nothing happens. Just as the gem finally lights up as if to say, “You can do what you want with me now!” your board blows up, because your timer ends a split second before you can move it. This isn’t the first time this has happened, and you’ve been stuck on level 36 for far too long.

Now, Jem Junkies doesn’t maintain a continuous endpoint connection with user devices. It’s a game that, once installed, remains resident on the user’s device and occasionally receives an update. For the player of this game, the response time is dependent entirely on the app’s interactions between the device and the end-user.

The APM for this game is pretty much self-contained within the app itself, with only occasional interaction with Jem Junkies headquarters. When the user agrees to the EULA, the app receives permission to send data to the mothership for purposes of improving and updating the user’s gaming experience. Once the user gives their permission, an additional module is installed along with the game itself.

Each time you open Jem Junkies, the installed module tracks each game session, taking note of things like device specs, device location, and the username that’s tied to the game. The module monitors activity like how long you played, which level you reached, how many times you play each day, and what time of day you most often play. The module holds onto this information, and once you open the game again to play, it uploads the information of the last play session to Jem Junkies headquarters, where all of the data from all of the users who play Jem Junkies is aggregated. This information can be called up into a database and arranged into human-readable tables for QA techs to peruse.
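The collect-now, upload-on-next-launch behavior described here can be sketched in a few lines of Python. Everything in the sketch is hypothetical, since Jem Junkies is fictional: the SessionRecord fields, the PassiveMonitor class, and the upload callback that stands in for the connection to headquarters.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SessionRecord:
    username: str
    device: str
    os_version: str
    level_reached: int
    seconds_played: int


class PassiveMonitor:
    """Holds session data locally and uploads it on the next launch."""

    def __init__(self, upload):
        self._upload = upload  # callable that receives one JSON payload
        self._pending = []

    def end_session(self, record):
        # Nothing is sent while the user plays; the record is only stored.
        self._pending.append(record)

    def start_session(self):
        """Flush stored records and return how many were uploaded."""
        if not self._pending:
            return 0
        self._upload(json.dumps([asdict(r) for r in self._pending]))
        sent = len(self._pending)
        self._pending.clear()
        return sent


uploaded = []  # stands in for the headquarters endpoint
monitor = PassiveMonitor(upload=uploaded.append)
monitor.end_session(SessionRecord("player1", "PhoneA", "OS 1", 36, 540))
print(monitor.start_session(), "record(s) uploaded on next launch")
```

Note that the monitor only observes and reports. Nothing in it tries to diagnose or repair anything on the device.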

Through painstaking scrutiny provided by a pattern recognition algorithm, QA can discern that an abnormally high percentage of users who play the game on a particular smartphone with a particular operating system seems to stop playing at level 36. QA also notices that these same users open the app repeatedly to level 36 but can’t seem to get past it.
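A very simple version of this kind of pattern spotting might look like the following Python sketch. It assumes each aggregated record is a (device, operating system, last level reached) tuple and flags any device and OS combination whose share of sessions ending on a given level is well above the overall share. The function names and the factor of 2.0 are hypothetical choices for illustration, and a real pattern recognition algorithm would be considerably more sophisticated.

```python
from collections import defaultdict


def stuck_rates(sessions, level):
    """Share of sessions per (device, os) cohort that ended on `level`."""
    totals = defaultdict(int)
    stuck = defaultdict(int)
    for device, os_version, last_level in sessions:
        cohort = (device, os_version)
        totals[cohort] += 1
        if last_level == level:
            stuck[cohort] += 1
    return {cohort: stuck[cohort] / totals[cohort] for cohort in totals}


def flag_cohorts(sessions, level, factor=2.0):
    """Cohorts whose stuck rate exceeds `factor` times the overall rate."""
    if not sessions:
        return []
    overall = sum(1 for s in sessions if s[2] == level) / len(sessions)
    rates = stuck_rates(sessions, level)
    return sorted(c for c, rate in rates.items() if rate > factor * overall)


# Hypothetical aggregated data: PhoneA users mostly stall on level 36.
sessions = (
    [("PhoneA", "OS 1", 36)] * 8
    + [("PhoneA", "OS 1", 40)] * 2
    + [("PhoneB", "OS 2", 36)] * 1
    + [("PhoneB", "OS 2", 50)] * 29
)
print(flag_cohorts(sessions, level=36))
```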

To their dismay, QA further notices that a good 25 percent of these same users haven’t opened the app in three days or more. They seem to have given up and moved on to another game, which Jem Junkies cannot track. The marketing team is not going to like this.

The company immediately gets game testers on it and soon finds out that when a player plays the game on one particular phone with one particular operating system, about midway through level 36, whatever gem lands in column 3, row 12 has a tendency to get stuck. The gem will light up, but it will not move when the user attempts to drag it across the screen, which is the whole point of the game!

It turns out that the issue wasn’t that the level was unwinnable. It was just that the level became unplayable.

Testers inform the development team of the issue with the particular phone brand and operating system. Developers scour through the code and find the culprit: a misplaced semicolon. Had the semicolon been placed one character to the right, this level would have been easily winnable.

A programmer fixes the bug, pushes an update to devices running the offending operating system, and you’re finally able to move that gem and get out of level 36.

This example highlights Passive Monitoring. The module collects data as the user plays the game, then sends the data to a location where the information is parsed and interpreted by someone else. Although the app provides information crucial to diagnosing and repairing issues, all diagnostics and fixes are left to the software’s QA and development teams to figure out.

Active Monitoring Can Diagnose and Fix Many Problems

We can almost feel the dismay and defeat that you experienced when, in the last two seconds of a Bidding Buddies war, you missed out on that oversized Elvis face pillow because your final click—which you made just a moment before the timer ended—didn’t register until two seconds after the timer was done.

You can blame your home network, which is having another bad hair day, or you could blame biddingbuddies.com for the lag they experience on the reg because they have a substandard or non-existent APM system.

Bidding Buddies gets wind of the frequent lagging problem and decides to employ an automated monitoring service that inserts virtual bots into their system. The bots immediately begin to follow the flow of the site’s traffic and report errors and anomalies back to a bank of monitors, attended by a systems administrator or some other tech-savvy interweb guru. This data can reveal any number of pain points within the auction site’s internal system.
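One such bot can be thought of as a small probe that performs a scripted action, times it, and reports the outcome. The Python sketch below shows the idea under simple assumptions: the probe function and its result fields are hypothetical, and the check argument stands in for whatever scripted action (placing a test bid, loading a page) a real monitoring service would run.

```python
import time


def probe(check, max_seconds):
    """Run one synthetic check and report its outcome and timing."""
    started = time.perf_counter()
    try:
        check()
        error = None
    except Exception as exc:  # report any failure instead of crashing the bot
        error = repr(exc)
    elapsed = time.perf_counter() - started
    return {
        "ok": error is None and elapsed <= max_seconds,
        "elapsed_s": elapsed,
        "error": error,
    }


def place_test_bid():
    # Hypothetical stand-in for a scripted bid against a test auction.
    time.sleep(0.01)


print(probe(place_test_bid, max_seconds=1.0))
```

Run on a schedule, probes like this surface slow or failing paths even when no real bidder happens to be using them.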

Although many APM products have onboard tools integrated into the overall system that can execute improvement and repair functions with little to no human intervention, such systems can be quite spendy, so Bidding Buddies decides to go with just the monitoring function and leave it up to the systems administrators to come up with the solutions.

With APM systems that utilize this type of active monitoring structure, admins can quickly identify, locate, and diagnose issues that can affect the response time performance of their entire architecture. They can keep watch on the flow of information between their database and their customer-facing app and locate patterns in traffic flow.

For instance, Bidding Buddies admins find that they experience a midday increase of traffic as people across the country spend their lunch break engaging in a virtual battle for the right to possess random knick-knacks, baubles, and gewgaws. Admins are then able to identify a corresponding slowdown: the response times between end users’ work computers and the Bidding Buddies servers become longer and longer, until around 1:00 PM Bidding Buddies time, at which point traffic begins to fall off and response times improve once more.

APM that employs active monitoring allows admins to quickly determine when certain load thresholds slow down the internal network. With this knowledge, they know that they may need to request additional servers to handle the high-traffic overflow. Alternatively, the admins may decide that their server capacity is just right but that there is a bottleneck in their routing, in which case they can add physical switches or implement automated redirects when the traffic load gets too high.
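Finding such a load threshold from monitoring data can be as simple as the Python sketch below. It assumes the monitoring system yields samples of time of day, requests per minute, and response time. The Sample type, the function names, the 500 ms limit, and the sample values are all hypothetical.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    time_of_day: str
    requests_per_min: int
    response_ms: float


def breaching(samples, max_response_ms):
    """Samples whose response time exceeded the acceptable limit."""
    return [s for s in samples if s.response_ms > max_response_ms]


def load_threshold(samples, max_response_ms):
    """Lowest observed load at which the limit was breached, or None."""
    over = breaching(samples, max_response_ms)
    return min(s.requests_per_min for s in over) if over else None


# Hypothetical lunchtime readings.
readings = [
    Sample("11:00", 800, 120.0),
    Sample("12:00", 2400, 900.0),
    Sample("12:30", 3000, 1400.0),
    Sample("13:00", 1000, 150.0),
]
slow = breaching(readings, max_response_ms=500.0)
print([s.time_of_day for s in slow])
print(load_threshold(readings, max_response_ms=500.0))
```

Knowing both when the slowdown happens and at what load it starts gives the admins a concrete figure to plan capacity or redirects around.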

The active monitoring system Bidding Buddies employs searches out patterns that can cause a breakdown in application performance. This type of active monitoring service can diagnose many problems—and with the right integration, it can also improve performance by actively repairing some of those breakdowns with little or no need for human intervention.

In the case of Bidding Buddies’ lunchtime slowdown, admins were able to pin down a temperature spike that approaches dangerous thermal levels on a specific bank of servers at a certain time of day. This information helps them trace the rising heat levels to Phil from accounting, who naps in the server room at lunchtime and blocks a particularly critical AC vent with his sleeping bag.

It was Phil the whole time.

APM Saves the Day

While these are just two fairly rudimentary examples of APM, they highlight the necessity for vigilant monitoring and management of any given system’s performance. The experiences with Jem Junkies and Bidding Buddies—or very similar issues—aren’t all that uncommon. They hit mobile and PC gamers, online bidders, and purchasing agents every day.

Jem Junkies implemented their APM in short order and managed to woo back many of their lost players with a promise of 100 Jem Bucks (redeemable only in-app for powerups, of course).

Likewise, Bidding Buddies could use their APM system to finally rid themselves of that pesky Phil from accounting. Had biddingbuddies.com implemented the right APM much sooner, you might be snuggling with your Elvis pillow right now. We would offer you ours, but we won the bid fair and square.