Our Journey with Capability Monitoring at the FT

FT Product & Technology
8 min read · May 27, 2020

By Eric Anand

TL;DR: How do we understand that the things our business cares about are working as intended? We identify and monitor them. Here we dive into the beginning of the FT’s journey to monitor our capabilities, and how we did our first one.

So what’s the issue?

Currently, at the FT, we have extensive monitoring of our underlying systems, which is great. However, we don’t have an equally complete picture at a high level.

So what does that mean? Well, our Operations Support team currently monitor a dashboard full of hundreds of tiles which let them understand the state of a microservice depending on how it’s been configured, and whether it’s failing:

The Heimdall Operations dashboard, showing the 550+ system tiles the operations team monitor.

But does system “x” failing even matter? What is it impacting? Do operations need to call someone out to fix it out-of-hours? Can it wait till the morning? If we can understand these questions and monitor the “things” we care about at a high level, then even if underlying system “x” has a red tile, we can judge whether it is impacting our customers’ experience in a big way.

Could this help us with Incidents?

This leads nicely onto live incidents, as understanding the high level picture can allow us to react to these situations better. In many respects, the FT is already thinking about this “high level” picture during an incident. Depending on the scenario, our Operations Support team will often try to gauge customer impact by asking certain questions like:

  • Is the site still up?
  • Can we publish the news?
  • Can our readers read our content?
  • Can we subscribe users?
  • Can we get money from our users?
  • etc…

Having a way to present answers to these questions, through monitoring that evaluates them quickly and repeatably, would hugely benefit us during these scenarios and provide clarity to all involved.

How will the Business, Operations, and Delivery Teams Benefit?

From a variety of views, finding a way to monitor and visualise these questions in a single place provides great benefit. For example:

  • The Business: Will be able to understand whether the things they care about are working (e.g. the site is up), without needing to know the state of an underlying system. Visualising all this information in one place will also provide a single, clear picture.
  • The Operations team: Can understand whether a system failing actually means a critical business functionality is failing, and whether to call out support.
  • The Delivery team: Can have confidence about how their underlying systems affect critical business functionality, and whether a fix during an incident has fixed the issue for our customers.

Defining Capabilities at the FT

Now that we can see the value in monitoring the “things” which represent the “higher level picture”, what we need is a name for them, and a way to group them so that we can understand exactly what we care about. This is how we came up with Business Capabilities (or Capabilities for short), which are defined in Biz Ops. Biz Ops is where the FT keeps information about its business operations, such as the organisational structure, what our products are, and in-depth data about our microservices, all of it queryable via an API.

A list of our Business Capabilities from Biz Ops.

Thus, the aim of Business Capabilities is to help us understand the business functions carried out at the FT which are important to us as a brand or business.

Where can I see these Capabilities?

Now that we’ve got our capabilities defined, we need a place to visualise them. After some consideration about where this would best live, and how to get something out there quickly, we decided to build the new “Capabilities Dashboard” in Heimdall (our React-based front-end for visualising our monitoring data, named after the all-seeing Norse deity):

The Heimdall capability dashboard, which displays our monitored and unmonitored business capabilities.

This dashboard gets all the Capabilities defined in Biz Ops, and visualises them as “Untracked Capabilities” if they don’t have an end-to-end monitoring test associated with them. But why is there a green tile at the top then? Well, it turns out that behind your back we’d already been working on monitoring a capability…
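The tracked/untracked split described above can be sketched in a few lines. This is a minimal illustration, not Heimdall's actual code: the `Capability` class, its fields, and the capability codes are all hypothetical, and the real dashboard reads its data from Biz Ops and Prometheus rather than from a hard-coded list.

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    code: str   # hypothetical Biz Ops capability code
    name: str
    # Names of the end-to-end checks attributed to this capability (may be empty).
    checks: list = field(default_factory=list)


def split_capabilities(capabilities):
    """Return (tracked, untracked) lists, preserving the input order."""
    tracked = [c for c in capabilities if c.checks]
    untracked = [c for c in capabilities if not c.checks]
    return tracked, untracked


# Illustrative data only; the codes and names are made up for this sketch.
capabilities = [
    Capability("users-can-sign-in", "Users can sign in", ["login-email", "login-sso"]),
    Capability("publish-the-news", "We can publish the news"),
]

tracked, untracked = split_capabilities(capabilities)
print("Tracked:", [c.name for c in tracked])
print("Untracked:", [c.name for c in untracked])
```

A capability moves from the untracked group to a tile of its own as soon as at least one end-to-end check is attributed to it.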

What do you mean by End-to-End monitoring?

So while we were building this dashboard, we were also liaising with our Membership group in Sofia to get our first monitored capability! Exciting stuff indeed. During Q4 2019, we agreed to work together to understand what Membership-centric capability we could visualise for the upcoming dashboard. During these discussions we made clear that the end-to-end test for this capability:

  • Must be run from a user’s perspective
  • Must run on production
  • Must run often (at least every 5 minutes)
  • Must send the result to a datasource that Prometheus (the data aggregation layer for our monitoring) could scrape
  • Must be a direct check of the capability: what is the quickest way to understand that this capability is working as intended?
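As a rough illustration, the first four criteria are mechanical enough to be checked against a test's definition. The sketch below assumes a hypothetical `CheckDefinition` record and a hypothetical set of acceptable result sinks; none of these names come from the FT's tooling. The fifth criterion (being a direct check of the capability) is a judgment call and is deliberately left out.

```python
from dataclasses import dataclass

MAX_INTERVAL_SECONDS = 5 * 60  # "at least every 5 minutes"


@dataclass
class CheckDefinition:
    name: str
    environment: str       # e.g. "production" or "staging"
    interval_seconds: int  # how often the test runs
    user_facing: bool      # True if it exercises the product the way a user would
    result_sink: str       # where results are sent


def criteria_violations(check, scrapeable_sinks=("graphite",)):
    """Return the list of criteria this check definition fails (empty if none).

    `scrapeable_sinks` is an assumption: the sinks whose data can end up
    in Prometheus in this hypothetical setup.
    """
    problems = []
    if not check.user_facing:
        problems.append("must run from a user's perspective")
    if check.environment != "production":
        problems.append("must run on production")
    if check.interval_seconds > MAX_INTERVAL_SECONDS:
        problems.append("must run at least every 5 minutes")
    if check.result_sink not in scrapeable_sinks:
        problems.append("must send results somewhere Prometheus can scrape")
    return problems


good = CheckDefinition("login-email", "production", 300, True, "graphite")
bad = CheckDefinition("login-sso", "staging", 900, True, "graphite")
print(criteria_violations(good))
print(criteria_violations(bad))
```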

We then debated which capability would be the best first fit based on these criteria, and found that the “Users can sign in” capability was a good candidate because of the UI Selenium test suite Membership already had available. This test suite was then trimmed down to two tests:

  • Can a user log in to ft.com via email?
  • Can a user log in to ft.com via SSO?

Which looked something like this:

Our end to end testing of users logging in via email. Animated gif.
Our end to end testing of users logging in via single sign on. Animated gif.

These would then form the basis for the monitoring of this capability.

How do these end-to-end tests end up in Heimdall?

So now that the end-to-end tests had been defined, we needed a way to get the results of these tests into Heimdall. As Heimdall acts as the “view layer” of our monitoring aggregation stack, the end-to-end test would need to make its way through the Prometheus “data aggregation” layer first:

An architecture diagram showing the components of our monitoring system.

As can be seen, we would need to ingest these end-to-end tests using one of our existing monitoring sources. However, none of the monitoring sources available seemed like the right fit for this scenario. Thankfully, we were just about to start work on incorporating another type of monitoring source into Heimdall, which would fit the bill.

Grafana Alerts to the rescue

There’s something about a graph that is just so much better than seeing a raw data output. FT Technology certainly thinks so at least, as we did a quick discovery spike around whether we should add Grafana alerts into Heimdall. We found that there are around 350 Grafana alerts configured across all FT Grafana dashboards(!) — that’s a lot of monitoring that we’re not surfacing to our Delivery Teams and Operations Support!

What Grafana also has over all our other monitoring sources is the ability to incorporate different monitoring sources easily. Grafana at the FT already incorporates a host of third-party data sources (AWS metrics, Graphite, Prometheus, etc.), and the UI makes it easy to bring in your metrics and visualise them.

So, knowing that our colleagues at Membership understood how to get their login end-to-end results into Graphite, we knew that if we put some work into incorporating Grafana alerts into our data layer, Prometheus, we could ultimately get the data showing in the Capabilities Dashboard in Heimdall.

A graph depicting the state of a login end-to-end test over time.
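To give a feel for what getting a test result into Graphite can look like, here is a minimal sketch using Graphite's plaintext protocol, where each data point is a single line of the form `<metric path> <value> <timestamp>`. The metric path and the pass/fail encoding (1 for pass, 0 for fail) are assumptions made for illustration; the post does not say how Membership actually named or shipped their metrics.

```python
import socket
import time


def graphite_line(metric_path, passed, timestamp=None):
    """Format one test result as a Graphite plaintext-protocol line."""
    ts = int(time.time() if timestamp is None else timestamp)
    value = 1 if passed else 0  # assumed encoding: 1 = pass, 0 = fail
    return f"{metric_path} {value} {ts}\n"


def send_to_graphite(line, host, port=2003, timeout=5.0):
    """Send a line to a Carbon plaintext listener (2003 is Carbon's default port)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical metric path; a fixed timestamp keeps the example repeatable.
line = graphite_line("membership.e2e.login.email.success", True, timestamp=1590537600)
print(repr(line))
```

With a series like this in Graphite, a Grafana panel can graph it and an alert can fire when the value drops to 0, which is the kind of signal the graph above depicts.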

After we got our Grafana alerts into Prometheus, we utilised the tagging feature on the Grafana dashboard, allowing us to attribute a dashboard to a capability. The tag format for this was “capabilitycode_<business-capability-code-from biz-ops-here>”, for example:

The use of the tagging feature in Grafana to allow us to attribute capabilities to Grafana alerts.

This meant that when Prometheus scraped this dashboard, it would pick up this metadata, attribute it to the end-to-end check(s) on the dashboard, and add those checks to the correct tile in Heimdall, all based on the capability code.
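The tag convention lends itself to a small parsing step. The sketch below shows one way to pull the capability code out of a dashboard's tags and group alert states by capability. Only the `capabilitycode_` prefix comes from the post. The dashboard dictionaries, the alert state strings, and the `tile_status` roll-up rule are assumptions for illustration, not the Grafana exporter's real data model.

```python
TAG_PREFIX = "capabilitycode_"


def capability_code_from_tags(tags):
    """Return the capability code from the first matching tag, or None."""
    for tag in tags:
        if tag.startswith(TAG_PREFIX) and len(tag) > len(TAG_PREFIX):
            return tag[len(TAG_PREFIX):]
    return None


def group_alerts_by_capability(dashboards):
    """Map capability code -> list of (alert name, state) from tagged dashboards."""
    grouped = {}
    for dashboard in dashboards:
        code = capability_code_from_tags(dashboard["tags"])
        if code is None:
            continue  # dashboard is not attributed to any capability
        grouped.setdefault(code, []).extend(dashboard["alerts"])
    return grouped


def tile_status(alerts):
    """Assumed roll-up rule: a tile is healthy only if every check is 'ok'."""
    return "healthy" if all(state == "ok" for _, state in alerts) else "failing"


# Illustrative data; the capability code and alert names are hypothetical.
dashboards = [
    {
        "title": "Login end-to-end",
        "tags": ["membership", "capabilitycode_users-can-sign-in"],
        "alerts": [("login via email", "ok"), ("login via SSO", "alerting")],
    },
    {"title": "Unrelated", "tags": ["infra"], "alerts": [("disk", "ok")]},
]

grouped = group_alerts_by_capability(dashboards)
statuses = {code: tile_status(alerts) for code, alerts in grouped.items()}
print(statuses)
```

Keeping the mapping in a dashboard tag means a team can attribute its checks to a capability from the Grafana UI alone, without changing any monitoring code.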

Special mention to our Secondees/Bootcampers

Now while our team would like to take all the credit, we must give a special mention to our London and Sofia secondees who helped us make this possible:

  • Alexander Marzan joined us in Q4 2019, and did a bootcamp with us to make a fantastic start to the Grafana exporter which we could then build upon.
  • Yana Todorova and Georgi Ivanov joined us in Q4 2019, helping us standardize the Monitoring Aggregation stack, as well as actively developing on the Grafana Exporter and Heimdall.
  • Vasil Arnaudov joined us in Q1 2020, and shaped the end-to-end monitoring required for the login capability, allowing us to display our first real capability on the dashboard.

The help of these individuals and the collaboration with their respective teams — mostly remotely — shows we can work successfully together, and will hopefully lead us to more collaboration opportunities in the future.

This is not the end, just the beginning

This is very much the start of our journey at the FT when it comes to monitoring capabilities. We now:

  • Understand what we want from an end-to-end monitoring test for a capability
  • Have a working template to ingest the result of this test into our data aggregation layer, and
  • Have the ability to display the result of this data on our capabilities dashboard

So all we need to do now is get together with more teams in our organisation to monitor more capabilities, just like we did with Membership, and bring more green tiles to life (and perhaps some red ones…).

If you’re at the FT, be sure to look out for us! Or feel free to come chat to the team anytime on the #reliability-eng channel.

By Eric Anand (Senior Engineer)
