Designing for observability

Observability is an alarmingly underestimated quality of solution architecture, because it’s never the ‘business’ who asks for it. But it’s absolutely essential for achieving good auditability and operability, and it helps you monitor your capacity, performance and availability.

Retrofitting an existing architecture with observability features can be unnecessarily costly. But if you design for it from the start, you can build in observability seamlessly, elegantly and efficiently, and the time saved during testing and early operations wins back some of the cost of adding it.

While observability is a term in control theory, its definition is beautifully simple and easily translatable to software design: ‘Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs’.

You design for observability by adding the following features (that provide ‘external outputs’) to your solution architecture:

  • Logs
  • Metrics
  • Heartbeats
  • Dashboards
  • Reports

Let’s look at how each of these features can be added to your architecture. Consider a simple service designed to allow client systems to initiate collection, which ultimately requires integration with the ERP system:

The client system calls a synchronous service to initiate a collection (sending an invoice and collecting payment). Valid requests are turned into messages written to a queue.

A backend service picks up the message, generates and sends the invoice, and posts the transaction in the ERP system. Once payment is received, the ERP calls a service, which publishes an update back to the subscribing client system.

This is a pattern commonly used in service-oriented architectures. Queueing serves two purposes. Firstly, it relieves the client system of having to wait for the invoice to be generated, sent and posted. Secondly, it makes the client service available even when the ERP is down for maintenance.
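
To make the examples that follow concrete, here is a minimal sketch of what the client-facing service could look like. I’m assuming Python with Flask for the synchronous endpoint and RabbitMQ (via pika) for the queue; the endpoint, queue name and payload fields are illustrative, not prescribed by the design:

    import json
    import uuid

    import pika
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    QUEUE = "collection.requests"   # illustrative queue name

    def publish(message: dict) -> None:
        """Write the validated request to the queue; a real service would reuse the connection."""
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=QUEUE, durable=True)
        channel.basic_publish(
            exchange="",
            routing_key=QUEUE,
            body=json.dumps(message),
            properties=pika.BasicProperties(delivery_mode=2),  # persist the message
        )
        connection.close()

    @app.route("/collections", methods=["POST"])
    def initiate_collection():
        payload = request.get_json(silent=True)
        # Minimal validation; the real service would validate the full invoice structure.
        if not payload or "invoice" not in payload:
            return jsonify({"error": "invalid request"}), 400
        transaction_id = str(uuid.uuid4())
        publish({"transactionId": transaction_id, "invoice": payload["invoice"]})
        # Respond immediately; generating, sending and posting the invoice happens asynchronously.
        return jsonify({"transactionId": transaction_id}), 202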

But this design doesn’t provide great observability. Staring at the humming servers doesn’t tell you whether the system is currently processing anything, and if so, whether things are going well. You can log into the ERP system and see that new postings have been added, or call your colleagues who use the client system and ask if they are sending any invoices and getting some payment notifications back. This is not a great observability quality! So, let’s add logging!

Logs

Adding logging to the collection system could look like this:

To add logging, first you need to set up a log system. This can be a simple database table or a cloud-based service.

Then you need to add code to all services so that they write their received requests and the responses they return to the log. Since I already use queuing in the base design, I prefer to just write each log entry as a message to a queue and have a helper service write them to the actual log system. This isolates the dependency on the currently chosen log technology to a single service.
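
A minimal sketch of that approach, again assuming Python, pika and RabbitMQ, with a simple SQLite table standing in for whichever log system you end up choosing; all names are illustrative:

    import json
    import sqlite3

    import pika

    LOG_QUEUE = "observability.logs"   # illustrative queue name

    def emit_log(channel, service: str, direction: str, payload: dict) -> None:
        """Called from every service: write the request or response verbatim to the log queue."""
        entry = {"service": service, "direction": direction, "payload": payload}
        channel.basic_publish(exchange="", routing_key=LOG_QUEUE, body=json.dumps(entry))

    def run_log_writer(db_path: str = "logs.db") -> None:
        """Helper service: the only component that knows which log technology is in use."""
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS log (service TEXT, direction TEXT, payload TEXT)")

        def on_message(ch, method, properties, body):
            entry = json.loads(body)
            db.execute("INSERT INTO log VALUES (?, ?, ?)",
                       (entry["service"], entry["direction"], json.dumps(entry["payload"])))
            db.commit()
            ch.basic_ack(delivery_tag=method.delivery_tag)

        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=LOG_QUEUE, durable=True)
        channel.basic_consume(queue=LOG_QUEUE, on_message_callback=on_message)
        channel.start_consuming()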

Logs should contain exact copies of all incoming and outgoing data. As such, logs are likely to contain sensitive information, so access must be limited to the fewest possible carefully vetted system operators.

Logs usually require an enormous amount of storage space, so log data can’t be retained for longer periods. On the other hand, you mostly only look in logs to investigate an incident, so in most cases you will find what you need with as little as three months’ worth of log data.

Short retention time is also the main reason logs aren’t useful as information sources for long-term performance analysis of your solution. Either you can only ever see, for example, three months back, or you have to extract and aggregate data and store it elsewhere, where it can be kept for longer periods without exhausting all available storage space. But you lose details in the aggregation that can’t be recreated later.

Also, log data structures change frequently as you fix bugs and add new functionality. Therefore, extracting metrics from logs may require frequent adjustments as well, adding to the long-term cost of maintaining the necessary degree of measurability.

Finally, even if you try to piggy-back on your logs for analyzing your system’s performance, your analysis will be slowed down by traversing large volumes of data to extract the few really relevant measurements you need.

Most companies I know have chosen to send their logs to Splunk, either running on locally virtualized servers or on remotely hosted clouds. Splunk’s pricing appears to be more suitable for large data volumes with short retention times, and companies want some degree of predictability in their infrastructure costs.

Regardless of your choice I recommend that you maintain separate setups for logs and metrics. Let’s add metrics!

Metrics

In my early days as a developer, we only ever monitored a few server metrics like CPU utilization, disk usage, thread counts and memory usage, to gauge how close our software was to crashing the machine.

In 2014-15 I helped maintain a large system where such a makeshift dashboard had already been put in place. The customer had very specific performance and capacity demands, and there was no way to demonstrate conformity to those demands using the limited data available in the dashboard. So, I designed a stopwatch service with an underlying database to collect metrics; I published an article about it over at LinkedIn: Adding metrics to your legacy code.

As the article’s title suggests, the stopwatch service was added to an old architecture with the intention of limiting the necessary design changes to existing code. It worked (and still does), but I have since refined how I capture metrics in newer designs.

A couple of recent technology shifts have helped pave the way for a very efficient and elegant approach.

  • Cloud-based logging services like elastic, Loggly and Splunk have gained traction and are often available in the infrastructures I’m targeting in my newer designs. They are easier to send metrics to and come with advanced built-in reporting and dashboard capabilities.
  • Microservices have become the norm in SOA landscapes. While my new approach to metrics doesn’t depend on services being microservices, building smaller (but therefore more) services tends to result in metrics that more naturally reflect the flows in the system.
  • Message queue managers are usually available in the infrastructures I target in my newer designs, and with solid open-source choices like RabbitMQ (available as a service from CloudAMQP), the technology can be added without enormous costs. MQ managers don’t contribute to the reporting and dashboard capabilities specifically but serve other essential purposes in a SOA landscape. And when you incorporate them in your design, they must be included in your metrics setup; luckily, queuing can also play a useful role in the metrics setup itself.

Now, as a solution architect most often working in large corporations, I find that access to servers is highly compartmentalized and therefore typically out of my reach. And multiple systems run on the same shared servers, which means that server metrics don’t say anything about your particular system’s utilization of the server’s resources.

Adding metrics to the collection system could look like this:

Just as you added code for logging, you now add even more code to record start and stop times and emit one or more metrics from every service in your architecture.

As with logs, metrics could be stored in a simple database table, or you can set up a NoSQL database, which would most likely perform better. You can also use a cloud-based service.

As with logging, I prefer to emit metrics by writing them as messages to a queue and have a helper service write them to the currently chosen metrics system to isolate the technology dependency.
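
A sketch of what that isolation could look like; the queue name and helper function are assumptions, not prescribed parts of the design:

    import json

    import pika

    METRICS_QUEUE = "observability.metrics"   # illustrative queue name

    def emit_metric(channel, metric: dict) -> None:
        """The only metrics code each service needs: drop the metric on the queue and move on."""
        channel.basic_publish(exchange="", routing_key=METRICS_QUEUE, body=json.dumps(metric))

    def write_to_metrics_store(metric: dict) -> None:
        """Helper service internals: the single place that knows whether the store is a
        database table, a NoSQL database or a cloud service. Swap this out without
        touching any other service."""
        ...  # e.g. an INSERT into a table, or an HTTP call to the chosen cloud service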

After much experimentation and refinement, I have settled on the following metrics structure:

Each field is listed below with its type and average size in bytes.

  • Id (string, 36)
    Unique identifier of the metric, usually a UUID.
  • Timestamp (timestamp, 6)
    This timestamp should indicate when the metric was written to the metrics database. Comparing this timestamp with the Stop timestamp enables you to monitor the delay in getting measurements stored and available for analysis.
    If you see frequent and long delays, you can’t expect to establish a worthwhile dashboard showing current activity in the system.
  • Chain (string, 36)
    Unique identifier of the measured activity, usually a UUID. The activity often starts when a client system calls one of the system’s client services, which in turn calls other backend services to carry out all the orchestrated steps of that activity.
    The Chain identifier should be passed to every called backend service, enabling each invoked service to specify the same Chain identifier in all its own metrics.
    Filtering on the Chain field results in a list of all the metrics generated throughout the system as the activity was carried out by multiple services and other subsystems. This helps you understand how long it takes to complete complex series of steps in the flow of the activity.
  • Environment (string, 10)
    This field specifies the environment in which the measured activity is executing.
    In most of my projects, typical environment values are DEV, TEST, PREPROD and PROD. Even though you might save metrics separately within each environment, I still recommend that you keep and populate this field. It will help when doing comparative analysis of similar metrics across the different environments.
  • Client (string, 10)
    This field identifies the (usually external) system initiating or requesting the activity.
    As the architect of your solution, you will be assigning Client identifiers to any connecting external system, and you may assign some Client values to internal technical features like heartbeats.
    Normally, I wouldn’t bother to prevent external systems from accidentally identifying themselves with another client system’s identifier, but if you feel there’s a risk of misuse or integrity issues, you might be able to validate the calling system’s Client identifier against known IP addresses or other security-related tokens or keys.
  • Reference (string, 36)
    The Reference field should carry the external client system’s own identifier of the initiated activity. I always urge client system developers to use a UUID.
    The Reference field doesn’t really contribute to measurability, but it certainly increases the solution’s testability. During testing, the metrics are more readily available than logs, and being able to also see the calling system’s identifiers helps recognize specific transactions (or rather their metrics) so they can be followed through the flow in your architecture in the live metrics feed.
  • Server (string, 20)
    This field should specify the server’s fully qualified name in the network.
    If you’re using load balancing in front of multiple servers, it’s relevant to monitor that they all perform equally well.
    If one performs poorly, it may need maintenance or replacement, or the load balancer may be configured incorrectly.
  • Name (string, 20)
    This is the name of the metric. Ideally, your solution architecture document includes a list of all metrics your design may emit.
  • Context (string, 40)
    This field can be used to specify what the measured activity is working with. In my designs it’s most often the transaction identifier I assign to incoming requests.
    If the metric is measuring moving a file via SFTP, it could be the file name.
    If it’s a queue I’m counting the number of messages in, it would be the queue name.
  • Count (integer, 4)
    If the metric counts anything, like the number of messages currently found in a queue, it goes here.
    If it doesn’t count anything in particular, I always set it to 1.
    If my design incorporates channeling synchronous requests via queues to asynchronous backend services, I always add a retry feature, in which case I use the Count field in the metrics as the attempt count.
  • Start (timestamp, 6)
    This is the start time. The developers will create a local variable in their code and assign the current time to it as the very first line of code in the service body.
  • Stop (timestamp, 6)
    The stop time is set to the current time as late as possible, which means after the metric structure has been created with all the fields filled in (with the stop time as the last field to be filled in).
    Naturally, the stop time is recorded before emitting the metric, which means that writing the metric to the designated queue or database table, and returning the service response, aren’t included in the measurement.
    This is OK because these two unmeasured steps remain constant, and the metrics are mostly used to watch for trends.
  • Time (integer, 4)
    The time in milliseconds between Start and Stop. While this can be calculated when needed, I have found that constantly calculating this value in dashboards, search filters etc. increases complexity and lowers performance to an extent that makes it worthwhile to just calculate it when writing the metric to the database and accept the extra few bytes of data consumption per metric.
  • Result (string, 8)
    This field tells whether the measured activity was successful or not. It helps exclude activities that errored out, because they run for shorter times than successful activities, resulting in more accurate measurements of normal (successful) activity.

A typical metric requires an average of 238 bytes of storage, but you should adjust your space allocation to your specific solution architecture.
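
As a sketch of how this structure might be expressed in code (Python, with epoch timestamps standing in for proper timestamp columns, and field names deliberately mirroring the table above):

    import time
    import uuid
    from dataclasses import asdict, dataclass

    @dataclass
    class Metric:
        # Field names follow the structure above; the capitalization is kept for that reason.
        Id: str             # unique identifier of the metric, usually a UUID
        Chain: str          # shared by all metrics of one activity
        Environment: str    # DEV, TEST, PREPROD or PROD
        Client: str         # identifier of the initiating system
        Reference: str      # the client system's own identifier of the activity
        Server: str         # fully qualified server name
        Name: str           # name of the metric
        Context: str        # e.g. transaction id, file name or queue name
        Count: int          # attempt count or counted messages; 1 when nothing is counted
        Start: float        # epoch timestamps; a real design might use datetime columns
        Stop: float
        Time: int           # milliseconds between Start and Stop, precomputed
        Result: str         # OK, INVALID, RETRY or ERROR
        Timestamp: float = 0.0   # set when the metric is written to the metrics store

    def finish(metric: Metric) -> dict:
        """Fill in Stop as late as possible, precompute Time, and return a dict ready to emit."""
        metric.Stop = time.time()
        metric.Time = int((metric.Stop - metric.Start) * 1000)
        return asdict(metric)

The developer assigns Start at the top of the service body, fills in the remaining fields as the work completes, and calls finish() just before emitting the metric, so that the queue write and the service response stay outside the measurement.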

Including the result of the measured activity allows for richer analysis of the metrics. And it also helps testability, as the metrics live feed can reveal the results of test calls sooner than it would take to look them up in the logs. In my designs I usually implement the following Result values:

  • OK
    The measured activity was successful.
  • INVALID
    The received service request was found invalid. Distinguishing between errors and invalid requests helps in testing and monitoring. In testing, it indicates that any bugs should probably be found and fixed in the calling system. In operations, it helps system operators notify external systems of their problems with constructing valid requests.
  • RETRY
    The request was valid, but a temporary problem was encountered before the activity could be successfully completed, and it will be retried later. This is a great way to make the architecture self-recovering when external endpoints or database servers are temporarily unavailable. I always increment the Count value before sticking the unfinished request back into the error queue (see the sketch after this list). This way, when it is retried, the Count value shows the total number of times the operation has been attempted.
  • ERROR
    The request was valid, but an unrecoverable error was encountered. This happens when a called subsystem returns an error that can’t be retried, or when the last automatic retry was attempted without success.
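
Here is a sketch of how a backend service could set Result and the attempt count; post_to_erp and TemporaryFailure are hypothetical stand-ins for the real ERP integration and its recoverable errors, and emit_metric is assumed to build a metric as described above:

    import json

    import pika

    RETRY_QUEUE = "collection.retry"   # the design's error queue; the name is illustrative

    class TemporaryFailure(Exception):
        """Raised when a called subsystem is temporarily unavailable (illustrative)."""

    def post_to_erp(message: dict) -> None:
        """Hypothetical stand-in for the real ERP integration."""

    def handle(channel, message: dict, emit_metric) -> None:
        """Decide the metric's Result and Count depending on how the activity ends."""
        attempt = message.get("attempt", 1)
        if "invoice" not in message:
            emit_metric(Result="INVALID", Count=attempt)   # the caller sent a bad request
            return
        try:
            post_to_erp(message)
            emit_metric(Result="OK", Count=attempt)
        except TemporaryFailure:
            # Increment before re-queueing, so Count always shows the total number of attempts.
            message["attempt"] = attempt + 1
            channel.basic_publish(exchange="", routing_key=RETRY_QUEUE, body=json.dumps(message))
            emit_metric(Result="RETRY", Count=attempt)
        except Exception:
            emit_metric(Result="ERROR", Count=attempt)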

Metrics are designed to be small enough that you can easily retain 3 to 5 years’ worth of data. But as the volume grows, analysis and reporting may slow down. Therefore, it can be worthwhile to aggregate metrics into supplementary data sets that can be fed into your dashboard and serve as content in reports.
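
Such an aggregation can be as simple as a periodic roll-up job. A sketch, assuming the raw metrics sit in a SQLite table named metric with the fields described above and Start stored as an epoch timestamp:

    import sqlite3

    def aggregate_daily(db_path: str = "metrics.db") -> None:
        """Roll the raw metrics up into a per-day summary that dashboards and reports can read."""
        db = sqlite3.connect(db_path)
        db.execute("""
            CREATE TABLE IF NOT EXISTS metrics_daily (
                day TEXT, name TEXT, result TEXT,
                activities INTEGER, avg_time_ms REAL, max_time_ms INTEGER
            )""")
        # Naively rebuilt in full on every run; a real setup would roll up incrementally.
        db.execute("DELETE FROM metrics_daily")
        db.execute("""
            INSERT INTO metrics_daily
            SELECT date(Start, 'unixepoch') AS day, Name, Result,
                   COUNT(*), AVG(Time), MAX(Time)
            FROM metric
            GROUP BY day, Name, Result""")
        db.commit()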

I have had good experiences with using elastic and Loggly to capture and analyze metrics. Some CIOs are finding that their pricing models are more appropriate for long-term storage of metrics data. Just as with logs, metrics can be sent to locally virtualized servers or to remotely hosted cloud services.

Metrics measure the activity in your architecture, which means that during idle periods no new metrics will feed into your dashboard and show you that the system remains healthy. So, let’s add heartbeats!

Heartbeats

I always add a heartbeat feature to my designs and set up a scheduler to call a heartbeat generator service, which then calls other services and subsystems to have them do the same. All services and subsystems then call a separate heartbeat update service to report their health, resulting in a chain of metrics that provides an x-ray of the system’s overall readiness.

Adding heartbeats to the collection system could look like this:

Again, you will be adding more code to your existing services (and introducing new services as well). Existing services need to be retrofitted to handle received heartbeat calls. They need to recognize the incoming service request as a heartbeat, skip the normal logic and do something else. That something else is usually to pass the heartbeat on to whichever other services or subsystems they normally call and to call the new HeartbeatUpdate service to report their health.
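
A sketch of that retrofit; call_backend, report_health and process_collection are hypothetical stand-ins for the service’s normal downstream call, the new HeartbeatUpdate service and the existing business logic:

    def process_collection(request: dict) -> dict:
        """Hypothetical stand-in for the service's existing business logic."""
        return {"status": "OK"}

    def handle_request(request: dict, call_backend, report_health) -> dict:
        """The retrofit: recognize heartbeat requests and skip the normal logic."""
        if request.get("type") == "HEARTBEAT":
            # Pass the heartbeat on to whatever we normally call downstream...
            call_backend({"type": "HEARTBEAT", "chain": request["chain"]})
            # ...and report our own health to the HeartbeatUpdate service.
            report_health(service="CollectionService", chain=request["chain"], result="OK")
            return {"status": "OK"}
        return process_collection(request)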

Heartbeats that propagate through services and subsystems to check the health of the business functionality are often set to happen every 30 minutes. This gives a reasonably short reaction time in case certain parts have stopped working, and it doesn’t flood the metrics storage unnecessarily.

I often make heavy use of queues between client-facing services and asynchronous backend services, because it contributes to the scalability and availability qualities of the architecture. With such a design practice, current queue utilization tells much about the level of activity in the system. Therefore, I schedule heartbeats that count the number of messages in all the system’s queues every 5 minutes and generate a metric for each queue that isn’t empty.
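
A sketch of such a queue-counting heartbeat, assuming RabbitMQ accessed through pika; the queue names and the emit_metric callable are assumptions:

    import pika

    QUEUES = ["collection.requests", "observability.logs", "observability.metrics"]  # illustrative

    def queue_depth_heartbeat(emit_metric) -> None:
        """Count the messages waiting in each queue and emit a metric per non-empty queue."""
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        for queue in QUEUES:
            # passive=True only inspects the queue and fails if it doesn't exist.
            depth = channel.queue_declare(queue=queue, passive=True).method.message_count
            if depth > 0:
                emit_metric(Name="QueueDepth", Context=queue, Count=depth,
                            Client="HEARTBEAT", Result="OK")
        connection.close()

The scheduler then just needs to trigger queue_depth_heartbeat every 5 minutes.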

Surprisingly, this is often the type of metrics that product owners and SMEs find most interesting. It works really well in a dashboard that shows a constant bar chart of the counted messages and a pie chart showing the spread across the counted queues. It is visually satisfying to see how it builds up during peak hours, and gradually comes down once client systems become less active.

I have sometimes found it difficult to find a suitable scheduler in the companies I work for. Many of them still run legacy systems on mainframes, which traditionally have good schedulers running. But oftentimes these systems are categorized as ‘sunset’ technologies, and I’m therefore wary of relying on them for triggering heartbeats in my otherwise ‘modern’ microservices-based architecture.

It can be an uphill battle to have a scheduling mechanism established in the ‘strategic SOA platform’ in some organizations. Sometimes I have been able to get the guys in the operations department (who still monitor technical metrics from servers) to set up scheduled calls to my heartbeat service.

Dashboards

In the good old days, the dashboard was usually a desktop application running on a more or less decommissioned workstation with a small wall-mounted LCD screen. And since those makeshift setups were rarely maintained, they usually had to be rebooted every other morning by the first person coming into the office.

In some companies I have seen developers create makeshift dashboards feeding off of the raw business database tables to show current activity and error lists. These improvised setups are often thrown together to speed up testing and to better monitor the newly deployed system in the first few months of operation. But since they were never commissioned or sanctioned by the solution’s sponsor, they get abandoned and often stop working, at least optimally, after a while.

Now, if you send your metrics to systems like Splunk, elastic and Loggly, the dashboard is set up in the metrics system and is viewed as a webpage in the user’s browser.

The dashboard can show current and recent activity using bar and pie charts that are updated every 5 minutes. Service-oriented architectures lend themselves well to showing incoming traffic and backend processing side by side, the latter being based on counting the message queues between the client-facing services and their backend counterparts.

And a small ticker that lists encountered errors is also informative, even if you have designed your monitoring to automatically create support incidents when they occur.

Reports

If you design your metrics well, and retain them for a significant amount of time, you will find that you have tremendous flexibility in creating reports that convey the architecture’s performance and health over time.

I have learned that it’s difficult to predict exactly how such reports should be designed, and exactly which information they should convey, at least in the early phases of development.

But having access to all metrics from the system’s first day of operation means that you can develop the right reports later, when product owners and SMEs have more bandwidth to take them in.

Conclusion

Look at how the collection design grows in size and complexity as logging, metrics and heartbeats are added. The number of code lines in each service body easily doubles. The number of services easily doubles. And the solution generates more data, which requires more technology and adds to the long-term cost of ownership.

It’s no wonder that the ‘business’ frowns on your wanting to add observability qualities. It can be an uphill battle. But there are strong arguments in favor of doing it.

In ALL projects I have been involved in as a developer or an architect, where observability has not been built in from day one, it has been added later. Turns out the ‘business’ quickly tires of having to look inside the client or ERP systems to see what’s happening, and application maintenance staff quickly tires of not having any data to base their investigations on when trying to fix bugs.

Adding observability does come with a need for certain technologies, but most infrastructures already include them:

And observability greatly contributes to other design properties like auditability and operability, and helps you monitor the capacity, performance and availability qualities of your design.