Observability

Along with non-repudiation, observability(*) is one of those often-overlooked design goals. While stakeholders can usually buy into ensuring non-repudiation, you can expect keen resistance to your ambitions for observability.

(*) A note about linking to the Wikipedia definition: normally, such articles discuss IT-related definitions, but in this case the definition comes from mathematics. It fits really well nonetheless: ‘… observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs’.

The challenge is two-fold:

  • Observability in itself doesn’t directly support the day-to-day business, so it’s difficult to convince SMEs, BAs and PMs that it’s worthwhile.
  • Achieving observability requires a lot of extra coding and supporting technologies, all of which increase the development time and costs as well as the long-term cost of ownership.

So, why even bother? Because without measurability, no one will know how the solution holds up with regard to capacity and performance once it’s in production.

Instead, the solution hits its capacity and performance ceiling without warning, leading to breakdowns that are costly to recover from.

Establishing good observability enables the solution’s product owner to analyze metrics, spotting trends that reveal future problems, which can then be addressed well before the solution breaks down.
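
As a rough illustration of spotting a trend before it becomes a breakdown, the sketch below fits a straight line to a series of daily measurements and projects when that line would reach a ceiling. The function name, the sample values and the ceiling are all assumptions made for this example, and a linear fit is only one simple way to look at a trend.

```python
def days_until_ceiling(daily_values, ceiling):
    """Fit a least-squares line to daily values and project when it reaches
    the ceiling. Returns None if there is too little data or no upward trend."""
    n = len(daily_values)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_values))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or improving; no projected crossing
    intercept = mean_y - slope * mean_x
    crossing_day = (ceiling - intercept) / slope
    # Express the result relative to the most recent sample (day n - 1).
    return max(0.0, crossing_day - (n - 1))


# Hypothetical daily 95th-percentile durations in milliseconds.
p95_ms = [100, 110, 120, 130, 140]
remaining = days_until_ceiling(p95_ms, ceiling=200)
print(remaining)
```

With these made-up numbers the duration grows by 10 ms per day, so the projection gives the product owner a concrete number of days in which to act.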

The value of maintaining uninterrupted operation outweighs the cost of adding observability to the design.

Luckily, observability also contributes to improved testability and operability, which can reduce both development time and the time spent on normal operation, lowering the cost of ownership.

Server monitoring tools only scratch the surface, and relying on raw logging data is complex and requires maintenance whenever log records change structure due to code changes in maintenance releases.

The only way to achieve real, meaningful observability is to build in code that records the start and stop times of all activities, and writes them as metrics to a database for later analysis.
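
A minimal sketch of what such built-in code could look like, assuming Python and an SQLite database. The table layout, the `measure` helper and the `succeeded` flag are assumptions made for illustration, not a prescribed design.

```python
import sqlite3
import time
from contextlib import contextmanager


def open_metrics_db(path=":memory:"):
    """Open the metrics store. The schema is hypothetical: one row per activity."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "name TEXT NOT NULL, started_at REAL NOT NULL, "
        "stopped_at REAL NOT NULL, succeeded INTEGER NOT NULL)"
    )
    return conn


@contextmanager
def measure(conn, name):
    """Record the start and stop time of one activity, even if it fails."""
    started_at = time.time()  # wall-clock timestamps, seconds since the epoch
    succeeded = True
    try:
        yield
    except Exception:
        succeeded = False
        raise
    finally:
        stopped_at = time.time()
        conn.execute(
            "INSERT INTO metrics VALUES (?, ?, ?, ?)",
            (name, started_at, stopped_at, int(succeeded)),
        )
        conn.commit()


conn = open_metrics_db()
with measure(conn, "CreateCustomer"):
    time.sleep(0.01)  # stand-in for the real service body

rows = conn.execute(
    "SELECT name, stopped_at - started_at, succeeded FROM metrics"
).fetchall()
```

Wrapping the activity in a context manager keeps the timing code out of the business logic, and the `finally` clause ensures that failed activities are recorded too.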

Using this technique, you can match the metrics against your capacity and performance goals.
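
One way to do that matching is sketched below, assuming the goals are expressed as a maximum 95th-percentile duration per metric. The goal values, the sample durations and the choice of percentile are hypothetical.

```python
import math

# Hypothetical performance goals: 95th-percentile duration in seconds.
GOALS_P95_SECONDS = {"CreateCustomer": 0.200, "Publish": 0.050}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_goals(durations_by_metric, goals):
    """Return {metric: (p95, goal, within_goal)} for each metric that has data."""
    report = {}
    for name, goal in goals.items():
        samples = durations_by_metric.get(name)
        if not samples:
            continue  # no measurements yet, so nothing to compare
        p95 = percentile(samples, 95)
        report[name] = (p95, goal, p95 <= goal)
    return report


# Made-up durations, as they might be read back from the metrics database.
durations = {
    "CreateCustomer": [0.120, 0.150, 0.180, 0.140, 0.130],
    "Publish": [0.030, 0.045, 0.080, 0.040],
}
report = check_goals(durations, GOALS_P95_SECONDS)
for name, (p95, goal, ok) in report.items():
    print(f"{name}: p95={p95:.3f}s goal={goal:.3f}s {'OK' if ok else 'EXCEEDED'}")
```

In this made-up data set, CreateCustomer stays within its goal while Publish exceeds it.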

I recommend that you create a list of metrics with the following columns:

Metric                 Description
CreateCustomer         Outer metric measuring the time it takes to validate a client system request, and queue up the request for asynchronous backend processing.
CreateCustomerProcess  Outer metric measuring the time it takes to create a customer.
Publish                Inner metric measuring the time it takes to call the standard publisher service.

I distinguish between outer and inner metrics. Outer metrics encompass the entire service body. Inner metrics are used to measure calls made to other services or subsystems that aren’t emitting their own metrics.
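
The distinction can be sketched as two nested measurements. Everything here is an assumption for illustration: the `measure` helper, the in-memory `recorded` list standing in for the metrics database, and the placeholder `publish` function.

```python
import time
from contextlib import contextmanager

recorded = []  # stand-in for the metrics database


@contextmanager
def measure(name, kind):
    """Time the wrapped block and record it as an outer or inner metric."""
    started = time.perf_counter()
    try:
        yield
    finally:
        recorded.append(
            {"name": name, "kind": kind, "seconds": time.perf_counter() - started}
        )


def publish(event):
    """Hypothetical downstream service that emits no metrics of its own."""
    time.sleep(0.005)


def create_customer_process(customer):
    # Outer metric: wraps the entire service body.
    with measure("CreateCustomerProcess", "outer"):
        customer_id = len(customer["name"])  # placeholder for the real work
        # Inner metric: wraps only the call to the other service.
        with measure("Publish", "inner"):
            publish({"type": "CustomerCreated", "id": customer_id})
        return customer_id


create_customer_process({"name": "Ada"})
```

Because the inner block finishes first, its metric is recorded first, and the outer duration always includes the inner one. Subtracting the two shows how much time the service spends on its own work.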

To ensure that some metrics are emitted even during periods of inactivity, scheduled heartbeats can activate the system at regular intervals.
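
A scheduled heartbeat could be sketched as a background thread that triggers a small activity at a fixed interval. The function names and the very short interval are assumptions chosen so the example finishes quickly; a real heartbeat would more likely run every few minutes and pass a no-op request through the measured code path.

```python
import threading
import time


def start_heartbeat(interval_seconds, activity, stop_event):
    """Run `activity` every `interval_seconds` on a background thread until stopped."""

    def loop():
        # Event.wait returns False on timeout and True once stop_event is set.
        while not stop_event.wait(interval_seconds):
            activity()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


heartbeats = []


def heartbeat_activity():
    # Hypothetical no-op activity; in a real system this would go through
    # the same measured code as ordinary requests so that a metric is emitted.
    heartbeats.append(time.time())


stop = threading.Event()
worker = start_heartbeat(0.02, heartbeat_activity, stop)
time.sleep(0.15)  # let a few heartbeats fire
stop.set()
worker.join()
```

Using `Event.wait` as the timer means the loop stops promptly when the event is set, instead of sleeping through a full interval first.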

The live stream of metrics emitted by the active parts of your system can also feed a live dashboard that gives SMEs insight into the solution’s current activity and health.
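
As a sketch of the aggregation a dashboard might sit on top of, the class below keeps only the metrics from a recent time window and summarizes them per metric name. The class name, the window length and the sample values are assumptions; rendering the dashboard itself is out of scope here.

```python
from collections import deque


class RollingWindow:
    """Keep metric samples from the last `window_seconds` for a live view."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, name, duration_seconds)

    def add(self, timestamp, name, duration_seconds):
        self.samples.append((timestamp, name, duration_seconds))
        self._evict(timestamp)

    def _evict(self, now):
        # Samples arrive in time order, so old ones are always at the left.
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()

    def summary(self, now):
        """Return {name: {"count": ..., "avg_seconds": ...}} for the window."""
        self._evict(now)
        totals = {}
        for _, name, duration in self.samples:
            count, total = totals.get(name, (0, 0.0))
            totals[name] = (count + 1, total + duration)
        return {
            name: {"count": count, "avg_seconds": total / count}
            for name, (count, total) in totals.items()
        }


window = RollingWindow(window_seconds=60)
window.add(0, "CreateCustomer", 0.10)   # falls out of the window by t=70
window.add(30, "CreateCustomer", 0.20)
window.add(70, "Publish", 0.04)
snapshot = window.summary(now=70)
```

A dashboard would call `summary` on a timer and display the counts and averages, so a sudden drop in activity or a rise in duration is visible right away.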