mon is a monitoring application that helps me achieve observability in my other apps. It also serves to demonstrate that concept of observability, which is a key design concern I always address in solution architectures.
My server hosting provider collects a few technical metrics:
- Number of requests to the server (as a whole and per subdomain)
- List of requested URLs and the returned HTTP status
- Disk usage
While not entirely irrelevant, this data doesn’t tell me much about what’s going on in the the apps I’ve deployed on the server. The rare HTTP errors returned don’t reveal errors in the apps, just connectivity errors and unrecognized URLs. And the data doesn’t reveal how any of the apps perform.
Therefore, I’m retrofitting my apps with observability features. They measure key activities as metrics and write them to the server along with log entries that tell me more about the measured activities. This helps me correlate the otherwise incomprehensible request logs with actual activity that I can trace to the source code of my apps.
In a client solution I’d typically recommend writing logs and metrics to ElasticSearch (ELK) but my hosting provider doesn’t support ELK, so I’ve developed mon and some simple C# classes to generate logs and metrics from my other apps to demonstrate the concept.
To see it working, toy around with uml, and then start mon to see what was logged and measured.
Design notes
While database technologies exist that easily allow pouring both logs and metrics into the same sink, my hosting provider doesn’t support them (my provider’s primary customer segment being website owners). Therefore, I’m targeting a MySQL instance on my provider’s database server farm using .NET Core EntityFramework classes.
The LogDb builds on DbContext, adding connection pooling and using either in-memory, SQL Server, MySQL or Sqlite as storage.
The LogEntity and MetricEntity classes define their respective table structures, and are also used as private state properties by the Log and Metric classes.
The Log and Metric classes are fluent builders of their respective entities. Both require the use of their static methods From, which ensures getting both properly initialized with mandatory values. From and other methods return the Log or Metric object, allowing me to chain method calls to fluently set other values as well.
The Write method is called last to save the Log og Metric to the database, which for metrics includes setting the Stop and Elapsed values as well.
Fields
Field | Type | Log | Metric | Description |
---|---|---|---|---|
Id | string | X | X | Assigned a new GUID in the From method. |
Time | DateTime | X | Assigned DateTime.Now in the From method. | |
Parent | string | X | Optional, defaults to null. Can be assigned the Id of a parent metric, making this metric a child metric. Can be used to measure parts of a larger whole, for example the processing of each item in a collection. | |
Start | DateTime | X | Assigned a value passed as an argument in the call to From. Typically, a DateTime variable (named start) is assigned DateTime.Now and passed in the call to From. | |
Stop | DateTime | X | Assigned a value in the Write method. | |
App | string | X | X | Assigned a value passed as an argument in the call to From. All logs and metrics from one application should specify the same well-recognized value. |
Where | string | X | X | Assigned a value passed as an argument in the call to From. Specify a value that helps pinpoint from where in the source code the log or metric i generated. It often makes sense to provide a name consisting of a class and method name. |
Session | string | X | X | Assigned a value passed as an argument in the call to From. Specify a value that remains the same throughout a single user (or client system) interaction. In web apps, generate and persist a session id (as a new GUID), and pass it to all logs and metrics. In API apps, issue a session id when the caller is authenticated (and is given an access token), and demand that subsequent API calls take the issued session id as an argument. |
Text | string | X | X | Optional, defaults to null. Provide a brief description of what’s being logged or measured. The text should make sense to normal day-to-day users but also help a support technician or developer remember what the app is doing where the log or metric is generated. |
Status | int | X | X | Defaults to 0, which means Ok. Call the appropriate status method for changing the status. Log offers Ok, Warning and Error methods. Metric offers Ok, Denied, Blocked and Failed methods. |
StatusText | string | X | X | Returns a user-friendly text representation of the Status value. The returned StatusText value matches the status method name called to set the Status value. |
Details | string | X | Optional, defaults to null. Pass any serializable object that contains details pertinent to the logged activity and its outcome. The object is stored as an easy-to-read JSON string. | |
Reason | string | X | Optional, defaults to null. Typically only given a user-friendly reason for a Denied, Blocked or Failed outcome of a measured activity. | |
Elapsed | long | X | The timespan between Start and Stop in milliseconds. Calculated in the Write method. | |
ElapsedText | string | X | Returns a user-friendly text representation of the Elapsed value. Long timespans are summarized in seconds, minutes, hours, days, months and years. | |
Count | int | X | Defaults to 0. Pass a relevant value to provide context to the outcome of the measured activity. For example, when processing a collection of items, assign Count the number of items. | |
Size | long | X | Defaults to 0. Pass a relevant value to provide context to the outcome of the measured activity. For example, when reading a file, specify the file size. | |
Retry | int | X | Defaults to 0. Increase and pass a retry attempt count every time a previous attempt was Blocked. Typically only used in larger service-oriented architectures where for example microservices encounters timeouts when calling other services and therefore park their request in a queue from where it can be retrieved and retried later. | |
Environment | string | X | X | The value of the ENVIRONMENT system environment variable. Is null if not set. Depending on the machine’s operating system, such an environment variable can be set in various different ways, including in configuration files read by the executing app, which generates the logs and metrics. |
Machine | string | X | X | The machine name given in the configuration of the machine’s operating system. Often equivalent the the machine’s name in the local network, which if pinged yields its IP-address. |
User | string | X | X | The user account under which the app that generates the logs and metrics runs. Typically a user account reserved for running the app on the server, but when running the app locally during development and testing it will be set to the logged-in user’s log-in name. |
Example code
The uml app’s Program.cs file starts like this:
using log;
var start = DateTime.Now;
const string mon = "uml";
const string where = "Program.Main";
var session = Guid.NewGuid().ToString();
LogDb.UseMySql( "insert connection string here" );
LogDb.Connection().Database.EnsureCreated();
Log.From( mon, where, session )
.Text( "Starting webapp" )
.Details( new
{
Platform = Environment.OSVersion.VersionString,
Processors = Environment.ProcessorCount,
Environment.ProcessId,
ThreadId = Environment.CurrentManagedThreadId
} )
.Write();
The above code saves the program’s start time for later, and logs that it has started. I’m sticking an anonymous object in the Details method to log a little information about the server.
Before starting the web app the following code registers handlers for the ApplicationStarted and ApplicationStopped events:
app.Lifetime.ApplicationStarted.Register( () =>
{
Log.From( mon, where, session )
.Text( "Webapp started" )
.Write();
Metric.From( mon, where, session, start )
.Text( "Webapp startup" )
.Write();
} );
app.Lifetime.ApplicationStopped.Register( () =>
{
Log.From( mon, where, session )
.Text( "Webapp stopped" )
.Write();
Metric.From( mon, where, session, start )
.Text( "Webapp lifetime" )
.Write();
} );
app.Run();
The ApplicationStarted handler is called right before the apps starts taking HTTP calls from users’ browsers. So emitting a metric here measures how much time is spent on opening the connection to the database and loading data.
If the web app is shut down (stopped) properly, the ApplicationStopped handler is called right before the program exits. I’m using this to emit a metric, which records the app’s up-time. This helps me learn how often my hosting provider takes down my server, and if it happens at a convenient time of the day.
The uml app’s JavaScript starts by generating a session id, which is persisted in the session cookie to survive browser refreshes:
function onLoad() {
session = window.sessionStorage.getItem( "session" );
if( session == null )
{
session = uuidv4();
window.sessionStorage.setItem( "session", session );
}
...
A convenient JavaScript function helps write log messages to an API controller:
function log( message ) {
fetch(location.href + "api/Log", {
method: "POST",
headers: {
"Accept": "application/json",
"Content-Type": "application/json"
},
body: JSON.stringify( { session : session, message : message } )
} )
.finally();
}
The C# API controller uses the C# log and metric classes:
[ Route( "api/Log" ) ]
[ ApiController ]
public class LogController : ControllerBase
{
public class Args
{
public string session { get; set; }
public string message { get; set; }
}
[HttpPost]
public IActionResult OnPost( Args args )
{
Log.From( "uml", "LogController.OnPost", args.session )
.Text( args.message )
.Write();
return Ok();
}
}
uml only measures one operation, namely the call to the publicly exhibited PlantUML rendering service:
[HttpPost]
public IActionResult OnPost( Args args )
{
var start = DateTime.Now;
var log = Log.From( "uml", "RenderController.OnPost", args.session )
.Text( "Render image button clicked" );
var metric = Metric.From( "uml", "RenderController.OnPost", args.session, start )
.Text( "PlantUML Renderer service called" )
.Count( args.text.Length );
try
{
var bytes = Renderer.Render( args.text, OutputFormat.Png );
log.Write();
metric.Size( bytes.Length )
.Write();
return Ok( Convert.ToBase64String( bytes ) );
}
catch( Exception e )
{
log.Warning()
.Details( e )
.Write();
metric.Failed()
.Reason( e.Message )
.Write();
return BadRequest();
}
}
Here’s an example from the erp application using the metric object to write log entities, which reuses the values from the Metric.From call:
// Register new user by creating email and user entities in the database.
public static Result RegisterUser( string sessionId, string name, string email, string password )
{
// Start metric and log the service request.
var metric = Metric.From( Program.Erp, "AccessService.RegisterUser", sessionId );
metric.Log().Details( new { name, email, password } ).Write();
// Assume success.
var result = Result.Ok;
try
{
// Pick a database connection from the connection pool.
using var connection = ErpDb.Connection();
// Create a user entity encrypting the password.
var userEntity = new UserEntity { Id = Guid.NewGuid().ToString(), Name = name, HashedPassword = HashPassword( password ) };
connection.UserEntities.Add( userEntity );
// Create an email entity with a reference to the newly created user.
var emailEntity = new EmailEntity { Id = Guid.NewGuid().ToString(), Email = email, UserId = userEntity.Id };
connection.EmailEntities.Add( emailEntity );
// Commit the entities to the database.
connection.SaveChanges();
// Log the success and the id's assigned to the newly created user and email entities.
metric.Log().Details( new { UserId = userEntity.Id, EmailId = emailEntity.Id } ).Text( "User and email created" ).Write();
}
catch ( Exception e )
{
// Log failure.
result = Result.EmailAlreadyRegistered;
metric.Failed().Reason( result.ToString() ).Log().Text( "Failed to create user and/or email" ).Details( e ).Error().Write();
}
// Stop and write the metric, and return the result.
metric.Write();
return result;
}
This demonstrates how to declare Log and Metric variables, and not Write them right away. This holds them open for further modification. For example, after a successful rendering attempt the size of the returned image is added to the Metric. In case the rendering fails, both the Log and Metric statuses are updated, the exception is passed in its entirety to the Log object and its exception message passed to the Metric object’s Reason method.
Design principles for logging
My experience is that developers are very bad at putting themselves in the users’ shoes, and their first take on logging is insufficient. They either log too little (and never metrics), or the log data is full of cryptic codes and auto-generated texts that only makes sense to the developers themselves.
I have tried – and failed – to foresee which logs and metrics would be appropriate, so I’ve given up trying to specify them in formal requirements and use case specifications.
In stead I recommend enforcing the following design principles. Then get into the habit of observing the logs and metrics feed when developers demonstrate newly build features. After each demonstration, walk through the captured logs and metrics, and suggest/demand changes – and follow up afterwards to ensure they are made.
- Avoid sensitive information in metrics so you don’t need to restrict access to them. The more users who follow the metrics feed the more users can actively suggest improvements and detect lurking problems that can be addressed before they lead to disruption.
- Keep metrics and logs separate so you can restrict access to logs without having to impose the same restrictions on access to metrics. This also helps differentiating retention. You want to keep metrics much longer than logs.
- Assign session ids to user sessions, and reference the session ids in all logs and metrics generated throughout each session. This helps analyze the user experience, and pinpoint logs and metrics leading up to a support incident.
- Log all activity – also measured activity that emits aa metric. It shouldn’t be necessary to switch between viewing logs and metrics to get the full picture of what has happened in the system. Metrics supplement logs with measurements that help observe capacity and performance over time.
- Measure activities that read or write files. If file sizes suddenly increase significantly, or if disks suddenly become slower, it’s good to have metrics that reveal the reason for the increase in elapsed time.
- Measure activities that iterate over potentially large collections of items. If item counts start to increase it’s good to be able to account for the increase in elapsed time.
- Measure the processing of each item in a collection with a metric that refers to the parent metric measuring the entire collection processing. This helps root out if there are certain types of items that takes significantly longer to process.
- Measure activities that involve calling services over network connections. Connections may suddenly become unstable, services may suddenly start responding much slower than before, and it’s good to be able to account for how how of the overall elapsed time was down to waiting for the responses from called services.
- Aim for a metrics feed that mirror the users’ perception of the flow of business events. That way they are more likely to detect anomalies in flows, which they can address themselves without having to raise a support incident.