Recoverability is one of those areas no one seems particularly interested in until the first breakdown occurs. Then everyone acts extremely surprised when it turns out to be difficult to salvage lost data and bring the solution back into operation.
There are rarely any formal requirements to back this design goal, so you may have to do battle to get it prioritized early enough in the design and build phases to lay the groundwork for good recoverability.
I recommend starting a list in this chapter from day one, pinpointing every conceivable threat to the system’s continuous operation, describing the potential fallout of each interruption, and outlining steps to recover.
This work can be time-consuming and will often require the involvement of the lead developers who know the underlying technologies best.
You will also often need to add extra code that implements mechanisms to support recovery activities, and to define design principles that steer the design and coding work towards this goal.
I have learned that widespread use of message passing via queues can improve the recoverability of my designs.
I also find that keeping an inventory of transactions initiated by external systems in a database table, and tracking each transaction's state (its position in its designated flow), helps when recovering from a crash.
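To make the inventory idea concrete, here is a minimal sketch in Python using the standard sqlite3 module. Everything specific in it is an assumption made for illustration: the table name transaction_inventory, its columns, the four-step FLOW, and the helper functions are all hypothetical, and a real solution would use its own database and flow definitions.

```python
import sqlite3

# Hypothetical flow: the ordered states a transaction passes through.
FLOW = ["RECEIVED", "VALIDATED", "FORWARDED", "COMPLETED"]


def open_inventory(path=":memory:"):
    """Open the database and create the (hypothetical) inventory table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS transaction_inventory (
               transaction_id TEXT PRIMARY KEY,
               source_system  TEXT NOT NULL,
               state          TEXT NOT NULL,
               updated_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def register(conn, transaction_id, source_system):
    """Record a transaction as soon as an external system initiates it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO transaction_inventory "
            "(transaction_id, source_system, state) VALUES (?, ?, ?)",
            (transaction_id, source_system, FLOW[0]),
        )


def advance(conn, transaction_id):
    """Move a transaction to the next state in its flow and return that state."""
    row = conn.execute(
        "SELECT state FROM transaction_inventory WHERE transaction_id = ?",
        (transaction_id,),
    ).fetchone()
    if row is None:
        raise KeyError(transaction_id)
    position = FLOW.index(row[0])
    if position == len(FLOW) - 1:
        return row[0]  # already in the final state, nothing to do
    new_state = FLOW[position + 1]
    with conn:
        conn.execute(
            "UPDATE transaction_inventory "
            "SET state = ?, updated_at = CURRENT_TIMESTAMP "
            "WHERE transaction_id = ?",
            (new_state, transaction_id),
        )
    return new_state


def unfinished(conn):
    """After a crash: list every transaction that never reached the final state."""
    return conn.execute(
        "SELECT transaction_id, state FROM transaction_inventory "
        "WHERE state != ? ORDER BY transaction_id",
        (FLOW[-1],),
    ).fetchall()
```

After a restart, calling unfinished() gives the recovery procedure a worklist: each row names a transaction and the last state it reached, so processing can resume from that point or the transaction can be reported back to the originating system.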