The outages at RBS, TSB and Visa left millions of people unable to deposit their paychecks, pay their bills, acquire new loans and more. As a result, the House of Commons’ Treasury Select Committee (TSC) began an investigation of the U.K. finance industry and found the “current level of financial services IT failures is unacceptable.” Following this, the Bank of England (BoE), Prudential Regulation Authority (PRA) and Financial Conduct Authority (FCA) decided to take action and set a standard for operational resiliency.
While policies can often feel burdensome and detached from reality, these guidelines are reasonable steps that a company in any industry can take to improve the resilience of its software systems.
The BoE standard breaks down into these five steps:
- Identify critical business services based on those that end users rely on most.
- Set a tolerance level for how much outage time is acceptable for each service during an incident, based on the utility the service provides.
- Test whether the firm can stay within that tolerance under real-life scenarios.
- Involve management in the reporting and sign-off of these thresholds and tests.
- Take action to improve resiliency against the different scenarios where feasible.
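To make the tolerance and testing steps concrete, here is a minimal Python sketch of how a tolerance might be recorded and checked against test results. Everything in it is an assumption for illustration: the `ImpactTolerance` class, the `within_tolerance` helper, the scenario names and the durations are hypothetical and do not come from the BoE standard.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ImpactTolerance:
    """Maximum acceptable outage for one business service (hypothetical model)."""
    service: str
    max_outage: timedelta


def within_tolerance(tolerance: ImpactTolerance, observed_outage: timedelta) -> bool:
    """Return True if the outage observed during a test stayed within tolerance."""
    return observed_outage <= tolerance.max_outage


# Illustrative value only; a real tolerance is set and signed off by the firm.
payments = ImpactTolerance(service="payments", max_outage=timedelta(hours=2))

# Outage durations measured during two hypothetical scenario tests.
results = {
    "database failover": timedelta(minutes=45),
    "data center loss": timedelta(hours=3),
}

# Scenarios that exceeded the tolerance and therefore need remediation.
breaches = [
    scenario
    for scenario, outage in results.items()
    if not within_tolerance(payments, outage)
]
```

In this sketch, any scenario that lands in `breaches` is a candidate for the final step: taking action to improve resiliency where feasible.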
Following this process aligns with best practices in architecting resilient systems. Let’s break each of these steps down and discuss how chaos engineering can help.
Identify critical business services
The operational resilience framework recommends focusing on the services that serve external customers. While internal applications are important for productivity, this customer-first mentality is sound advice for deciding where to start reliability efforts. It is ultimately up to the business to weigh the criticality of the different services it offers, but the recommended priorities are the services customers need to make payments, retrieve payments, invest or insure against risks.
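As a rough illustration of that prioritization, the sketch below filters a service catalog down to customer-facing services that support one of the recommended functions. The `Service` class, the function labels and the catalog entries are all hypothetical; a real inventory would come from the firm's own service mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One entry in a hypothetical service catalog."""
    name: str
    customer_facing: bool
    functions: frozenset


# Assumed labels for the priorities named above: making payments,
# retrieving payments, investing and insuring against risks.
PRIORITY_FUNCTIONS = frozenset(
    {"make_payment", "retrieve_payment", "invest", "insure"}
)


def is_priority(service: Service) -> bool:
    """Customer-facing and supports at least one priority function."""
    return service.customer_facing and bool(service.functions & PRIORITY_FUNCTIONS)


def starting_points(services: list) -> list:
    """Names of the services to focus reliability efforts on first."""
    return [s.name for s in services if is_priority(s)]


# Hypothetical catalog for illustration only.
catalog = [
    Service("card-payments", True, frozenset({"make_payment"})),
    Service("hr-portal", False, frozenset({"payroll"})),
    Service("online-trading", True, frozenset({"invest"})),
    Service("marketing-site", True, frozenset({"content"})),
]

critical = starting_points(catalog)
```

A simple filter like this is only a starting point; weighing criticality between the services that pass it remains a business decision.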