All organizations face an increasing array of disruptions that can threaten their ability to operate effectively. From cyberattacks to natural disasters, the risks are manifold, and the consequences of service interruptions can be severe. This is where operational resilience comes into play, particularly within technology operations. This article explores operational resilience, why it matters, and how organizations can manage it effectively.

What is Operational Resilience?

Operational resilience goes beyond simply recovering from incidents. It is a proactive approach to managing risks and ensuring the continuity of critical services, even when faced with significant disruptions. This concept encompasses a multifaceted framework designed to anticipate potential disruptions, implement preventive measures, respond effectively to incidents, and recover swiftly to a state of normal operation. An effective operational resilience strategy involves four main areas.

Anticipation

Anticipation involves identifying and assessing potential risks that could disrupt operations, both internal (such as system failures or human error) and external (such as natural disasters or cyberattacks). By anticipating these risks, organizations can develop plans to mitigate their impact.

Prevention

Once potential risks have been identified, organizations can implement preventive measures to reduce the likelihood of disruptions. This might include implementing redundant systems, conducting regular maintenance, and providing employee training.

Response

Despite preventive measures, disruptions can still occur. A strong operational resilience strategy includes plans for responding effectively to incidents. This might involve activating backup systems, communicating with stakeholders, and mobilizing resources to address the disruption.

Recovery

The final stage of operational resilience is recovery, which involves returning to normal operations as quickly and efficiently as possible. This might include restoring systems, repairing damage, and learning from the incident to prevent future disruptions.

The Growing Importance of Operational Resilience

A multitude of factors have contributed to the increase in board-level and stakeholder attention to operational resilience. As organizations expand their operations across borders and time zones through globalization, they become exposed to a wider array of potential disruptions, such as political instability, economic fluctuations, and natural disasters. This heightened vulnerability has made it imperative for organizations to prioritize operational resilience to ensure they can withstand and recover from such events.

Furthermore, the pervasive reliance on technology in virtually every aspect of business operations has amplified the importance of operational resilience. Any disruption to critical technology systems can have a cascading effect, leading to significant financial losses, reputational damage, and operational paralysis. Consequently, organizations must ensure the resilience of their technology infrastructure to maintain uninterrupted business operations and mitigate the risks associated with technology failures.

In addition to these internal and external pressures, regulatory bodies increasingly require organizations to demonstrate a high degree of operational resilience, not only within their industries but also across supply chains where concentration risk may occur. This regulatory scrutiny has further compelled boards to prioritize operational resilience and implement robust frameworks to comply with regulatory requirements and avoid potential penalties.

Key Aspects of Operational Resilience

Operational resilience requires a proactive approach to risk management, rather than simply reacting to disruptions. Organizations need to anticipate potential disruptions before they occur and take steps to mitigate their impact. This can be achieved by leveraging existing risk management frameworks instead of creating new and separate ones, allowing for the integration of operational resilience into the overall risk management strategy and enhancing existing processes.

Adaptability is also key to operational resilience. The ability to adapt to changing environments, emerging threats, and new technologies is essential for organizations to remain resilient in the face of an ever-changing landscape. This involves not only adapting to internal changes but also being prepared for external disruptions that may arise. By remaining adaptable and implementing proactive risk management strategies, organizations can position themselves to respond effectively to any disruptions that may occur, ensuring the continuity of critical business services and maintaining a competitive advantage.

Operational Resilience Frameworks

To effectively manage operational resilience, organizations need to establish frameworks that define how they will identify, assess, and manage risks. These frameworks need to clearly define what operational resilience means for the organization and its core attributes; establish clear roles and responsibilities for operational resilience; and assign ownership of operational resilience to specific individuals or teams.

Organizational resilience is the ability to adapt to changing environments. Operational resilience, specifically, is the capacity to deliver critical operations during disruptions. Two key definitions that need to be established and clearly understood by everyone involved are Critical Business Services and Impact Tolerance.

Critical Business Services (CBS)

Critical Business Services are those core functions and offerings that directly contribute to an organization’s ability to generate revenue, maintain customer relationships, and uphold its reputation. These services are so critical that any disruption or degradation could lead to substantial financial losses, regulatory penalties, reputational damage, or even threaten the organization’s continued existence.

Identifying and prioritizing CBS is a fundamental step in operational resilience planning. This involves a thorough assessment of all business services to determine their potential impact on the organization in the event of disruption. Examples of CBS vary with the industry and the specific business model. For a financial institution, payment processing, customer account management, and trade execution might be considered CBS. In a manufacturing company, supply chain management, production, and order fulfillment could be classified as CBS. In a technology company, software development, cloud services, and customer support might be deemed CBS.

CBS are often interconnected and dependent on other business services, as well as underlying technology infrastructure and third-party providers. Understanding these dependencies is crucial for effective operational resilience management.
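As a rough illustration of dependency mapping, the sketch below walks a small dependency map to list everything a given CBS relies on, directly or indirectly. The service names and the DEPENDENCIES structure are hypothetical examples, not taken from any particular organization or tool.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the items in its list.
DEPENDENCIES = {
    "payment processing": ["core banking platform", "card network gateway"],
    "core banking platform": ["primary database", "identity service"],
    "card network gateway": ["third-party acquirer"],
    "identity service": ["primary database"],
}


def transitive_dependencies(service, dependencies):
    """Return every direct and indirect dependency of `service`."""
    seen = set()
    queue = deque(dependencies.get(service, []))
    while queue:
        item = queue.popleft()
        if item in seen:
            continue  # already visited; also protects against circular dependencies
        seen.add(item)
        queue.extend(dependencies.get(item, []))
    return seen


deps = transitive_dependencies("payment processing", DEPENDENCIES)
print(sorted(deps))
```

A listing like this makes shared components visible: here the primary database sits beneath two separate branches, which is the kind of concentration a resilience review would want to examine.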

Impact Tolerance

Impact Tolerance refers to the maximum level of disruption or degradation that a CBS can withstand before causing unacceptable harm to the organization. It represents the organization’s ability to absorb and recover from operational disruptions without experiencing significant negative consequences.

Impact Tolerance can be measured and assessed in various ways, depending on the organization’s risk appetite. Common metrics include Recovery Time Objective (RTO), Recovery Point Objective (RPO), Maximum Tolerable Downtime (MTD), and Maximum Tolerable Data Loss (MTDL).
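To make two of these metrics concrete, the following minimal sketch checks a single incident against an RTO and an RPO. The ImpactTolerance class, the breaches function, and the figures used are illustrative assumptions; real tolerances would come from the organization's own risk appetite.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ImpactTolerance:
    rto: timedelta  # Recovery Time Objective: longest acceptable time to restore service
    rpo: timedelta  # Recovery Point Objective: longest acceptable window of lost data


def breaches(tolerance, outage_start, service_restored, last_good_backup):
    """Return the names of the objectives that an incident exceeded."""
    exceeded = []
    if service_restored - outage_start > tolerance.rto:
        exceeded.append("RTO")
    if outage_start - last_good_backup > tolerance.rpo:
        exceeded.append("RPO")
    return exceeded


# Hypothetical tolerance and incident timings.
tolerance = ImpactTolerance(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
start = datetime(2024, 1, 10, 9, 0)
result = breaches(
    tolerance,
    outage_start=start,
    service_restored=start + timedelta(hours=5),
    last_good_backup=start - timedelta(minutes=10),
)
print(result)  # ['RTO']
```

In this example the service took five hours to restore against a four-hour RTO, while the most recent backup was ten minutes old and so stayed within the fifteen-minute RPO.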

Several factors can influence the Impact Tolerance, including the criticality of the service, the availability of redundant systems or backup procedures, the effectiveness of incident response and recovery plans, and the organization’s overall risk management strategy.

By understanding the Impact Tolerance of each CBS, organizations can develop targeted strategies to enhance their resilience and minimize the potential impact of operational disruptions.

Governance and Ownership

Effective governance and ownership play a crucial role in achieving operational resilience. Boards hold the ultimate responsibility for operational resilience within the organization. Their duties include approving strategies, identifying critical business services, and setting impact tolerances. Meanwhile, Senior Management is tasked with implementing the operational resilience framework and overseeing its effectiveness.

Governance structures must be designed to address key elements of operational resilience. This includes service identification, impact tolerance, ongoing monitoring, and reporting. If they exist within the organization, Operational Risk Management teams serve a vital function by supporting the Board and overseeing business managers in their resilience efforts. This collaborative approach ensures that all levels of the organization are aligned and actively working towards maintaining operational resilience.

Integrating Operational Resilience into Risk Management

Operational resilience needs to be integrated into existing risk management frameworks rather than treated as a separate initiative to make it fully effective. This approach embeds resilience into the overall risk management strategy of the organization and ensures that it is not a silo that is disconnected from other operational risk activities.

Integrating operational resilience into existing risk management frameworks offers several benefits. It avoids complexity by preventing the creation of redundant processes and systems, which can lead to confusion and inefficiency. Additionally, integration enhances adaptability, allowing for swift and effective responses to environmental changes. By leveraging existing resources and expertise, integration also streamlines efforts, making the most of the organization’s existing risk management capabilities, knowledge base and tools.

Operational Resilience Process Overview

The process for identifying Critical Business Services (CBS) starts with a clear definition, for example: a business service provided to an external end user, deemed important if its disruption would threaten the organization's viability or cause customer harm. While the Board is responsible for identifying CBS (or may delegate that task), the Chief Operations Officer and Chief Risk Officer are responsible for conducting assessments.

Collaboration between risk, operational, and business areas is essential. The process should start by listing services and mapping processes to identify key operational risks. Sources such as risk registers, critical asset lists, and business continuity plans can be used to identify services.

Measuring service impact should leverage metrics aligned with strategic objectives and risk appetite. Examples could include client harm, volume, financial impact, and regulatory impact. The focus should be on using fewer, better measures rather than on quantity: a long list of metrics is costly to collect and report on, and it can obscure which impact areas are really the most material.
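One simple way to combine a small set of measures is a weighted score, sketched below. The 1 to 5 rating scale, the weights, and the two services are hypothetical; an organization would choose its own measures and weightings in line with its risk appetite.

```python
# Hypothetical weights for the four example measures; they sum to 1.0.
WEIGHTS = {"client_harm": 0.4, "financial": 0.3, "regulatory": 0.2, "volume": 0.1}


def impact_score(ratings, weights=WEIGHTS):
    """Weighted average of ratings; raises KeyError if a measure is missing."""
    return sum(weights[measure] * ratings[measure] for measure in weights)


# Hypothetical ratings on a 1 (low) to 5 (high) scale.
services = {
    "payment processing": {"client_harm": 5, "financial": 5, "regulatory": 4, "volume": 5},
    "internal reporting": {"client_harm": 1, "financial": 2, "regulatory": 2, "volume": 1},
}

ranked = sorted(services, key=lambda name: impact_score(services[name]), reverse=True)
for name in ranked:
    print(f"{name}: {impact_score(services[name]):.2f}")
```

Keeping the measure list this short makes it easy to see why one service outranks another, which is harder when dozens of metrics feed the score.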

Setting impact tolerances involves defining acceptable levels of service disruption, expressed in service outage time and other metrics. The Board sets these limits as part of its normal mandate for establishing risk appetite levels, or it may delegate the task to a Board committee that establishes, oversees, and reports on them. Techniques such as historical data analysis and benchmarking against industry standards can be used.
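As one possible form of historical data analysis, the sketch below shows what share of past outages would have exceeded each candidate tolerance. The outage durations and candidate values are made-up figures, and breach_rate is a hypothetical helper, not a standard method.

```python
def breach_rate(outage_minutes, proposed_tolerance_minutes):
    """Fraction of past outages that lasted longer than the proposed tolerance."""
    if not outage_minutes:
        return 0.0
    breaches = sum(1 for minutes in outage_minutes if minutes > proposed_tolerance_minutes)
    return breaches / len(outage_minutes)


# Hypothetical outage durations (in minutes) for one service.
history = [12, 45, 30, 240, 95, 20, 60, 15]

for candidate in (60, 120, 240):
    print(f"tolerance {candidate} min: {breach_rate(history, candidate):.1%} of past outages exceeded it")
```

A tolerance that past incidents would have breached frequently signals either that the limit is unrealistic or that recovery capability needs investment; the history alone cannot say which.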

Scenario testing checks the organization's ability to respond to operational risk events. It should be proportional to the organization's size and complexity and should use realistic assumptions. Auditors and regulators increasingly expect to see this testing and the resulting reports as evidence.

Operational Resilience in Technology Operations

Technology is the critical enabler for many business services, so its resilience is paramount. Ensuring operational resilience within technology operations calls for specific practices, each of which plays a part in setting impact tolerances and in the strategies for each critical business service.

Robust Infrastructure

Building a resilient technology infrastructure involves creating redundancies and failover systems. This means that if one system fails, another can take over seamlessly, minimizing downtime. Disaster recovery plans are essential for restoring operations after a major disruption.
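The failover idea can be sketched in a few lines: try the primary system, and if it fails, hand the request to the next one. The function call_with_failover and the two stand-in handlers are hypothetical; production failover usually happens at the load balancer, DNS, or database layer rather than in application code like this.

```python
def call_with_failover(endpoints):
    """Try each (name, handler) pair in order; return (name, result) from the first success."""
    errors = []
    for name, handler in endpoints:
        try:
            return name, handler()
        except Exception as exc:  # deliberately broad in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))


def primary():
    # Stand-in for a call to the primary site, simulated here as being down.
    raise ConnectionError("primary site unreachable")


def secondary():
    # Stand-in for the same call served by a redundant site.
    return "order accepted"


used, response = call_with_failover([("primary", primary), ("secondary", secondary)])
print(used, response)  # secondary order accepted
```

If every endpoint fails, the function raises an error that lists each failure, which is the point at which a disaster recovery plan would take over.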

Cybersecurity Measures

Cybersecurity is a critical component of operational resilience. Organizations must protect their technology systems from cyberattacks that can disrupt operations. This includes regular vulnerability assessments, penetration testing, and implementing strong security controls.

Data Backup and Recovery

Data is the lifeblood of most organizations. Regular data backups and a reliable recovery process are crucial for restoring data and systems in case of failures or attacks. Organizations should have a clear strategy for data backup, storage, and recovery.
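A recovery process is only reliable if restored data matches what was backed up. The sketch below, which assumes a checksum is recorded at backup time, compares SHA-256 digests after a restore. The in-memory byte strings stand in for real backup files, and verify_restore is a hypothetical helper.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_restore(recorded_digest: str, restored_data: bytes) -> bool:
    """True if the restored data matches the digest recorded when the backup was taken."""
    return sha256_of(restored_data) == recorded_digest


# Hypothetical data: record a digest at backup time, then check two restores.
original = b"customer ledger, end of day"
recorded = sha256_of(original)

print(verify_restore(recorded, b"customer ledger, end of day"))  # True
print(verify_restore(recorded, b"customer ledger, end of dax"))  # False
```

A check like this detects corruption or truncation, but it does not prove the backup is recent enough; that is a separate comparison against the service's recovery point objective.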

Incident Response Plan

A detailed incident response plan outlines how the organization will respond to technology disruptions. It should include clear roles, responsibilities, and communication protocols. Regular testing of the incident response plan is essential.

Monitoring and Alerting

Comprehensive monitoring tools can detect anomalies and potential disruptions before they cause significant problems. Alerts should be set up to notify the appropriate personnel of any issues.
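As a minimal sketch of anomaly detection, the code below flags any sample that sits far outside the range of the samples just before it. The window size, the threshold of three standard deviations, and the latency figures are illustrative assumptions; commercial monitoring tools use more sophisticated methods.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples more than `threshold` standard deviations
    away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one spike.
latency_ms = [101, 99, 100, 102, 98, 100, 250, 101]
flagged = find_anomalies(latency_ms)
print(flagged)  # [6]
```

In practice each flagged index would trigger an alert routed to the on-call team. Note that the spike itself inflates the baseline for the samples that follow it, which is one reason real tools use more robust statistics.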

Change Management

A controlled change management process minimizes risks associated with technology updates and deployments. Changes should be planned, tested, and implemented carefully to avoid disruptions.

Vendor Management

Many organizations rely on third-party technology providers. It is essential to ensure that these vendors also have strong operational resilience practices and agreements in place.

Testing and Validation

Regular testing of disaster recovery plans and incident response procedures is crucial for validating their effectiveness. These tests should simulate real-world scenarios to identify any weaknesses.

Capacity Planning

Planning for adequate capacity is essential to handle peak loads and potential surges in demand. This prevents service degradation and ensures that technology systems can handle increased usage.
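A back-of-the-envelope capacity calculation might look like the sketch below: size for peak demand plus a safety margin, then add spare instances so that a failure does not eat into that margin. The 30% headroom, the single spare, and the traffic figures are hypothetical assumptions.

```python
import math


def required_instances(peak_rps, per_instance_rps, headroom=0.3, spare=1):
    """Instances needed to serve peak load with headroom, plus spares for failures."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    needed = math.ceil(peak_rps * (1 + headroom) / per_instance_rps)
    return needed + spare


# Hypothetical figures: 1,800 requests per second at peak, 250 per instance.
instances = required_instances(peak_rps=1800, per_instance_rps=250)
print(instances)  # 11
```

Here 1,800 requests per second with 30% headroom is 2,340, which needs ten instances at 250 each, and one spare brings the total to eleven.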

Automation

Automation can reduce manual errors and speed up recovery processes. Automating tasks such as backups, patching, and failovers can improve efficiency and resilience.

Key Takeaways

  • Operational resilience is about ensuring business continuity despite disruptions.
  • It involves anticipation, prevention, response, and recovery.
  • Globalization, technology reliance, and regulations drive its importance.
  • Adaptability and proactive risk management are essential.
  • Organizations need frameworks for identifying Critical Business Services and setting impact tolerance levels.
  • Governance and ownership are crucial for effective operational resilience program implementation.
  • Operational resilience should be integrated into existing risk management.
  • Technology operations require robust infrastructure, cybersecurity, and incident response plans to support operational resilience.
  • Testing, monitoring, and automation are key controls to enhance resilience in technology.