Just a few weeks ago, Brussels airspace was closed for several hours following a technical problem. Labelled “a disaster” for Brussels Airport, the incident saw flights cancelled, delayed or diverted and passengers stranded. Unfortunately, this isn’t an isolated event; in fact, such incidents are on the rise.
TSB and Visa have both recently suffered technical downtime which left customers unable to access their accounts or make payments. Last year, British Airways had not one but two system failures, the first of which saw 75,000 passengers grounded, with the head of parent company IAG admitting it was “damaging to our reputation”. The trouble didn’t end there, with another two system failures this year causing “chaos” at Heathrow. Given the vow of “never again” following the first incident, what is it that keeps going wrong, and what should these companies be doing to avoid it happening again?
A common theme between these companies is that they all have big, complex IT systems. Because of that, and because of their reliance on these systems, they should have processes in place, such as resilient IT, crisis management, disaster recovery (DR) and business continuity, which mean that when an issue arises within their IT systems, the impact on operations is minimised. Perhaps more importantly, when the systems are brought back up following an outage, these processes should ensure the companies are focussed on solving the operational carnage and reputational damage that the outage caused, rather than simply picking up where they left off. The likelihood is that companies do have these processes in place, so why aren’t they working?
A quick analysis of recent airline IT-related disasters shows that an outage of a mere 30 minutes on an essential IT system is more than enough to cause a newsworthy operational knock-on effect. This time criticality has, quite rightly, prompted the airline industry to improve the resilience of its IT by investing in high availability (HA) systems to minimise the chance of operational disruption. Its complex IT systems are frequently designed to provide eye-watering application uptimes on a daily basis by being “fault tolerant”, which is achieved by duplication and redundancy in the technology systems and/or, in extremis, “fail-over” to another solution.
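To put the 30-minute figure in context, here is a rough back-of-envelope sketch of how much downtime a given uptime target permits per year. The targets used (99.9%, 99.99%, 99.999%) are commonly quoted examples assumed for illustration; they are not figures from any airline, and the function name is hypothetical.

```python
# Illustrative only: the availability targets below are assumed examples,
# not figures taken from any airline or vendor.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

On these assumed targets, a single 30-minute outage would consume more than half of the yearly downtime budget of a 99.99% system, which gives a sense of why such a short interruption is enough to make the news.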
However, whilst an HA system works to ensure the business-as-usual availability of information and technology services by making the IT system more resilient to faults and component failures, it does not provide the means to recover those services (e.g. infrastructure, telecoms, systems applications, data and the Service Desk) in extreme cases of downtime or disruption.
In short, an HA solution without an associated IT DR solution is just asking for trouble.
But beware – although disaster recoveries can now happen very quickly, in businesses such as airlines where severe operational impacts are almost instantaneous, a DR solution designed solely as a backstop for a business as usual HA solution will provide very little value to a disrupted business attempting to sort out operational chaos.
I don’t know if Brussels Airport or British Airways, or indeed TSB and Visa, made these mistakes. But I do know that HA without a DR solution that has been defined by the business continuity and crisis management needs of the business will result in acute failure once the IT resilience afforded by the HA is overwhelmed. This is then followed by a prolonged and disproportionate operational impact, as the first systems to be recovered will be those that are needed for normal operation rather than those needed to sort out operational backlogs.
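The difference between recovering for normal operation and recovering for the aftermath of an outage can be sketched as two orderings of the same set of systems. Everything in this example is hypothetical: the system names, the rankings and the `recovery_order` helper are invented purely to illustrate the idea, and real priorities would come from the organisation's own business continuity analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    bau_rank: int     # importance for normal day-to-day operation (1 = highest)
    crisis_rank: int  # importance for clearing the backlog after an outage (1 = highest)


# Hypothetical systems and rankings, invented for illustration only.
SYSTEMS = [
    System("online check-in", bau_rank=1, crisis_rank=3),
    System("flight scheduling", bau_rank=2, crisis_rank=4),
    System("passenger rebooking", bau_rank=3, crisis_rank=1),
    System("customer notifications", bau_rank=4, crisis_rank=2),
]


def recovery_order(systems, rank_field):
    """Return system names sorted by the chosen ranking, most urgent first."""
    return [s.name for s in sorted(systems, key=lambda s: getattr(s, rank_field))]


if __name__ == "__main__":
    print("DR as a backstop for HA:  ", recovery_order(SYSTEMS, "bau_rank"))
    print("DR defined by the business:", recovery_order(SYSTEMS, "crisis_rank"))
```

With these made-up rankings, a recovery plan ordered by business-as-usual importance brings back the rebooking and notification systems last, even though they are the ones a disrupted operation would need first.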
What can businesses do?
It is tempting to think that ensuring the resilience or continuity of all the individual parts of a business will guarantee the resilience or continuity of the whole. However, as the airline examples demonstrate, this is not necessarily the case.
Whilst it makes perfect sense for each element of the business (e.g. IT, Operations, Finance and Marketing) to be resilient in its own right against low impact, high probability risks by using high availability techniques, businesses also need to adopt the tried and tested business continuity and organisational resilience approach for high impact risks.
This is particularly important when thinking about IT. Disaster Recovery capabilities need to focus on the recovery of the business rather than recovery of the IT system. Only then will they be able to minimise the impact on the citizen or the customer.
Incidents of this type and scale just shouldn’t be happening today. The technology and expertise are out there to ensure that if an IT issue strikes, it doesn’t completely cripple the company, both logistically on the day and in terms of reputation. It’s been a while since British Airways was known as “the world’s favourite airline”, and the latest spate of technical failures won’t be helping to restore confidence. As our reliance on technology continues to grow, businesses in all industries need to ensure that they fully understand the difference between high availability and disaster recovery, and that disaster recovery is of no use unless it recovers the things that a business needs in the immediate aftermath of a disaster.
Sandra is a seasoned risk and business continuity professional with over 25 years’ experience of the design and management of risk, continuity and security solutions in public and private sectors. Her experience ranges from the management of risk and the protection of data within the UK’s main business process outsourcer to the financial and government sectors; to the protection of the largest part of the UK’s critical national infrastructure. She also has first-hand experience of managing incident response and disaster recovery activities from the management of high profile DDoS attacks to severe flooding incidents.