British Airways’ IT Failure: A Lesson For All High Reliability Organisations

If any organisation should be able to claim that it is a High Reliability Organisation, it is British Airways. It fits all of the requirements in terms of both its organisational structure (complex, multi-stranded, with high levels of mutual dependency between its myriad components) and its services (critical to all of its users, who depend on it to supply the services it has promised).

And yet the recent IT failure, which affected its global operations but has subsequently been blamed on a single operator in a sub-contracted maintenance company, seems to break all of the rules of HRO management – with predictable results.

To recap the situation: in May this year, BA’s ticketing system suffered a total and catastrophic failure during the spring bank holiday weekend, one of the busiest weekends of the year. That immediately affected the flights of over 75,000 passengers worldwide, but it also created disruption across the whole BA system, as well as affecting other airlines whose passengers were transferring to or arriving from BA flights.

Any international airline is dependent on a zero-failure operating system, as even the smallest failure – booking systems not working, luggage-handling failures, flight crews in the wrong place, refuelling not happening, food not being loaded in time – creates an instant chain of cascading consequences, with knock-on effects that quickly spread well beyond the original event.

They are dependent on systems that are, in a famous phrase associated with High Reliability Theory (HRT), ‘systems that are not only foolproof, but damned foolproof’ (1). However, systems as complex and interdependent as an international airline do not just happen. Nor are they simply an issue of correct design and professional management. They are, in fact, a reflection of the culture of the organisation.

Prior to the development of High Reliability Theory, the main academic school of thought concerning highly complex organisations was Normal Accident Theory, described by Perrow (2) in his study of the Three Mile Island nuclear accident, and which he then developed into a general theory of systems. It stated that the more complex an organisation was, the higher the likelihood of failure; and when that complexity involved two or more systems, themselves complex, interacting to create an even higher order of complexity, then failure became almost certain. In this sense, failure was not an abnormal state to be treated as an unexpected event, but an inevitable consequence of the complexity of the systems themselves.
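A simple (and admittedly stylised) piece of arithmetic shows why. If a system depends on n independent components, each of which works correctly with probability 0.999, then the whole system works with probability 0.999^n: roughly 90 per cent for 100 components, but below 37 per cent for 1,000. Tightly coupled, interacting components – Perrow’s real concern – fare worse still, because one component’s failure changes the conditions under which all of the others operate.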

One of the founders of High Reliability Theory – the study of how to create operational management programmes that are, in effect, fail-safe – was Karl Weick, who identified the foundation of high reliability not as an issue of technical management, but of ‘mindfulness’ (3). It is a sensitivity to the possibility of failure, and an organisational commitment at every level to preventing even the possibility of failure, that is the underlying quality on which all high reliability organisations depend.

One of the fundamental beliefs of Weick’s model of high reliability is that there is no such thing as an insignificant problem. Every problem is indicative of an underlying fault or weakness – a significant failing that allowed it to develop. If that systemic fault or weakness had not been there, the issue would have been identified and dealt with before it could become a problem at all.

Within a complex operation, the conclusion that follows is that the problem itself is not ‘an event’, but only the final visible part of a ‘long-chained causal process’, in which multiple other failings had to occur to allow the conditions to arise that enabled that incident to happen.

The final part of this strand of HRT states that not only is each incident important in itself, but each should be treated as an indicator of a potential crisis. Incidents are, in other words, not merely significant in their own right, but warnings of underlying problems that could at any time escalate into an actual crisis. A sketch of what that looks like in practice follows below.
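For readers who think in systems terms, the distinction can be made concrete. The sketch below is purely illustrative – the subsystem names, error rates and threshold are hypothetical, and nothing here describes BA’s actual architecture – but it contrasts a conventional monitor, which reacts only to outright failure, with an HRO-style monitor that treats every anomaly as the visible end of a causal chain and escalates it as a warning:

    # Illustrative sketch only: all names, rates and thresholds are hypothetical.
    SUBSYSTEMS = {
        "ticketing":        {"error_rate": 0.002, "failed": False},
        "baggage_handling": {"error_rate": 0.000, "failed": False},
        "crew_rostering":   {"error_rate": 0.011, "failed": False},
    }

    ANOMALY_THRESHOLD = 0.001  # any error rate above this is treated as a warning

    def conventional_monitor(systems):
        """Reacts only once a subsystem has already failed outright."""
        return [name for name, s in systems.items() if s["failed"]]

    def hro_monitor(systems):
        """Treats every anomaly as a warning to be escalated and investigated,
        long before anything actually fails."""
        alerts = []
        for name, s in systems.items():
            if s["failed"]:
                alerts.append((name, "CRISIS: subsystem down"))
            elif s["error_rate"] > ANOMALY_THRESHOLD:
                alerts.append((name, "WARNING: investigate the underlying cause"))
        return alerts

    print("Conventional view:", conventional_monitor(SUBSYSTEMS))  # -> []
    print("HRO view:", hro_monitor(SUBSYSTEMS))                    # -> two warnings

On this data the conventional monitor reports nothing at all, while the HRO-style monitor flags both ticketing and crew rostering for investigation – exactly the difference in mindset that Weick’s ‘mindfulness’ describes.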

To return to the BA incident, it is clear that the dependency on the IT system (and that means anything to do with technological support, not just the ticketing system) should have been triple-locked in terms of technical safety, and triple-locked again in terms of management attitudes and sensitivity to anything that could conceivably lead to systems failure.

It is alarming, therefore, to read in this morning’s newspaper that BA would not be publishing a review into its IT failure, and that Willie Walsh, chief executive officer of IAG, the parent company of British Airways, stated that the IT failure was an ‘isolated incident’. This is despite the fact that BA had suffered at least four computer-related failures in the last year, and that BA chief executive officer Alex Cruz had gone on record last year, telling investors that the airline’s IT systems were experiencing regular problems.

The ability to understand, and then correctly manage, complexity is an integral part of any management position; the higher the position, and the more complex the operation, the greater the responsibility. This incident happened to British Airways, but every High Reliability Organisation is likely to be vulnerable to similar issues. For those interested in other examples of high reliability organisations losing the cultural commitment to excellence and zero-failure operating environments, with catastrophic results, case studies include the NASA Challenger and Columbia disasters, BP and the Deepwater Horizon failure, and the multiple financial-institution IT failures that we have seen over recent years.

Sources

  1. Schulman, P. R. (2004). General attributes of safe organisations. Quality and Safety in Health Care, 13(suppl 2), ii39-ii44.
  2. Perrow, C. (1984). Normal Accidents: Living with High-Risk Technologies. Princeton University Press.
  3. Weick, K. E., Sutcliffe, K. M., & Obstfeld, D. (1999). Organizing for high reliability: Processes of collective mindfulness. Crisis Management, 3, 81-123.

See also my articles in Risk UK Magazine:

High Reliability Organisations: The New ‘Buzz Phrase’ For Business Management? Risk UK, December 2015, pp 50-51
High Reliability Organisations: A Model For Effective Risk Management. Risk UK, January 2016, pp 49-50

Deltar Ofqual-Regulated Training Programmes

For information, please visit:

3-day Ofqual-regulated Level 5 Management Award in Corporate Risk and Crisis Management
12-month distance-learning Ofqual-regulated Level 6 Diploma in Strategic Risk and Crisis Management