“They say, timing is everything. But then they say, there is never a perfect time for anything.” These words from U.S. author Anthony Liccione would have struck home to many British Airways passengers travelling from London last weekend. Their timing could not have been worse.
After a major power outage knocked out its IT systems, the airline left more than 75,000 passengers stranded for days at Heathrow and Gatwick airports and clocked up a reported $150 million in compensation costs.
Over the ensuing week, the company has attempted to salvage its reputation after heavy criticism of its crisis management plans, which appeared to constitute little more than providing yoga mats to stranded passengers.
British Airways eventually released a statement that said: “There was a loss of power to the UK data center which was compounded by the uncontrolled return of power which caused a power surge taking out our IT systems. So we know what happened, we just need to find out why.”
For many in the IT world, these words were risible. It seems that BA’s lack of vision in its yoga-mat crisis management ‘plan’ was mirrored in its approach to its IT systems.
Much has been made of the company’s decision to lay off hundreds of IT workers in its software department last year and outsource many of these jobs to India. The problem, however, may be more fundamental: a dated infrastructure that needs rapid updating.
Traditional data centers take a huge amount of time, cost and effort to set up and maintain, with contracts, procedures and the purchase of connectivity and hardware all taking months or years to come together.
As a result, companies such as BA are unlikely to have a presence in small data centers and tend to opt for large physical installations. It appears that both BA’s primary and backup systems were in the same building.
But when things go wrong, the all-eggs-in-one-basket strategy falls down as quickly as the IT systems it is supposed to support. If BA wants to ensure that “nothing like this happens again,” the company should think of moving to the cloud as quickly as possible.
In the cloud, services can be built quickly in a wide range of physical locations. A cloud data center can fail just like any other, but there is no upfront cost for companies that want a presence in multiple regions.
Consequently, it is easy to design systems that are active in many regions at the same time, so when one region fails, the systems still online can take up its share of the load.
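To make the idea concrete, here is a minimal sketch in Python of how load might be shared across several active regions and redistributed when one drops out. The region names, capacities and the `distribute_load` function are all hypothetical, invented for illustration; real cloud platforms do this through their own load balancers and health checks.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    capacity: int        # requests per second the region can serve (made-up figure)
    healthy: bool = True


def distribute_load(regions, total_load):
    """Split total_load across the healthy regions in proportion to capacity."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy regions left")
    total_capacity = sum(r.capacity for r in healthy)
    if total_load > total_capacity:
        raise RuntimeError("surviving regions cannot absorb the load")
    return {r.name: total_load * r.capacity / total_capacity for r in healthy}


# Three hypothetical regions, all active at the same time.
regions = [Region("london", 600), Region("dublin", 600), Region("frankfurt", 600)]

before = distribute_load(regions, 900)
regions[0].healthy = False          # simulate losing one data center
after = distribute_load(regions, 900)

print(before)
print(after)
```

The sketch only works because the regions are deliberately over-provisioned: the two survivors have enough spare capacity to absorb the share of the one that failed. With primary and backup in the same building, there is nothing left to redistribute to.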
“The change in infrastructure now is that we have both physical data center options and cloud-based options. The cloud is an easier way to reduce the risk of single points of failure. I don’t buy and run two cars in case one breaks down—I lease one from a company that has 1000’s of car options,” says Alan Walsh, CEO of Amido, a London-based technical consultancy that rebuilds old-fashioned infrastructures in the cloud.
It is also a lot easier to design these cloud systems for failure. A process known as resilience engineering (where designers deliberately attempt to break systems) is becoming increasingly important. It means engineers are forced to build and deliberately break things, which, in turn, creates a positive feedback cycle.
Netflix has been leading the way with its Chaos Monkey tool, which randomly terminates instances in its own production environment to expose weaknesses and check that the service survives failures without affecting its customers.
The company says the name comes from the notion of unleashing a wild monkey with a weapon into a data center, which sounds a bit like what happened to BA’s data centers last weekend.
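The principle behind this kind of testing can be shown with a toy example in Python. This is not Netflix's tool or its API; the `Service` class and `unleash_monkey` function are hypothetical stand-ins that simply switch off one randomly chosen instance and then check whether the service as a whole still answers.

```python
import random


class Service:
    """A toy service made of interchangeable instances (hypothetical)."""

    def __init__(self, instance_count):
        self.instances = {f"instance-{i}": True for i in range(instance_count)}

    def terminate(self, name):
        self.instances[name] = False

    def handle_request(self):
        # A request succeeds as long as at least one instance is still running.
        return any(self.instances.values())


def unleash_monkey(service, rng):
    """Terminate one randomly chosen running instance and return its name."""
    running = [name for name, up in service.instances.items() if up]
    victim = rng.choice(running)
    service.terminate(victim)
    return victim


rng = random.Random(42)             # seeded so the experiment is repeatable
service = Service(instance_count=3)

victim = unleash_monkey(service, rng)
survived = service.handle_request()

print(f"terminated {victim}; service still responding: {survived}")
```

A service built from a single instance would fail this experiment on the first run, which is the point: the weakness shows up in a controlled test rather than on a bank holiday weekend.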
Another company that is ‘hyperscaling’ in this way is online retailer ASOS, whose whole business depends on speed. Its website has 13.4 million active customers, 125 million unique visitors per annum and 100 million page views daily.
According to Amazon, every 100 milliseconds of delay or latency costs it $2.6 billion, while Google says that a 0.5 second delay leads to a 20 percent decline in search traffic. Annual events such as Black Friday, and their importance in retail, mean ASOS must plan for failure.
ASOS realised that its existing IT architecture would not support its next stage of growth and so transitioned to the cloud to be part of a new global architecture.
Not only is the new platform geographically distributed, meaning it can withstand power outages, but it can also keep pace with developments in e-commerce such as augmented reality and conversational commerce. Such a strategy leaves ASOS well-prepared for random IT accidents, unlike BA’s current processes.
IT failures such as the BA outage mean that, while the company’s planes are stranded on the ground, it should be looking upwards to the cloud to prevent any further loss of revenue and damage to customer experience. That would be true blue-sky thinking.
Monty Munford is a technology journalist who has worked for 20 years in the mobile and digital sectors.