The following is the incident report for the service outage that occurred on Tuesday, 11th September 2018. We understand the serious impact this had on our customers, and we first want to apologise wholeheartedly to everyone this outage affected. Below you will find a summary of events. This includes background on how our service works, what led to the outage, the root cause of the outage, lessons learnt and, finally, action items describing how we intend to prevent such disruptions from occurring in the future.
Background
Our platform utilises a number of servers in the UK and Ireland to manage call traffic, our Quvu dialler system and other services. These are housed in separate data centres, which improves our resiliency and ensures that, in the event of an issue, we are able to move traffic away from the affected network and route it to other data centres.
Outage Summary
At 11:30 on the morning of the outage, the primary legacy proxy server was proactively removed from production to mitigate a potential memory issue. This proxy was brought back into operation remotely at 13:15 to restore the resiliency of our service for customers using the legacy proxy platform. At this point, due to what we believe to be a low-level fault, the server immediately started to broadcast large amounts of data over the local network to multiple hosts, in what is known as a broadcast storm. As this was a proactive task being carried out by our development team, we were immediately aware of it. The command to shut down the server was issued; however, the broadcast storm continued.
Due to the quantity of traffic being sent out by the faulty proxy, switches in the network quickly became congested and were unable to handle traffic. This in turn resulted in difficulties for all customers logging into Quvu. Customers calling through the legacy or new proxy platform would still have been able to make calls; however, due to the congestion caused by this issue, there may have been instances of call failure where affected media servers were used.
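Congestion of this kind shows up as a sharp rise in broadcast frames on the affected network segments. The following is a minimal illustrative sketch only, not our production monitoring; the interface name and alert threshold are assumptions. It counts Ethernet frames addressed to the broadcast MAC over a short window so that an alert can be raised when the rate spikes.

```python
# Illustrative broadcast-storm check: count frames sent to ff:ff:ff:ff:ff:ff
# over a short window. Requires root on a Linux host attached to the segment.
import socket
import time

IFACE = "eth0"          # hypothetical interface name
THRESHOLD_PPS = 5000    # illustrative alert threshold (frames per second)
BROADCAST_MAC = b"\xff" * 6

def broadcast_rate(window_s: float = 1.0) -> float:
    """Return the rate of Ethernet broadcast frames seen on IFACE."""
    # Raw packet socket capturing all protocols (ETH_P_ALL = 0x0003).
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))
    sock.bind((IFACE, 0))
    sock.settimeout(0.1)
    count = 0
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        try:
            frame = sock.recv(65535)
        except socket.timeout:
            continue
        # The destination MAC is the first 6 bytes of the Ethernet frame.
        if frame[:6] == BROADCAST_MAC:
            count += 1
    sock.close()
    return count / window_s

if __name__ == "__main__":
    rate = broadcast_rate()
    if rate > THRESHOLD_PPS:
        print(f"possible broadcast storm: {rate:.0f} broadcast frames/s on {IFACE}")
    else:
        print(f"broadcast rate normal: {rate:.0f} frames/s on {IFACE}")
```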
In order to resolve this, a channel of communication was established with on-site data centre engineers, who were able to physically remove the server from production. Once this was done, Quvu customers immediately began logging in again and service resumed across the board.
Root Cause of Failure
The root cause of the failure is currently being investigated. Given the extreme quantity of traffic broadcast over the network, we expect the cause to be a low-level fault with the server in question, and we expect analysis in the coming days to confirm the nature of the fault.
Lessons Learnt
What went well
- The issue was identified immediately, and the scope of the outage was managed and kept to a minimum.
- Manual dialling and other services, such as account management and voicemail, were completely unaffected as these make use of servers in our other data centres.
- Following the removal of the faulty equipment, we have seen stability and full operability of all affected services.
What went badly
- Our partners at the data centre were not informed prior to the change, leading to a longer than acceptable time to resolution.
- Our remote over-the-network method of shutting down the server did not resolve the issue as the broadcast storm continued.
- The processes in place for restoring a server that had been administratively taken down were not fit for purpose and did not anticipate this type of disruption.
Action Items
- Further investigate the root cause of the issue so that we can guard against similar low-level faults.
- Develop our procedure for restoring hosts that have been removed from production.
- Investigate additional out-of-band methods for remotely powering off a host over the network (see the illustrative sketch below).
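As an illustration of the kind of out-of-band control we are evaluating, the sketch below issues a power-off through a server's baseboard management controller (BMC) over IPMI, which does not rely on the host's operating system or on the congested data network. The hostname, credentials and use of ipmitool are assumptions made for the purpose of the example.

```python
# Illustrative sketch only: power off a host out-of-band via its BMC using IPMI.
# BMC address and credentials below are placeholders, not real systems.
import subprocess

BMC_HOST = "bmc.example.internal"   # hypothetical BMC address
BMC_USER = "admin"                  # hypothetical credentials
BMC_PASS = "secret"

def power_off_via_bmc() -> bool:
    """Issue a chassis power-off through the server's BMC over IPMI."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus",
         "-H", BMC_HOST, "-U", BMC_USER, "-P", BMC_PASS,
         "chassis", "power", "off"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"IPMI power-off failed: {result.stderr.strip()}")
        return False
    print(result.stdout.strip())
    return True

if __name__ == "__main__":
    power_off_via_bmc()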
For those customers affected by this outage, we once again apologise unreservedly and want to restate our commitment to providing you with a robust and resilient service. We believe the changes we have made and the planned action points outlined above will reinforce this. We at Pebbletree greatly appreciate your patience both during the outage and the period that followed, and we welcome further queries relating to the incident or RFO.
Sincerely, The Pebbletree Team