The following is the incident report for the service disruptions that occurred during the week commencing 17 September 2018. We understand the serious impact this had on our customers, and first we want to apologise wholeheartedly to everyone these disruptions affected.
Below you will find a summary of the events: background on how our service works and the work we are currently undertaking, what led to the disruption and its root cause, and finally the action items we are taking to prevent such disruptions from occurring in the future.
Background and Outage Summary
Our platform utilises a number of servers in the UK and Ireland, housed in different data centres. This adds to our resiliency and ensures that, in the case of an issue, we are able to move traffic away from the affected network and route it to other data centres. Recently, we have begun the process of introducing a new proxy platform, designed to improve our overall quality of service and expand our resiliency even further.
In order to operate a cluster of redundant SIP proxy servers, we require a method of sharing data between them. In our older platform we achieved this by replicating inbound registrations to the redundant host. While this had the benefit of being a simple solution, the older platform only operated in an active-backup manner, meaning that certain servers were allocated solely as backups.
The new proxy platform does away with this limitation and allows active-active operation, meaning all servers are used actively, sharing the load in real time. Should an outage occur, it would generally be limited to one location and a subset of customers, who can then be moved to another operable active proxy at a different location.
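To illustrate the difference in plain terms, here is a minimal Python sketch contrasting the two models. The names and structures are hypothetical and simplified for illustration; they are not our production code.

```python
# Illustrative sketch only: hypothetical names, not production code.
# Contrasts the old active-backup replication with the new active-active model.

from dataclasses import dataclass, field


@dataclass
class Registration:
    user: str     # SIP address of record, e.g. "sip:alice@example.com"
    contact: str  # where the user can currently be reached


@dataclass
class Proxy:
    name: str
    registrations: dict = field(default_factory=dict)

    def register(self, reg: Registration) -> None:
        self.registrations[reg.user] = reg


# Old model: every registration handled by the active proxy is copied
# verbatim to an idle backup, which only serves traffic after a failover.
def replicate_active_backup(active: Proxy, backup: Proxy, reg: Registration) -> None:
    active.register(reg)
    backup.register(reg)  # backup holds a full copy but takes no load


# New model: every proxy is live; a registration is owned by the proxy that
# received it, and load is spread across the whole cluster.
def route_active_active(cluster: list, reg: Registration) -> Proxy:
    owner = cluster[hash(reg.user) % len(cluster)]
    owner.register(reg)
    return owner
```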
Scaling registration state under the old method of replication was a concern that required addressing. Our newer platform therefore made use of a clustering module to facilitate smarter sharing of state, exchanging data only when necessary and so minimising workload and bandwidth use.
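As a rough illustration of the idea, the sketch below (again hypothetical, and not the actual clustering module) shows state being fetched from peers on demand rather than copied in full to every node.

```python
# Illustrative sketch only: hypothetical interface, not the actual clustering module.
# Instead of copying every registration to every node, a proxy that lacks a
# record asks its peers for it at the moment it is needed.

class ClusterNode:
    def __init__(self, name, peers=None):
        self.name = name
        self.local = {}           # registrations this node received itself
        self.peers = peers or []  # other active nodes in the cluster

    def lookup(self, user):
        # Serve from local state when we already own the registration.
        if user in self.local:
            return self.local[user]
        # Otherwise query peers on demand; only the record we need crosses
        # the network, rather than a continuous full replication stream.
        for peer in self.peers:
            record = peer.local.get(user)
            if record is not None:
                return record
        return None
```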
Following the first instance of disruption, we added automated monitoring to catch the problem before it could impact customers. In addition, on-call engineers were asked to take detailed snapshots of the system's state while the problem was occurring. Upon concluding this action item, we discovered that the clustering module was at fault: under certain circumstances it could lock up threads and sockets, and while these resources were locked, registrations and calls would fail to be processed in time.
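The sketch below gives a flavour of the kind of automated check involved: a periodic SIP OPTIONS probe to each proxy, alerting if no reply arrives within a deadline, which is how hung threads and sockets show up from the outside. The probe format and threshold here are illustrative assumptions rather than our exact monitoring configuration.

```python
# Illustrative sketch only: assumed probe and threshold, not our exact monitoring.
import socket
import time

PROBE_TIMEOUT = 2.0  # seconds; assumed alerting threshold

OPTIONS_TEMPLATE = (
    "OPTIONS sip:probe@{host} SIP/2.0\r\n"
    "Via: SIP/2.0/UDP monitor.invalid;branch=z9hG4bK-probe\r\n"
    "From: <sip:monitor@monitor.invalid>;tag=probe\r\n"
    "To: <sip:probe@{host}>\r\n"
    "Call-ID: probe-{ts}\r\n"
    "CSeq: 1 OPTIONS\r\n"
    "Max-Forwards: 70\r\n"
    "Content-Length: 0\r\n\r\n"
)


def probe_proxy(host: str, port: int = 5060) -> bool:
    """Return True if the proxy answers a SIP OPTIONS probe in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(PROBE_TIMEOUT)
        message = OPTIONS_TEMPLATE.format(host=host, ts=time.time())
        sock.sendto(message.encode(), (host, port))
        try:
            sock.recv(4096)   # any response means the worker threads are alive
            return True
        except socket.timeout:
            return False      # hung threads/sockets show up as silence


def check_fleet(hosts):
    for host in hosts:
        if not probe_proxy(host):
            print(f"ALERT: {host} did not answer within {PROBE_TIMEOUT}s")
```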
Root Cause of Disruption
The fault with the clustering module was the root cause of the disruption last week. While we would have liked to avoid additional disruption, the technical domain of the fault made diagnosis difficult and remediation slow.
Resolution
Once we identified the root cause, we set about porting the older replication-based solution to the new platform. This, alongside work to enhance the performance of our databases, was finished on Friday.
Since this implementation we have seen no recurrence of the fault. While this does have implications for the expansion of our new proxy platform, these can be managed as the need arises. Work towards delivering a mature and stable proxy solution continues, but we believe the prognosis is good and do not expect further disruption.
For those customers affected by this outage we once again apologise unreservedly and want to restate our commitment to providing you with a robust and resilient service. We believe the changes we have made and the planned works outlined above will reinforce this. We at Pebbletree greatly appreciate your patience both during the outage and in the period that followed, and welcome further queries relating to the incident or RFD.
Sincerely, The Pebbletree Team