The following is the incident report for the service outage that occurred on Thursday 13th September 2018. We understand the serious impact this had on our customers, and first and foremost we want to apologise wholeheartedly to those affected by the outage. Below you will find a summary of the events, including background on how our service works, a summary of what led to the outage, lessons learnt, action items detailing how we intend to prevent such disruptions from occurring in the future, and finally a full timeline of the events.
Outage Summary
During the early hours of the morning, two of our new proxy servers began to experience a degradation of service. A very small percentage of our customers may have been affected at the time. Our automated monitoring tools did not detect the degradation when it began; this is addressed in the action items below.
At the beginning of the working day, we endured some localised network disruption due to scheduled PAT (Portable Appliance Testing). This had no bearing on the ongoing issue, but it prevented us from diagnosing and correcting it as soon as we would have liked. Our support lines were operational in time for support hours, at which point we were made aware of the overnight degradation of service on the new proxy servers. Calls made through the primary failing server during this period will have failed. That server was immediately taken out of production, which resolved the disruption for the affected customers.
Lessons Learnt
What went well
- Once the issue was identified, a resolution was reached shortly thereafter.
- The issue only affected customers using our new proxy platform. The majority of our customers, who use our legacy proxy platform, were unaffected.
- Following identification of the issue, the architecture of the new proxy platform allowed us to restore full operability to customers within minutes.
What went badly
- The degradation of service began during the early hours of the morning, outside of normal business hours, and was not identified by our automated 24/7 monitoring. As a result, the on-call engineer was not alerted to the disruption.
- Furthermore, due to an unrelated localised network issue, we were unable to begin diagnosing the problem as soon as we would have liked.
Action Items
- We will be expanding our range of automated checks against the new proxy platform to give ourselves a more granular and more sensitive view of issues on these servers. The first of these checks, addressing the monitoring gap described above, has been implemented today (a simplified sketch of such a check follows this list).
- We will be reviewing the network used by our support and engineering staff in order to eliminate the potential for faults such as the one that occurred this morning.
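To illustrate the kind of automated check referred to in the first action item, below is a minimal sketch of a probe that connects to each proxy host and flags failed or slow connections. The hostnames, port, thresholds and alerting approach shown are illustrative assumptions only and do not describe our production monitoring configuration.

    # Minimal sketch of an automated proxy health check (Python), assuming the
    # new proxy platform can be probed with a plain TCP connection on its SIP
    # signalling port. Hostnames, port and thresholds are hypothetical
    # placeholders, not our production configuration.
    import socket
    import sys
    import time

    PROXY_HOSTS = ["proxy-new-1.example.net", "proxy-new-2.example.net"]  # hypothetical
    SIP_PORT = 5060          # assumed signalling port
    CONNECT_TIMEOUT = 2.0    # seconds before a connection attempt counts as failed
    MAX_LATENCY = 0.5        # seconds before a connection counts as degraded

    def check_proxy(host):
        """Return (healthy, detail) for a single proxy host."""
        start = time.monotonic()
        try:
            with socket.create_connection((host, SIP_PORT), timeout=CONNECT_TIMEOUT):
                latency = time.monotonic() - start
        except OSError as exc:
            return False, "connection failed: %s" % exc
        if latency > MAX_LATENCY:
            return False, "degraded: connect took %.3fs" % latency
        return True, "ok (%.3fs)" % latency

    def main():
        failures = []
        for host in PROXY_HOSTS:
            healthy, detail = check_proxy(host)
            print("%s: %s" % (host, detail))
            if not healthy:
                failures.append(host)
        # A scheduler (e.g. cron) would treat a non-zero exit code as an alert.
        return 1 if failures else 0

    if __name__ == "__main__":
        sys.exit(main())

In practice a probe of this kind would run on a schedule around the clock and page the on-call engineer directly, closing the monitoring gap described in the outage summary.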
To those customers affected by this outage we once again apologise unreservedly and restate our commitment to providing you with a robust and resilient service. We believe the changes we have already made, together with the planned action points outlined above, will reinforce this. We at Pebbletree greatly appreciate your patience both during the outage and in the period that followed, and we welcome any further queries relating to the incident or the RFO.
Sincerely, The Pebbletree Team