HOPS



HOPS Downtime - 27/28 November 2019

We apologise that we were offline overnight on 27/28 November.

This was due to a data integrity issue, caused by a small bug in the part of HOPS that manages user details, that needed to be fixed before the system could be returned to service.

Staff were working in HOPS HQ until 2am. Unfortunately it was not possible to resolve the issue satisfactorily, even with the help of the level 1 and 2 24-hour on-call staff from our hosting provider. A level 3 engineer booked on at 0700 and the system was restored at 0740.

Changes that were made to users' postal addresses and emergency contacts between 1730 and 1900 last night (27/11) may not have been fully saved. Users' telephone number and email address updates are not affected, nor is any other part of HOPS. If anyone should find anything still amiss please contact us on 0118 321 8752 immediately.

We apologise for the inconvenience caused by this outage.


MORE DETAILS...

We are aware that, to many (most) HOPS clients, the system is business-critical and we don't take downtime lightly.

Last night's problem started when update work was uploaded at 1800 (27/11). Thanks to reports from HOPS Admins at the Dean Forest and other lines, the issue was identified as causing some users to become invisible, and at 1900 the system was deliberately taken offline to prevent damage from any attempted "corrections". Staff returned to HOPS HQ and, with help from our server provider, worked until 0200 (28/11) to identify and attempt to resolve the issue, which turned out to be connected to users' emergency contact details. This is the sort of work that is normally done calmly, offline, during 'business as usual'. Due to the complexity of HOPS, it became clear that input would be required from external engineering staff who are not on call 24/7, so work was paused until 0700 (28/11), when these staff were available. The required data was available again shortly afterwards, and online access to the system was restored at about 0740 (28/11).

I would like to say comforting (if anodyne) things like 'we'll take steps to make sure this doesn't happen again', and of course I hope that it won't, but I have always been frank with you all and I will be again now: Little bugs and niggles occur all the time. Most of the time they affect only a small part of the system and most of the time they're fixed within a few minutes. Sometimes small parts of the system are taken down while they are fixed, and most of the time only a small proportion of users are affected or even notice. It was unfortunate that although last night's issue was just a little bug, it was in a component of HOPS that is used in nearly every part of the system (user details), so it was very noticeable.

One tiny mechanical component on a steam locomotive can fail, even though engineers do everything reasonably possible, with the resources they have available, to avoid that one embarrassing failure; occasionally it will occur and put the whole machine out of use. In the same way, last night one tiny component of HOPS (emergency contact details) failed. Even though we do everything reasonably possible, with the resources we have available, to avoid that embarrassing failure, it occurred and put the whole machine out of use (brought offline deliberately). This wasn't a derailment, it was a tiny vacuum leak, but it still stopped the loco... across the points at the throat of the station, preventing any other trains running. A small fault, in the wrong place at the wrong time.

What I can say, that hopefully gives more confidence, is that the resilience of HOPS is increasing all the time. We are undertaking work that is far stronger and more robust than we ever have before. We are involving far more (expensive) external professional services than we ever have before, and although we tripped up last night, I am confident in the performance of HOPS.

All websites have downtime occasionally and, even if HOPS were down for a whole day, its uptime would still be 99.7% over the year. HOPS is currently on 99.86% uptime in 2019 (99.87% in 2018).
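
For anyone who wants to check the arithmetic behind these percentages, the short Python sketch below may help. It is illustrative only and is not taken from HOPS monitoring: the function names are made up for this example, and it assumes a full non-leap year of 365 days, whereas the 2019 figure quoted above is a year-to-date number.

```python
# Illustrative uptime arithmetic only; not taken from HOPS monitoring.
# Assumption: a full non-leap year (such as 2019) is used as the period.
HOURS_PER_YEAR = 365 * 24


def uptime_percent(hours_down: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Percentage of the period during which the system was available."""
    return 100.0 * (1.0 - hours_down / period_hours)


def implied_downtime_hours(uptime_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime implied by an uptime percentage over the period."""
    return period_hours * (1.0 - uptime_pct / 100.0)


# One whole day offline in a year gives about 99.73%, quoted above as 99.7%.
print(round(uptime_percent(24), 2))

# If 99.86% applied to a full year, it would imply roughly 12 hours offline in total.
print(round(implied_downtime_hours(99.86), 1))
```

Treat the second figure as a rough guide only, since the period actually measured for 2019 ends before the year does.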

A lot of favours were called in last night and this morning, and a lot of expense will be incurred (both corporately and personally) so please don't think that HOPS and I don't feel the pain too. Although it is embarrassing that we were offline overnight, I am proud of the work that was done by our staff and contractors to resolve the issue, and that our response plans worked in anger and under pressure to restore the system.

I apologise again for the inconvenience caused by this outage and hope that the above detail goes some way to explaining the background. If anyone wishes to discuss this further or has any other questions, please get in touch via any of the normal channels.


