Taking a mature approach to incident management
Stuff breaks. It’s a fact of life. But a key differentiator among cloud infrastructure providers is how they prepare for incidents, how they handle an incident as it occurs, and how they recover the situation afterwards. It’s this last factor, how the recovery is managed, that tends to matter most to clients.
A mature managed cloud provider, like Navisite, will understand and make use of ITIL processes for operations management: it’s a core methodology and set of values that we work with, putting maturity and consistency at the centre of company culture.
These processes put the client’s business success front and centre. One thing we find many organisations fail to do is fully understand the business impact of technical incidents: they treat incidents as purely an IT failure, as though that were somehow separate from the business itself. But that IT failure underpins a business process, and understanding that process is vital. We train our front-line people to ask the right questions, so they understand from the client what the incident means to their business and, therefore, its impact. From that, we can prioritise the incident correctly, applying different levels of process depending on the severity of the business impact.
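To make that concrete, here’s a minimal sketch of the classic ITIL-style priority matrix, which derives an incident priority from business impact and urgency. The impact and urgency levels, and the mapping itself, are illustrative assumptions rather than Navisite’s actual scheme.

```python
# Minimal sketch of an ITIL-style priority matrix (illustrative values,
# not Navisite's actual scheme). Priority is derived from business
# impact and urgency, both assessed with the client.

from enum import IntEnum

class Impact(IntEnum):
    HIGH = 1      # a whole business service is down
    MEDIUM = 2    # one team or site is affected
    LOW = 3       # a single user or a cosmetic issue

class Urgency(IntEnum):
    HIGH = 1      # damage grows quickly (e.g. during trading hours)
    MEDIUM = 2
    LOW = 3       # an acceptable workaround exists

# Rows: impact; columns: urgency. P1 triggers the major-incident process.
PRIORITY_MATRIX = {
    (Impact.HIGH,   Urgency.HIGH):   "P1",
    (Impact.HIGH,   Urgency.MEDIUM): "P2",
    (Impact.HIGH,   Urgency.LOW):    "P3",
    (Impact.MEDIUM, Urgency.HIGH):   "P2",
    (Impact.MEDIUM, Urgency.MEDIUM): "P3",
    (Impact.MEDIUM, Urgency.LOW):    "P4",
    (Impact.LOW,    Urgency.HIGH):   "P3",
    (Impact.LOW,    Urgency.MEDIUM): "P4",
    (Impact.LOW,    Urgency.LOW):    "P5",
}

def prioritise(impact: Impact, urgency: Urgency) -> str:
    """Return the incident priority for a given impact/urgency pair."""
    return PRIORITY_MATRIX[(impact, urgency)]

# Example: a blocked payroll run on the last day of the month.
print(prioritise(Impact.HIGH, Urgency.HIGH))  # -> "P1"
```

The point of a matrix like this is that the priority comes from the business conversation, not from the component that failed.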
Most incidents are relatively minor and not even noticed by clients: a component fails, but our diagnostics pick it up, and the redundancy built into the system means the front-end service is unaffected. Designing the right level of resilience into systems is what enables that availability. We can route our clients’ data around the issue while our engineers switch out the faulty component for a new one, and we look at whether anything could be changed to drive service improvements. And your business keeps running smoothly.
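As a simplified illustration of that routing-around-failure idea, here’s a sketch that probes a pool of redundant endpoints and picks the first healthy one. The endpoint addresses and the is_healthy() probe are hypothetical, invented purely for illustration.

```python
# Simplified sketch of routing around a failed component. The endpoint
# addresses and the liveness probe are hypothetical, for illustration only.

import socket

ENDPOINTS = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]  # redundant back-ends

def is_healthy(host: str, port: int = 443, timeout: float = 1.0) -> bool:
    """Crude liveness probe: can we open a TCP connection in time?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_endpoint() -> str:
    """Return the first healthy endpoint; escalate if all have failed."""
    for host in ENDPOINTS:
        if is_healthy(host):
            return host
    raise RuntimeError("No healthy endpoints: escalate to major incident")
```

Real platforms do this with dedicated load balancers and richer health checks, but the principle is the same: a single component failure never has to become a client-visible outage.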
Major incident ahead
Even with 99.999 per cent uptime SLAs, major incidents happen. A major incident is one in which the client suffers some form of outage: they’re feeling the effect on the front-end. How we react to that, and how professionally we handle it, is key to our service.
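To put that figure in context, 99.999 per cent availability still permits a small amount of downtime each year, as this quick calculation shows:

```python
# What a 99.999% ("five nines") uptime SLA actually allows per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60                 # ~525,960 minutes
allowed_downtime = MINUTES_PER_YEAR * (1 - 0.99999)
print(f"{allowed_downtime:.2f} minutes/year")       # -> 5.26 minutes/year
```

Roughly five and a quarter minutes a year: enough headroom that no provider can promise clients will never feel an outage.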
These major incidents can sometimes be the result of component failure, but more often than not they aren’t. When a change freeze is in place, things don’t tend to break nearly so often: most major incidents are the consequence of a deliberate change to an established system. Change is unavoidable, though, because you can’t maintain systems without making changes. What typically happens is that a new piece of software is introduced and causes an interoperability problem, or an existing component is replaced and the replacement causes problems, because the implementation plan was wrong, wasn’t rehearsed, or both. This is one reason people who work in operations management are so keen on proper processes. But I digress: change management is a topic for another blog post.
And it’s here, more than ever, that communication is vital. When something goes wrong, you don’t want to be chasing your cloud services provider for an answer – they should already be in touch as a trusted partner. That’s why we are constantly communicating, giving updates to clients, engaging and bringing them into the troubleshooting process as required. A consistent piece of client feedback is that what matters isn’t that an outage happened, but how well the recovery was managed – and that it was managed openly, transparently, and in collaboration with the client.
This collaborative element is crucial because Navisite, generally speaking, is most likely to be providing a platform on which the client is running their business applications. So when we’re recovering a situation, we’re only able to recover up to a certain layer. The client then has their application on top, which they in turn need to bring back online before the business issue is resolved. As an extension of your internal team, we treat it as a business problem: the incident isn’t over – and we’ll continue to advise and assist – until the client is completely restored to normal business operations.
Back up for bulldozers
No matter how or why it happened, Navisite needs to own the problem, communicate continually with the partners involved, identify why the problem has occurred and work with clients to ensure a return to full service. Quite often, it’s not something we could have foreseen or prevented: it might be a problem with the client’s own network, for example, or with the service from another provider. No matter: our engineers will offer their expertise and knowledge to help solve the issue.
Recently, for example, a client got in contact because they’d lost connectivity. After we’d looked into the matter and found no fault with our systems, it emerged that their local telecoms provider had suffered a cut to some important fibre cables, courtesy of an overzealous road maintenance team with a bulldozer! Happily, we were able to help in this instance by re-routing their connection through a different provider. That sort of solution might not always be possible, but it’s a good example of what we mean when we say we’ll take ownership of problems raised.
Maintaining trust and a good relationship is directly related to the level and quality of communication with a client during and after any incident, and it’s an area in which I believe Navisite excels. The hub of this is our Proximity portal, an all-in-one client gateway for interacting with Navisite and other IT partners. When an incident occurs, a ticket is raised on Proximity and then, as our engineers work on resolving the ticket, they give real-time updates through the system. These are the same records that we use internally, so it’s a totally transparent process.
At the other end of our incident communications cycle, once the incident is resolved, a report is created. Because these reports go to stakeholders at different levels of our clients’ businesses, they contain two sections. The first is an executive summary for C-level readers: what happened, why, and how it might be avoided in future. The second gives a full technical breakdown of everything that occurred and was discovered during the incident.
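To sketch what that two-part structure looks like in practice, here’s a hypothetical outline of such a report as a simple data structure. The field names are illustrative assumptions, not Navisite’s actual template.

```python
# Hypothetical sketch of a two-part incident report, mirroring the
# structure described above. Field names are illustrative, not
# Navisite's actual template.

from dataclasses import dataclass, field

@dataclass
class IncidentReport:
    # Executive summary: for C-level readers.
    what_happened: str        # plain-language description of the outage
    why_it_happened: str      # root cause, expressed in business terms
    future_prevention: str    # how a repeat might be avoided

    # Technical breakdown: for engineering stakeholders.
    timeline: list[str] = field(default_factory=list)  # events, in order
    findings: list[str] = field(default_factory=list)  # what was discovered
```

Separating the two audiences this way means the same incident record can answer both “what did this cost the business?” and “exactly what failed, and when?”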
Will incident management ever get easier? I hope so. As we continue to rebuild our internal diagnostic tools and take advantage of improvements in the capabilities of technology, we’re finding more ways to pre-empt incidents: automatically scheduling the replacement of components, or the rebuilding of systems, before anything breaks. Stuff will continue to break, I’m sure – but if we can find ways to predict it, that will be a big jump forwards.
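As a toy illustration of that kind of pre-emption, here’s a sketch that flags a component for proactive replacement when its recent error rate trends upwards. The metric, window sizes, and threshold are all invented for illustration; real predictive maintenance draws on far richer telemetry.

```python
# Toy illustration of pre-emptive replacement: flag a component whose
# recent error rate is rising. The metric, windows, and threshold are
# invented for illustration only.

def should_replace(error_counts: list[int], threshold: float = 2.0) -> bool:
    """Flag a component for proactive replacement.

    error_counts: daily corrected-error counts, oldest first.
    Returns True if the average over the last 3 days exceeds `threshold`
    times the average over the preceding days.
    """
    if len(error_counts) < 6:
        return False  # not enough history to judge a trend
    recent = error_counts[-3:]
    baseline = error_counts[:-3]
    baseline_avg = sum(baseline) / len(baseline) or 1  # avoid divide-by-zero
    return (sum(recent) / 3) > threshold * baseline_avg

# Example: a disk whose corrected-error count is climbing day on day.
print(should_replace([1, 1, 2, 1, 5, 8, 12]))  # -> True
```

Catching that trend means the swap happens in a planned maintenance window rather than as a 3 a.m. incident – which is exactly the jump forwards the paragraph above is hoping for.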