Why problems make us stronger
Problem Management is about establishing the root causes of incidents and guarding against them in the future. Let’s say (hypothetically) a client’s service stops working on a Friday afternoon. They raise a ticket. We look at all the relevant dashboards, check the hardware and take whatever remedial action is required. The service is restored and we close the ticket. That’s an incident.
A problem is defined by ITIL as the ‘unknown cause of one or more incidents’. For us, Problem Management starts when we analyse recent tickets and discover that the same thing has happened again, or that the same service has failed in the same way for a different client. While our actions during these incidents were enough to restore service, they may not have been enough to prevent the incident from recurring. That’s bad for the client, of course, and also bad for Navisite, in terms of the efficiency of the business. In an ideal world, we wouldn’t have to make the same fix more than once, and getting closer to that ideal is the job of Problem Management.
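As a rough illustration of that kind of ticket analysis, the sketch below groups closed incident tickets by service and symptom and flags any combination that has occurred more than once. It is a minimal example under stated assumptions, not a description of our actual tooling: the `Ticket` fields, the `find_problem_candidates` function, the sample data and the threshold of two incidents are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    # Hypothetical ticket fields, chosen for illustration only.
    ticket_id: str
    client: str
    service: str
    symptom: str


def find_problem_candidates(tickets, min_incidents=2):
    """Group incidents by (service, symptom) and keep the repeated ones."""
    groups = defaultdict(list)
    for ticket in tickets:
        groups[(ticket.service, ticket.symptom)].append(ticket)
    # A combination seen at least `min_incidents` times may share a root cause.
    return {key: found for key, found in groups.items() if len(found) >= min_incidents}


# Invented sample data: two clients hit the same fault on the same service.
tickets = [
    Ticket("T1", "client-a", "web-hosting", "service unavailable"),
    Ticket("T2", "client-b", "web-hosting", "service unavailable"),
    Ticket("T3", "client-a", "backup", "job failed"),
]
candidates = find_problem_candidates(tickets)
```

Here the two web-hosting incidents are grouped together even though they came from different clients, which is the pattern that would prompt us to open a problem record. The single backup failure is left as an ordinary incident.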
How we work
An engineer is assigned to the problem and they, along with any necessary additional experts, do the detective work to isolate the root causes of the incidents, which might actually be several steps away from the visible outcome that was reported as a fault. There might be a device in use that is failing intermittently without being detected; issues with the client’s hardware; software incompatibilities following a recent change; or a myriad of other possibilities. Sometimes it’s a bit like CSI as we narrow down the possibilities and test out theories.
Once the root cause is established, a solution is fed into the Change Management process we’ve already described. We then put the problem into a monitoring phase. In other words, the engineer responsible can move on to other jobs, but keeps a watchful eye out to determine whether our detective work was successful, or if we need to re-open the case.
The effects of a successful intervention can be quite dramatic. If we discover that the root cause of the hypothetical client’s service disruption means that we need to make a change to our standard server builds, then that change affects all our current and future clients and should significantly reduce the number of incidents occurring. This makes it obvious why Problem Management is an important long-term investment: it’s very easy to get bogged down in day-to-day operations, but that can lead to engineers addressing the same or similar tickets again and again.
We can give a real-life example. Whenever we rebooted certain servers, as part of regular procedures, a particular switch connected to those servers raised an alert because a device it was expecting to see had suddenly become absent. This led to two tickets being created: one for the reboot and one from the switch. Fortunately, we were able to change the configuration of the switches so that they no longer raised an alert for server reboots, and immediately far fewer tickets were being raised on our systems.
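The fix in that example was made in the switch configuration itself, which we do not reproduce here. To show the underlying idea, the sketch below applies the same reasoning at a different layer: it decides whether a switch alert should become a ticket by checking whether it falls inside a planned reboot window for the connected server. Everything in it is an assumption made for illustration, including the `should_raise_ticket` function, the ten-minute window, the server names and the dates.

```python
from datetime import datetime, timedelta

# Assumed grace period after a planned reboot starts; purely illustrative.
REBOOT_WINDOW = timedelta(minutes=10)


def should_raise_ticket(alert_server, alert_time, planned_reboots, window=REBOOT_WINDOW):
    """Return False when the alert is explained by a planned reboot."""
    for reboot_start in planned_reboots.get(alert_server, []):
        if reboot_start <= alert_time <= reboot_start + window:
            # The device vanished because we rebooted it: expected behaviour.
            return False
    return True


# Hypothetical schedule: srv-01 is rebooted at 02:00 on 5 January 2024.
planned = {"srv-01": [datetime(2024, 1, 5, 2, 0)]}

during_reboot = should_raise_ticket("srv-01", datetime(2024, 1, 5, 2, 3), planned)
hours_later = should_raise_ticket("srv-01", datetime(2024, 1, 5, 9, 0), planned)
other_server = should_raise_ticket("srv-02", datetime(2024, 1, 5, 2, 3), planned)
```

An alert three minutes into the planned reboot is suppressed, while an alert from the same server hours later, or from a server with no reboot scheduled, still raises a ticket. The point is to remove only the duplicate that we already know the cause of, without hiding genuine faults.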
A reduced number of tickets is the ultimate metric by which we can judge our success at Problem Management. It means we are creating an improved service rather than simply tackling whatever comes up next. That gives more peace of mind to clients, who are seeing and experiencing a smoother service. It also means that, rather than repeatedly making the same fixes, our engineers are able to devote their attention to project work or assist with the development of new services.
As we go forward, more and more of our Problem Management will be proactive: anticipating where problems might occur and preventing them before they happen. Then, as technology improves, we can expect problem-solving to become automated in some cases. There are already very sophisticated sensor and control systems at work in our data centres. As artificial intelligence advances, it will allow us to automate much of the ‘legwork’ in tracking down the causes of problems, and eventually systems will start to solve problems for themselves as we enter the era of the ‘self-healing’ data centre.