Change Management without tears
Among many system administrators and engineers, there’s a strong suspicion that ‘change is the root of all evil’. Systems that are up-and-running, with changes locked down, don’t tend to go wrong very often. Indeed, most major incidents are created by change.
But change is also inevitable and hugely important. Hardware needs to be upgraded. The scope of what’s required from a system alters to suit new business requirements. New software packages are needed. Mandatory patches need to be applied. We at Navisite make improvements to our own cloud hosting platform to meet the changing needs of our clients. In fact, when managed correctly, change makes systems and applications more effective than before.
But to make a success of these changes when such circumstances occur, they need to be managed. We’ve already discussed incident management, which is about dealing with unplanned faults and outages. Change management is about planned alterations. It’s about managing change in a production environment in a way that minimises the risk of failure or unexpected outage.
Proper Planning Prevents Poor Performance
The key part of that – the part most likely to get your systems through a change without problems occurring – is the plan. At Navisite, we work to ITIL, the leading industry methodology for IT service management processes, and all of our engineers are ITIL-certified. As long as people stick to the process, the risks become very manageable.
It’s also about communication with our clients. If they request a change, we’ll create a plan and a schedule, and we’ll ensure they’ve signed off and are happy with the agreed plan before proceeding. We make sure changes are documented through both email and the Proximity Client Portal, which the client can use for day-to-day management of their environment. Timings vary according to the complexity of the change required. Something relatively simple, like adding a new route to a firewall, can probably be completed within a few hours of the change being requested. More involved requests, or those touching very critical parts of the client’s data – updating a SQL database, for example – take longer, with a period of intensive planning and, where possible, testing to make sure the plan is watertight.
We divide the planning into several parts to make sure everything is covered. The scoping process describes as fully as possible the nature of the required change. Our process forces the engineers responsible for the plan to describe the changes in such a way that they could be completed entirely by another person, if necessary, thereby ensuring that no detail has been forgotten.
Then there is the risk and impact assessment. What could go wrong? And what extra steps need to be taken if one of those things happens? At the same time, we work out the test procedure. Tests need to run at each step, and everyone with even the slightest connection to the system concerned needs to know to report any unusual or unwanted effects. (IT systems sometimes seem vulnerable to a ‘butterfly effect’ – when things go wrong, it’s often people several links down the chain from where the change is occurring who first see a negative impact. They all need to be aware and reporting in.)
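To make those planning parts more concrete, here is a minimal sketch of what a change-plan record covering the scope, risks, mitigations and tests might look like, written in Python purely as an illustration. The field names and example content are hypothetical; they are not Navisite’s internal format.

# Hypothetical illustration of a change-plan record; not Navisite's internal format.
from dataclasses import dataclass

@dataclass
class ChangePlan:
    scope: str              # the change, described so another engineer could carry it out unaided
    risks: list[str]        # what could go wrong
    mitigations: list[str]  # extra steps to take if one of those risks materialises
    test_steps: list[str]   # checks to run at each stage of the change
    observers: list[str]    # everyone connected to the system who must report unusual effects

plan = ChangePlan(
    scope="Add a new route to the client firewall (example only)",
    risks=["New route conflicts with an existing rule", "Traffic to an unrelated subnet is blocked"],
    mitigations=["Review the full rule set first", "Monitor traffic for a period after the change"],
    test_steps=["Confirm the new route is reachable", "Confirm existing services still respond"],
    observers=["Client application team", "Navisite network engineers"],
)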
The getaway plan
We then work out the timings and the back-out plan. The back-out plan deserves particular attention.
Let’s say we have agreed a three-hour change window with a client to carry out the changes. We need to set breakpoints within that window at which we must decide one of three things: that everything is proceeding to plan; that it’s mostly fine and we’re confident we can fix any complications within the time-frame; or that the procedure isn’t working and we need to implement the back-out plan to restore the systems to where they were before we started. In short, we have to weigh the risk and possible consequences of failure against the chance of success. This decision point shouldn’t fall at two hours, fifty-nine minutes, however. It’s typically just over half-way through the three-hour window, because backing out won’t happen instantaneously and the whole system needs to be live and tested again for end users before the end of the three hours. Many Navisite clients have their own third-party clients, and unplanned outages are simply unacceptable for those users.
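To make the timing concrete, here is a minimal sketch in Python of how that go/no-go breakpoint might be calculated. The durations are hypothetical examples rather than Navisite’s actual figures; the point is simply that the decision has to come early enough to leave room for a full back-out and re-test.

# Illustrative only: the durations below are hypothetical, not Navisite's actual parameters.
from datetime import timedelta

change_window = timedelta(hours=3)     # window agreed with the client
backout_time  = timedelta(minutes=45)  # time needed to roll everything back
retest_time   = timedelta(minutes=30)  # time to confirm the restored system is live for end users
safety_margin = timedelta(minutes=10)  # buffer for surprises during the roll-back

# The latest moment at which deciding to abort and back out is still safe.
decision_point = change_window - (backout_time + retest_time + safety_margin)

print(f"Go/no-go decision must be made within {decision_point} of starting the change")
# With these example figures that is 1:35:00 in, i.e. just over half-way through the window.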
Lastly, where possible, we stage a dress rehearsal of the whole change process on redundant systems in a test or pre-production environment – often the ones otherwise used for development or to supply extra capacity. We’ll go through every step and test at each point. If something doesn’t happen the way we’d planned, then clearly there’s something wrong with the plan, and we’ll go back to the drawing board before making any changes to your production systems.
We also use change management as a means to refine our processes. Changes are audited on a regular basis to get an overview of what’s happening across our systems. If we find, for example, that we’re making exactly the same changes for various clients on a regular basis, then we should probably look at making that adjustment part of our standard initial set-up. Those audits also help us to understand whether our team is performing well or whether extra training might be required.
At Navisite, we’re very confident in our change management process. Across the IT world, most problems are caused by human error rather than the technology itself. That’s why Navisite has rigorous, detailed process management in place – while there will always be people involved in changes (for the near future, anyway), we can take their individual foibles out of the equation through the consistent application of carefully created and refined processes.
Our clients aren’t just buying our technology and people skills; they’re also buying our process expertise.