CloudOps: How to Bridge the Gap Between ITIL and the Public Cloud
Large enterprises struggle to bridge the gap between their well-established (and effective!) ITIL processes and the new world of the public cloud.
Operating in that new world is known as CloudOps.
What Is CloudOps?
CloudOps is the practice of IT operations applied specifically to a cloud environment. It's a set of practices and processes to build, deploy and maintain applications and workloads in a cloud environment sustainably.
So how does CloudOps differ from traditional IT ops and why do enterprises struggle to bridge the gap?
The CloudOps Conflict
The source of the conflict between traditional ops and CloudOps is the pace of change.
The ethos of ITIL is to be risk-averse and to control change. You avoid failure at all costs, principally by limiting change. This worked for on-premises environments because it matched the way that servers were procured and maintained. You’d have the same server running for years, so it was important to keep it in a controlled state.
This is reflected in the kinds of heavy ITIL processes around making changes: change approval boards, raising tickets in relevant systems, going through various workflows and approval processes. There’s a lot of stuff that needs to happen before a change can be made.
But this doesn't work for the cloud!
In the cloud, we architect specifically for risk and failure. So change is accepted and expected. Cloud-native applications are designed to be fault tolerant and to change quickly. These expectations are baked into the fabric of the software itself from idea to deployment.
As a result, the cloud is all about rapid change. Cloud-native app dev teams might be expected to be able to deploy 50+ times a day. The declarative nature of the infrastructure, the ability to implement proactive fixes and the rapid feedback cycles required for iterative development mean that deploying change happens much faster than in the traditional world.
Going into another system and raising a ticket simply doesn’t work in this context. It slows the whole thing down. The dev team is impacted because they can’t get changes out on schedule...and this delay trickles right back down the stack.
An example of the problem of ITIL in the cloud world: imagine devs detect a problem in the live environment. They code the fix and want to deploy it straight away! But if they have to raise a request that must be manually approved by someone in a different department, the whole thing falls apart. They have to maintain artificial gates in their deployment pipeline to stop change going out while they wait for approval.
This defeats the whole purpose of the cloud! You’re no longer reacting at speed. By putting artificial barriers in place, you fail to take advantage of the technology you just implemented.
How to Go CloudOps: Transforming Your ITIL for the Cloud
Most companies have a very high-functioning and well-optimised ITIL organisation, with perhaps thousands of employees working in ITIL ways.
But these same organisations know that they need to make the transition to the cloud. This includes making changes rapidly.
So how do they use what they’ve got without getting in the way of the benefits of the cloud?
In this blog, I will drill down into how you can combine your existing ITIL organisational structure, processes and controls with cloud-native ways of working. I will also showcase how we’ve successfully adapted IT change, release and configuration management processes for our clients.
The key idea is that you have to create space for cloud-native ways of working within your existing ITIL organisation.
This space becomes a hook on which you can hang new behaviours that enable rapid change...while still keeping track of who is doing what, which teams qualify for new behaviours and which workloads can be changed at speed - so that you maintain the security and control you need.
We could call this the ‘continuous change model’, as distinct from the ‘standard change model’. Let me explain!
The Continuous Change Model
This model allows you to track changes effectively. It shows who is operating in the ITIL space and who has all the necessary automation in place to be able to go fast in the cloud.
This lets you keep track of your gradual transition from an ITIL-centric organisation to a cloud-centric organisation.
TL;DR? These are the three key takeaways:
- Allow the space for change to take place
  - In the old world we avoid change and put in controls to ensure that change is effected in a controlled way
  - In the new world we make space for change, we accept change and we architect our processes and applications accordingly
- Bake in quality from idea to operation
  - In the old world we have testing and hardening phases to ensure the quality of our applications
  - In the new world we bake quality into every commit through our CI & CD pipelines
- Bake in traceability from the outset
  - In the old world producing release notes and implementation procedures was laborious and inconsistent
  - In the new world we bake traceability for every change into our SDLC management tools, so that we know who made which change and why, in real time, at any moment throughout the life of the application
1. Create a New Space for Cloud-Native Change to Take Place
One of the ways we have experienced success in this space is by augmenting the base change types.
ITIL suggests three change types (according to ServiceNow documentation):
- Standard: “A standard change is a pre-authorized change that is low risk, relatively common and follows a specified procedure or work instruction.”
- Normal: “These changes require a full range of assessments and authorizations such as peer or technical approval, change management, and Change Advisory Board (CAB) authorisation, to ensure completeness, accuracy, and the least possible disruption to service.”
- Emergency: “This change is of such a high priority that it bypasses group and peer review and approval and goes straight to the Authorization state for approval by the CAB approval group.”
We suggest adding a further change type, “Continuous Change”. With this type, a request for change (RFC) is created automatically in the system of record (e.g. ServiceNow) and does not require review or approval.
This change type is reserved for teams which have proven an ability to successfully deploy change through the standard approach. These teams must have proven their ability to avoid failure by automating the necessary controls into their CI/CD pipelines. Changes of this type must also be kept small and made frequently.
This creates a space for teams that are ready to start working in cloud-native ways and rapidly deploying change.
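As a rough illustration of the idea, the four change types and their approval behaviour could be modelled as below. This is a minimal sketch, not a ServiceNow configuration: the `ChangeType` enum, the `NEEDS_MANUAL_APPROVAL` table and the `allowed_change_type` helper are all hypothetical names, and the fallback to a normal change for unqualified teams is an assumption about how you might enforce the rule.

```python
from enum import Enum


class ChangeType(Enum):
    STANDARD = "standard"
    NORMAL = "normal"
    EMERGENCY = "emergency"
    CONTINUOUS = "continuous"  # the proposed additional change type


# Hypothetical policy table: does a change of this type wait for a human?
NEEDS_MANUAL_APPROVAL = {
    ChangeType.STANDARD: False,    # pre-authorised, follows a set procedure
    ChangeType.NORMAL: True,       # full assessment and CAB authorisation
    ChangeType.EMERGENCY: True,    # goes straight to the CAB approval group
    ChangeType.CONTINUOUS: False,  # RFC is raised automatically, no review
}


def allowed_change_type(
    team: str, requested: ChangeType, qualified_teams: set[str]
) -> ChangeType:
    """Return the change type a team may actually use.

    Assumption: a team that has not yet qualified for continuous change
    is pushed back to the normal change process.
    """
    if requested is ChangeType.CONTINUOUS and team not in qualified_teams:
        return ChangeType.NORMAL
    return requested
```

Keeping the list of qualified teams as explicit data is what gives you the segmentation: at any moment you can see who is allowed to go fast and who is not.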
Adding this extra change type has two main benefits.
Firstly, it allows easy segmentation between changes which are truly cloud native and those which are monolithic/traditional. Secondly, it highlights which applications must ensure the correct CI/CD hygiene is in place, making it easy to focus attention on just those teams which are operating in the new paradigm.
Overall, it lets you track which teams are operating in the ITIL space and which qualify to be allowed to go fast in the cloud.
A Few Tips...
Experience has given me a few extra tips that are useful here.
Never Break Ranks
We have found that a number of our clients allow “extracurricular” behaviour to occur when an emergency change is raised.
Due to the severity or the urgency of a change, it is decided that it’s worth cutting corners and allowing some form of manual change to be made to the live configuration.
This has to stop in the cloud!
We advocate that, whichever type of change is made in the cloud, the same automated deployment process should be used. The fact that it is an emergency should only relate to the priority of the change. Everything else should remain the same.
We do not skip steps in the cloud-native world; we automate them.
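To make the point concrete, here is a minimal sketch in which the change type influences only the queue priority and never the steps that run. The step names, the `PRIORITY` values and the `plan_deployment` function are hypothetical placeholders for whatever your own pipeline uses.

```python
# Hypothetical pipeline steps; every change runs all of them, in order.
PIPELINE_STEPS = ("build", "unit_tests", "security_scan", "deploy", "smoke_tests")

# Assumption: a lower number means the change is picked up sooner.
PRIORITY = {"emergency": 0, "normal": 1, "standard": 2, "continuous": 2}


def plan_deployment(change_type: str) -> dict:
    """Build a deployment plan; urgency changes the priority, nothing else."""
    return {
        "priority": PRIORITY[change_type],
        "steps": list(PIPELINE_STEPS),  # identical for every change type
    }
```

An emergency change jumps the queue, but it still goes through the same build, test and deployment steps as everything else.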
This can affect the way in which application support teams need to remediate problems with cloud-native applications. We tackle this through a number of methods depending on the client’s particular situation. If we are working with a truly cloud-native application then the organisation should have taken steps to ensure that the team responsible for the ‘design’ and ‘build’ of the application will also be responsible for the ‘run’ aspects, thus removing the traditional Dev/App support team divide. This is for another article but we lean heavily on SRE principles in this space.
Analyse Your Platform/Application Structure
Having allowed the space for rapid change to take place, it is necessary to understand where this change type can be safely used. Breaking down your platform into logical components can help with this as follows:
This list is by no means exhaustive but serves to illustrate the types of components to consider as part of your CMDB entries.
Provide Change Type Guidance for Each Component
Having decided on the components of your system, we have found it useful to give our clients guidance on the different change actions that can be carried out on each platform component.
This opens up as much space for change as possible while still maintaining control over mission-critical components.
This lets you determine, at a more granular level, which components of your applications can be changed rapidly and which need to go through a slower ITIL process, because some things (e.g. ‘delete’ changes to a critical app such as your billing app) still need to be carefully controlled.
Getting our clients to think through changes such as ‘Add’, ‘Remove’ and ‘Update’ for each component helps facilitate conversations around these components and ultimately creates the change contract between the development teams and the operational teams. In addition, beginning to think about the frequency of change can help guide which change types are applicable to which component.
This analysis forms the minimum standards for use of the fully automated ‘Continuous’ change type. If there are any additional risks or anomalies with a particular deployment the development team should opt for the change type which requires greater scrutiny. The easiest way for development teams to lose the trust of the operational organisation and drive a wedge between Dev and Ops is to begin to use change types which are not suitable. If dev teams do this and deploy major change as a continuous change then the ability to deploy continuous change should be revoked. A postmortem should be carried out and a remediation plan created.
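One way to capture this guidance is as a simple lookup from component and action to the minimum change type. The sketch below is illustrative only: the component names and the rules in `GUIDANCE` are invented for the example, and the two assumptions (anything not listed falls back to ‘normal’, and a deployment takes the strictest type of anything it touches) are design choices you would agree with your operational teams.

```python
# Hypothetical guidance matrix: (component, action) -> minimum change type.
GUIDANCE = {
    ("web_frontend", "add"): "continuous",
    ("web_frontend", "update"): "continuous",
    ("web_frontend", "remove"): "standard",
    ("billing_database", "add"): "standard",
    ("billing_database", "update"): "normal",
    ("billing_database", "remove"): "normal",
}

# Higher number means greater scrutiny.
SCRUTINY = {"continuous": 0, "standard": 1, "normal": 2}


def change_type_for(component: str, action: str) -> str:
    """Look up the guidance; anything not listed gets the most scrutiny."""
    return GUIDANCE.get((component, action), "normal")


def change_type_for_deployment(changes: list[tuple[str, str]]) -> str:
    """A deployment takes the strictest change type of anything it touches."""
    types = [change_type_for(component, action) for component, action in changes]
    return max(types, key=SCRUTINY.__getitem__, default="normal")
```

For example, a deployment that updates both the front end and the billing database would be treated as a ‘normal’ change, even though the front-end update alone would qualify as ‘continuous’.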
Once you’ve created the space for change and earmarked which components are candidates for rapid change, it’s time to ensure that the necessary controls and automation are in place to allow teams to qualify for that change model.
2. Bake Quality in From Idea to Operation
Automating the necessary security and testing controls must become part of qualifying your team for the Continuous Change Model.
Development teams don’t automatically get carte blanche access to deploy-at-will just because they bounce around the term ‘cloud-native’. They need to do their homework. The CI stack becomes critical in this situation.
As with any software team, the CI stack is where we should be looking to implement our checks and balances to ensure the quality of our product. This is also the space in which we strive to bake in automated quality and compliance checks. The more regulatory controls that can be baked into the CI stack the better. We advise that our clients actively seek controls which can be included as part of this process and we help them make that a reality.
Automation, automation, automation…
With the pace of change so high, automation is the only way in which change can be made if we wish to maintain efficiency through the operational departments. Removing the traditional manual checks is only possible for teams which can prove that they have replaced each of the relevant controls with automated checks in the CI stack.
We suggest the following as a minimum requirement to qualify for ‘Continuous’ changes:
- Automated Infrastructure as Code
- Automated Peer Review
- Automated Version Control
- Automated Builds
- Automated Style Checking
- Automated Static Code Analysis
- Automated Security Testing
- Automated Unit Testing
- Automated Deployments
- Automated Behaviour Driven Testing
As you may have picked up, a lot of our recommendations involve automation :-).
We have guidance for each of these areas but that is outside the scope of this post.
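The minimum requirements above lend themselves to a simple automated qualification check. The following is a sketch under assumptions: the control identifiers are just the list items turned into hypothetical keys, and how you evidence each control (pipeline configuration, audit reports and so on) is up to you.

```python
# The ten minimum controls from the list above, as hypothetical identifiers.
REQUIRED_CONTROLS = frozenset({
    "infrastructure_as_code",
    "peer_review",
    "version_control",
    "builds",
    "style_checking",
    "static_code_analysis",
    "security_testing",
    "unit_testing",
    "deployments",
    "behaviour_driven_testing",
})


def missing_controls(team_controls: set[str]) -> list[str]:
    """Return the required controls the team has not yet automated."""
    return sorted(REQUIRED_CONTROLS - set(team_controls))


def qualifies_for_continuous(team_controls: set[str]) -> bool:
    """A team qualifies only when every required control is automated."""
    return not missing_controls(team_controls)
```

Reporting the missing controls, instead of a bare yes or no, gives a team a concrete to-do list for qualifying.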
Be a Good Change Citizen
It is incumbent on the development teams which are lobbying for the push into the cloud-native space to ensure that they have provided the necessary automation to allow for this to happen efficiently.
So that’s it then? Allow for higher rates of change, work out how your platform is structured, put the right automation in and we’re good to go? Almost…
3. Bake Traceability in from the Outset
You need to keep a record of all the things. This ensures traceability throughout your entire tech stack.
As change is occurring more frequently, it’s essential to be able to see what state the platform and systems are in at all times. For every change, you should be able to show:
- WHY the change was made (Jira ticket)
- WHO made, reviewed and approved the change (Git commit)
- WHAT was changed (Git commit)
- WHEN the change was made (CI/CD timestamp)
- HOW the change was implemented (CI/CD audit logs)
All of these properties should be readily available for each and every change. A full and transparent audit trail of all changes made is imperative when the pace of change is so high. If an incident occurs it must be easy for the responding teams to understand what changes have been made and when the fault started occurring. High quality instrumentation must be in place to ensure that this is possible, not only at the application level but also at the shared services and provisioning level.
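The five properties above can be treated as a single record that must be complete before a change counts as traceable. Here is a minimal sketch, assuming a hypothetical `ChangeRecord` shape; the field names and the example values in the comments are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    why: str                  # Jira ticket key, e.g. "OPS-123" (hypothetical)
    who: str                  # author/reviewer taken from the Git commit
    what: str                 # Git commit SHA
    when: Optional[datetime]  # CI/CD timestamp of the deployment
    how: str                  # link to the CI/CD audit log


def missing_fields(record: ChangeRecord) -> list[str]:
    """List the traceability fields that are empty or unset."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def is_traceable(record: ChangeRecord) -> bool:
    """A change is traceable only when all five properties are present."""
    return not missing_fields(record)
```

A check like this could run as the last step of a deployment, so that a change without a ticket or an audit log link is flagged immediately and not discovered during an incident.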
Implement Automated Legacy Integration
You’re unlikely to be able to ignore existing legacy operations management systems, at least in the first instance. When you first move into the cloud with your first cloud-native workload, you may represent <0.1% of the entire operational estate. As such, it is often necessary to integrate with the existing processes and tools. We suggest that if a system has a modern API then steps should be taken to integrate with that API to allow automatic updates to the legacy system.
In the event that the legacy system does not expose a modern API, we suggest looking for alternative solutions. If a vendor has not taken steps to expose such an API, it’s likely they have major architectural issues and we would advise our clients to steer clear of them!
One integration we often encounter is that which allows us to link WHY things have been changed with the rest of the change record. In this instance we suggest integrating between the likes of Jira/Rally and ServiceNow using the available APIs.
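As a hedged sketch of what such an integration might look like, the function below builds (but does not send) a request to the ServiceNow Table API to create a change record that carries the Jira key. The instance name, the choice of fields and the `"continuous"` type value are assumptions: a custom change type would first have to be configured in your instance, and authentication is deliberately left out.

```python
import json
import urllib.request


def build_rfc_request(
    instance: str, jira_key: str, summary: str, commit_sha: str
) -> urllib.request.Request:
    """Build a POST request that would create an RFC linked to a Jira ticket."""
    payload = {
        "short_description": summary,
        "type": "continuous",        # assumed custom change type value
        "correlation_id": jira_key,  # carries the WHY into the change record
        "description": f"Jira: {jira_key}\nCommit: {commit_sha}",
    }
    return urllib.request.Request(
        url=f"https://{instance}.service-now.com/api/now/table/change_request",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
```

In a real pipeline you would add credentials and send the request with `urllib.request.urlopen` (or your HTTP client of choice) as a step after a successful deployment, so the system of record is updated without anyone raising a ticket by hand.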
By expanding your ITIL processes to include something like a ‘Continuous Change Model’, you can safely and successfully execute and track your transition from a more risk-averse ITIL perspective to a change-centric cloud setup.