23 KILLER Cloud-Native Development Principles and Practices
What are the best ways of being cloud-native in practice?
Following on from my introductory blogs on cloud-native architecture and cloud-native software development it’s time to look at some really powerful principles and practices to bring cloud-native to life.
So let’s jump straight in. In no particular order, here are 23 of my top cloud-native principles and practices!
- View Software as a Service You Are Delivering to the Customer
- Use Feature Toggles/Flags to Open New Possibilities
- Canary Releases Keep Production Safe
- Ensure Your Release Pipeline Is Quicker Than a Greased Badger
- Use Trunk-Based Development for Max Deployment Frequency...If You Dare
- Tiny Commits Help You Go Faster
- Only Have a Production Environment
- Use Continuous/Synthetic Testing to Test New Features
- Test in Production to Massively Increase Confidence While Moving Fast
- Microservices and Containers / Functions and Serverless: The Cloud-Native Architectures of Choice
- Stateless Microservices Accelerate Change
- Observability (Monitoring, Tracing, Logging etc.) Lets You Know What’s Going On Everywhere
- Distributed Tracing and Logging Track Problems Through Your Entire Tech Stack
- Cross-Functional Teams Ensure End-to-End Accountability
- Choreography Makes for Resilient Event-Driven Architecture
- Code for Resilience!
- Code for Longevity!
- You’re Only ‘Done’ When It’s In the User’s Hands!
- Practice Breaking Things on Purpose
- Ensure You Have an Exquisite Requirements Elicitation Process
- Don’t Fear Failure; Or, Develop Effective Innovation Cycles
- Look After Your Team!
- Take Your Time!
The Killer Cloud-Native Principles and Practices!
#1 View Software as a Service You Are Delivering to the Customer
This is the big one.
Software is a means to an end. The end is the service to the customer.
It’s a shame that ‘SaaS’ has already been coined, otherwise it would have been the perfect synonym for ‘cloud-native’.
Every team should be using the power of the cloud to deliver the best possible service to the customer. It’s not about vanity tech projects or cool languages.
If it doesn’t serve the customer, ditch it!
The rest flows from there.
#2 Use Feature Toggles/Flags to Open New Possibilities
Feature toggles (sometimes called feature flags) are central to many of the cloud-native principles.
They are a way of changing the behaviour of your software without changing your code. You can turn certain features of your application on and off easily, deploying code but keeping certain features hidden from users until they are ready.
This means that deploying to production is no longer synonymous with releasing software to users.
So you can commit to prod without worrying that you’ll break something or provide a poor service.
With feature flags, you can deploy a newly developed feature straight into production! You can do all the testing and tweaking you need to do there.
This means you don’t have to keep promoting a feature through all the different environments (nor host those additional environments).
All the time you save gets contributed towards faster delivery.
It also means that you can build in the option for certain features and then build them up - in production. This helps you to dev at the pace of design!
(Assuming your pipeline is lightning-fast...see #4!)
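To make that concrete, here is a minimal sketch of a toggle in Python. Everything in it is an assumption for illustration: the `FLAGS` store, the `new_checkout` flag and the `is_enabled` helper are hypothetical names, not the API of any particular feature-flag product.

```python
# Hypothetical in-memory flag store; a real system would load this from
# a config service or database so it can change without a redeploy.
FLAGS = {
    "new_checkout": {"enabled": False, "allow_users": {"dev-alice"}},
}


def is_enabled(flag: str, user_id: str) -> bool:
    """Return True if the feature is on globally or for this specific user."""
    cfg = FLAGS.get(flag)
    if cfg is None:
        return False  # unknown flags default to off
    return cfg["enabled"] or user_id in cfg["allow_users"]


def checkout(user_id: str) -> str:
    # Both code paths are deployed; the flag decides which one a user sees.
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"
```

The new flow is live in production, but only `dev-alice` can reach it until someone flips `enabled` to `True`.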
#3 Canary Releases Keep Production Safe
A related technique is to use feature toggles to perform canary releases: turning on a specific feature only for a small percentage (or specific subset) of your user-base.
The canary in the mineshaft!
This limits the blast radius of any new feature and lets you easily roll back if something isn’t quite right.
Cloud-native tools like AWS API Gateway or Lambda have this feature straight out of the box.
So if you release a new version of a function, AWS will deploy it and slowly route traffic through from the old to the new function. It will increase the number of requests until it either reaches 100% of traffic or you get an alert (e.g. an error or a slow response time).
So if you accidentally code something that increases response time...you would release to (say) 5% of users before the response time hits (say) 250ms at which point the alarm is raised, traffic is routed back to the original function and the new one is torn down.
And the user doesn’t notice anything is off!
All while you’re off making a cup of tea.
This is a critical advantage that allows you to try things out at speed while protecting your user base from failure.
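The ramp-and-roll-back loop can be sketched in a few lines of Python. This is a simulation of the idea, not AWS's API: `measure_latency_ms` is a hypothetical callback, and the traffic steps are made up (the 250ms threshold is the example figure from above).

```python
from typing import Callable, Sequence, Tuple


def run_canary(
    measure_latency_ms: Callable[[int], float],
    steps: Sequence[int] = (5, 25, 50, 100),
    max_latency_ms: float = 250.0,
) -> Tuple[str, int]:
    """Shift traffic to the new version step by step.

    measure_latency_ms(percent) is a hypothetical callback returning the
    latency observed while `percent` of traffic hits the new version.
    Returns ("promoted", 100) or ("rolled_back", percent_at_failure).
    """
    for percent in steps:
        latency = measure_latency_ms(percent)
        if latency > max_latency_ms:
            # Alarm raised: send all traffic back to the old version.
            return ("rolled_back", percent)
    return ("promoted", steps[-1])
```

With a healthy release every step passes and the new version is promoted. With a slow one, the very first step (5% of traffic) trips the threshold and the rollout stops there.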
#4 Ensure Your Release Pipeline Is Quicker Than a Greased Badger
Slow pipelines are death by a thousand cuts.
You push code, then wait five or ten minutes for it to build. Then you wait three minutes for it to deploy…then wait for this and that...it all adds up!
When it takes even a little while to push, you’ll end up bundling up the day’s code into one commit at the end of the day to avoid the bother.
And whatever slight barrier comes between you and deployment makes it that much more unlikely that you will do small commits (see #6).
And bigger, less frequent commits are a danger to the kind of rapid, experimental development that is the aim of cloud-native ways of working.
But how do you accelerate your pipeline?
The best way is to break your code up into microservices or functions (see #10).
When your codebase is smaller and more distributed it means that build and test cycles are much faster.
Note that this will not work if you have a QA team at the end of your pipeline wanting to see everything before pushing to prod. This is the ultimate bottleneck!
If you have human processes (other than manual approval) then everything is slowed down. So you’ll need some additional hacks to automate as much as possible:
- CI/CD: automating the integration and delivery of code
- Automated testing: tests should be codified as much as possible into the pipeline (see #9)
- Static Code Analysis: automatically checking source code structure before it is compiled
- Compliance as code: bake your security and compliance requirements into your code
- Good application architecture: use microservices, functions, event-based (see #10)
- Observability: track key metrics so you can see how quick your pipelines actually are (no guesstimating!) and improve that over time (see #12)
- Feature toggles: then you can put your QA into prod! (see #2)
A super fast pipeline is the foundation for your cloud-native SDLC. Without it all the practices and principles lose their magic!
#5 Use Trunk-Based Development for Max Deployment Frequency...If You Dare
The above principles, taken to their logical extreme, result in trunk-based development.
This sees even the smallest of changes automatically reviewed, tested and committed on the trunk. This is possible when you have pair programming practices in place, you fully trust your developers and your pipelines are good enough to be relied upon as the peer review of everyone’s code.
Most of the time companies have a peer review system. A dev creates a short-lived ‘branch’, creates a pull request, makes their changes and then gets someone to review before merging back into master (or the ‘trunk’).
With trunk-based development, there is only the trunk! No branches. Everything gets pushed straight onto the trunk, with machines automatically checking the code. Canary releases (see #3) are used to stop bugs reaching all of your users.
This is about the most extreme—but also the most impactful—way of increasing your deployment frequency.
But it depends on your risk appetite. There are many ways to put security at risk beyond just bad code!
Peer review can prevent more ‘human’ errors and (heaven forfend) maliciousness, ranging from spelling mistakes at one end to wilful sabotage at the other (e.g. code that rounds up financial transactions and sends the difference to an offshore bank account!).
#6 Tiny Commits Help You Go Faster
All other things being equal, the smaller your commits, the faster you can go (and the faster you can home in on what your customer wants!).
(Assuming you have a greased-badger pipeline! See #4).
The more commits you perform, the faster you get features to your users and the faster you find out about any issues: either something that doesn’t meet user requirements (see #20) or something that isn’t working properly.
Because you’re committing frequently it’s easier to understand where an issue is.
Commit 6. No problem.
Commit 7. No problem.
Commit 8. Problem!
You know where to look.
And the more stuff there is in a commit, the harder it is to find the issue.
This way it’s also much quicker to deploy updates and fixes. And these small increases in speed add up insanely over time, resulting in massively superior end products.
#7 Only Have a Production Environment
If you reaaaaallly want to go all-in on cloud-native: have only one environment.
People tend to build a lot of environments before production (dev, test, UAT, pre-prod etc.). But all the stuff before prod is only a de-risking device.
The only reason you have them is so that you can push changes that don’t affect production and that only your devs/testers can see.
The problem is that it slows down your pipeline. You get stuck in queues and have to repeatedly deploy.
Having the option of feature toggles and canary releases changes how you think about environments. You can push code straight to prod and achieve the same goal: only allowing your devs to see it.
Canary release = doesn’t affect production
Feature flag = only the chosen few can see
This means you can use tiny commits (see #6) to release and test tiny increments. So you might push a Lambda function live. Test it. All good. Then add an API Gateway that exposes that function. Then put a feature flag on API Gateway so only people on your intranet can see it...and so on…!
Ultimately, the only thing that matters is how software behaves once it’s in production.
Got performance tests doing well on the test environment? Who cares? I want feedback about how it works in production.
So why not release to production directly?
(See also #9 Testing in Production!)
#8 Use Continuous/Synthetic Testing to Test New Features
Continuous testing (sometimes known as ‘synthetic testing’) is the tactic of sending ‘synthetic’ (i.e. fake!) traffic to a system to test features.
By combining this with feature flags, synthetic data can be generated to monitor what would happen should a new feature be set to live. The UI behaves ‘as if’ the feature had been released, but real users are none the wiser.
Say you wanted to release a new chat feature. It passes all tests but the developer doesn’t know what effect releasing the feature will have on the back-end. By setting a feature to ‘synthetic’, the dev can see exactly what would happen if it were pushed live.
You can use synthetic traffic as your initial canary audience, so no real user ever gets errors.
This gives you a nice confidence-booster when releasing new features. By viewing telemetry and log data during a synthetic trial a dev can make a data-driven decision about whether or not to fully release a new feature.
Critically, this can all be done in production (see #7: Only Production), but you can also try it out now, in your other environments. Create synthetic traffic on a UAT environment and then trial some of these ideas to see if feature flags and canary releases prevent errors and bugs. Get better at lowering them on UAT and then move to doing it in production.
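One way to picture the ‘synthetic’ setting is as a third flag state sitting between off and on. The sketch below assumes that your traffic generator marks its own requests; the `synthetic` field and the `FlagState` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FlagState(Enum):
    OFF = "off"              # nobody sees the feature
    SYNTHETIC = "synthetic"  # only synthetic (fake) traffic sees it
    ON = "on"                # everyone sees it


@dataclass
class Request:
    user_id: str
    synthetic: bool = False  # hypothetical marker set by the traffic generator


def feature_visible(state: FlagState, request: Request) -> bool:
    """Decide whether this request should exercise the new feature."""
    if state is FlagState.ON:
        return True
    if state is FlagState.SYNTHETIC:
        return request.synthetic
    return False
```

While the flag is set to `SYNTHETIC`, the fake traffic exercises the new code path and shows up in your telemetry, and real users carry on with the old behaviour.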
#9 Test in Production to Massively Increase Confidence While Moving Fast
The ability to test in production is critical to cloud-native development.
I always say: “A test in production is worth ten in UAT!”
Why? Even being fully confident that your software works in UAT still carries the worry that when you get to production something might go wrong.
But if it works for you in production, then that’s the best possible indication that it will work for your users.
Most people don’t do enough testing because it’s seen as a barrier to speed that gets in the way of deadlines. But while testing may lengthen lead times in the short term, it gives you much greater confidence when you finally get to production.
This is where testing in production itself comes in. It gives you the best of both worlds: speed AND confidence.
How do you do that? Automation.
To demonstrate, let’s look at the testing pyramid.
As you can see, as you write more automated tests, you can balance the different types of testing better.
Most people I see are in mode 1: manual testing. This is slow, mainly consisting of end-to-end UI testing and normally includes regression tests for each change (imagine doing that when committing every hour!?).
When you get to mode 3, you have a nice balance of different kinds of test that yield the highest confidence that your software will work as intended.
And at the point where 95% of your tests are automated, you can start testing in production.
You do this by using feature flags. This means you don’t have to wait for the final approval before pushing to prod. You can START in prod, running all your (automated and well-balanced!) unit/integration tests while the feature remains hidden from users.
Note: you will always have some humans at the top of your pyramid. There are always certain things that computers can’t pick up like spelling mistakes. But that manual testing can also be done in production, where fixes can be deployed quickly (and because you’re already in production, you don’t need to move from test to prod. You’re already there!).
#10 Microservices and Containers / Functions and Serverless: The Cloud-Native Architectures of Choice
Containerised microservices and serverless functions are two architectural approaches that make it as straightforward as possible to enable all of the above principles.
Microservices and containers
Microservices are an architectural approach to application development where each feature is built as a standalone service and integrated together.
With microservices, apps are built as a collection of services, which pairs up perfectly with the distributed nature of the cloud. Each service can be hosted individually and treated as its own isolated ‘unit’ without needing to touch the rest of the application.
The benefit of this approach is that it becomes possible to build, test, and deploy individual services without impacting other services.
Even though a microservices architecture is more complex, especially at the start, it brings much-needed speed, agility, reliability, and scalability.
But managing your app as distinct microservices has implications for your infrastructure.
Every service needs to be a self-contained unit. Services need their own allotment of resources for computing, memory, and networking. However, both from a cost and management standpoint, it’s not feasible to 10x or 100x the number of VMs to host each service of your app as you move to the cloud.
This is where containers come in. They are extremely lightweight, and provide the right amount of isolation to make a great alternative to VMs for packaging microservices, enabling all the benefits they offer.
Functions and Serverless
Serverless is a computing model in which the cloud provider dynamically manages the allocation and provisioning of servers.
You only have to look after your code! Everything else happens invisibly and rather magically.
Serverless began mostly as function-as-a-service but has since expanded outside of functions to things like databases and machine learning.
It’s the “most cloud-native” that you can get because the back-end is entirely invisible and automated. It represents the lowest possible operational overhead currently available.
Serverless computing is a key enabler for microservices-based or containerised applications. It makes infrastructure event-driven, completely controlled by the needs of each service that make up an application.
The main benefit is that it allows developers to focus entirely on delivering value and enables them to release that value at pace via small, rapid deployments. Plus, the high degree of abstraction also allows you to completely break free from legacy architecture and processes that may have been preventing you from releasing at speed.
If you can jump right in with serverless, it can be a great option to kickstart your digital innovation. But don’t discount containers, which are very powerful and can still be very usable. And bear in mind that serverless and containers can be combined; it doesn’t have to be one or the other.
#11 Stateless Microservices Accelerate Change
With microservices, apps are built as a collection of individual services, that are each treated as their own isolated ‘unit’ without needing to touch the rest of the application.
The benefit of this approach is that it becomes possible to build, test, and deploy individual services without impacting other services. This massively accelerates the development lifecycle, allowing you to make many small changes at speed rather than applying big, slow changes (see #6 [tiny commits]).
BUT: If your microservices have state and they fail, that state will be lost. This will mess up your system.
If they are stateless, they can fail, you can launch them again, and they will pick up exactly where they left off.
So if your stateless microservice is processing data from a database and it dies, then it will start up again and resume where it left off. Not so with a stateful microservice.
This makes your microservices more similar to functions/serverless. When functions are dormant they don’t exist! And when they start up they know nothing of the world. Which means there’s less stuff that can stop them working as intended.
Statelessness also brings confidence. You know when things go wrong in your house and you’re told to turn it off and back on again? That’s clearing state and starting from scratch again. If your microservice is stateless it always starts from scratch and because of that you know that it will run the same way every time.
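Here is a rough sketch of what ‘picking up where it left off’ can look like. It assumes progress is written to external storage after every item; `CheckpointStore` is a hypothetical stand-in for a database or queue offset, and a real system would also need the save to be atomic.

```python
from typing import List, Optional


class CheckpointStore:
    """Stand-in for external storage (a database, a queue offset, etc.)."""

    def __init__(self) -> None:
        self.next_index = 0
        self.results: List[int] = []


def process_batch(
    items: List[int], store: CheckpointStore, crash_at: Optional[int] = None
) -> None:
    """Process items from the stored position, saving progress after each one.

    The worker keeps nothing in memory between items, so a fresh instance
    can take over from wherever the previous one stopped.
    """
    while store.next_index < len(items):
        i = store.next_index
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        store.results.append(items[i] * 2)
        store.next_index = i + 1
```

If the worker dies halfway through, calling `process_batch` again with the same store finishes the remaining items without redoing the earlier ones.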
#12 Observability (Monitoring, Tracing, Logging etc.) Lets You Know What’s Going On Everywhere
Cloud platforms offer an ever-rising tide of operability offerings.
There is a huge number of out-of-the-box tools and services for monitoring, logging, tracing and so on, along with in-built metrics for things like response times and failure rates.
This makes high-quality operability—knowing what’s going on, everywhere, all the time—the default!
Observability has two main benefits:
- You know when something has gone wrong!
Something’s broken? You get an instant Slack/SMS/email.
- Direct improvements over time.
You can’t improve what you can’t measure.
Do you know how long your CI/CD takes? We know we need a fast pipeline (see #4). With good observability you’ll be able to identify the ‘key constraint’ and get stuck in to improve it.
Don’t forget about your users in all of this. Make sure users are incentivised to give feedback.
#13 Distributed Tracing and Logging Track Problems Through Your Entire Tech Stack
The cloud is a distributed event-based system.
Distributed = the different parts are not in the same place
Event-based = every change to the system is recorded as an ‘event’ that can be consumed by other services/teams.
This has many advantages, but demands high visibility. In a monolithic application you can see what’s going on because it’s all in one place. In the cloud you need to know where all these distributed events are going!
You need to understand the consequences of something happening over here on something happening over there.
So one part of a system might make an API call that follows a path of different event-driven happenings before eventually resulting in an email being sent. If there is an issue with the email you need to be able to trace back the path of events to see where the error comes from.
For that you need distributed tracing and logging: i.e. understanding the path of events as they propagate through your system.
On one cloud adoption project at a large energy company, I set up a system so that every request had a unique request ID that could be tracked through the whole tech stack.
If something failed somewhere I would be instantly notified via Slack with the request ID along with time-stamped logs. I could see everything that had happened in relation to that request ID throughout the system...without having to do anything!
I could then instantly see where the error was. It would then take only a few minutes to deploy the fix and voila: prod is working again!
A massive tip to make the most of this is to make sure you have really good error messages.
I always say: an error message is like a gift to your future self.
A good one will not only tell you that something is wrong but why. For that reason it’s critical to log the context around the failure: the time and date, what decision preceded it, what data did I send back, etc.
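As a small illustration of both ideas (one ID per request, and error messages that carry context), here is a sketch using Python's standard `logging` and `contextvars` modules. It is not the system from the project described above; the `orders` logger, the `handle_request` function and the payload fields are all hypothetical.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(message)s")
)
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def handle_request(payload: dict) -> str:
    """Hypothetical handler: tags all logs for this request with one ID."""
    request_id = payload.get("request_id") or str(uuid.uuid4())
    request_id_var.set(request_id)
    logger.info("received order")
    try:
        total = payload["quantity"] * payload["unit_price"]
        logger.info("order priced at %s", total)
    except KeyError as exc:
        # Log the context around the failure, not just the fact of it.
        logger.error(
            "cannot price order: missing field %s; payload keys=%s",
            exc,
            sorted(payload),
        )
    return request_id
```

Every line logged while handling a request carries the same ID and a timestamp, so searching your logs for that ID gives you the whole story of the request in order.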
#14 Cross-Functional Teams Ensure End-to-End Accountability
A cross-functional team is a group that features whatever expertise is needed from across the business to ensure end-to-end responsibility for a given feature or product.
So you’ll have a team at (say) a company that hosts a music-streaming service that is responsible for the recommendation feature. They’ll have a representative from UI, back-end, API design, BA, security etc. Everyone is - together - fully responsible for getting that feature to production.
This team structure massively enhances communication and collaboration while ensuring accountability from start to finish. The rapid productivity that results is really useful for cloud-native ways of working.
It stands in distinction to the siloed teams that have traditionally been the norm: developers, operations, security, business and so on each working in comparative isolation. The difficulty in effectively handing over work from silo to silo (and the abandonment of accountability) is a major cause of delays.
But how to get from siloed to cross-functional teams?
You can’t just take all your dev teams, slam them into the ops teams and go “now you’re cross-functional”.
You will destroy many lives.
The best way is to create one team of superstars that makes all the mistakes. They can learn everything: how to architect your systems, how to manage security etc.
You can use this first team as mentors that you seed into other teams to spread cross-functional ways of working. Individuals from the first team become leaders in the second, third, fourth teams.
#15 Choreography Makes for Resilient Event-Driven Architecture
Most systems are programmed as follows: when A happens, do B.
This is known as ‘orchestration’. It’s a common part of cloud-native architectures.
In this context, A is in charge of what comes next, and so A must be aware of B (and also of C, D, E etc.). So you have to constantly update A on the state of B, C, D etc.
Choreography, in contrast, gives B the power to decide when it does anything. Because of this you can add C, D, E, F all the way to Z without having to change A.
This means that A is always going to work, because you’re not changing it from its current working state.
It also means you can try C or D and see if it works; if it doesn’t, you can easily tear them down and move on to E.
This choreography is done using an event-based mechanism, which is used to distribute your system into microservices and functions by creating a system “heartbeat”. Things can hang off that heartbeat and perform actions should they need to. The event system allows them to go down for periods (maybe for a release or bug fix) and when they come back online they can process events from where they left off.
Because they are so independent it means that they can be deployed in their own cycles, not waiting for other services to be released and having to sync up deployment times (one of the worst characteristics of a distributed monolith).
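A toy version of that heartbeat can be sketched as an in-memory event log. This is only a stand-in for a real event stream, and the `EventLog` class and its method names are assumptions for illustration. The publisher (A) only appends events; each subscriber (B, C, ...) keeps its own position and catches up when it is ready.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventLog:
    """Tiny in-memory event log; a stand-in for a real event stream."""

    def __init__(self) -> None:
        self.events: List[dict] = []
        self.handlers: Dict[str, Callable[[dict], None]] = {}
        self.offsets: Dict[str, int] = defaultdict(int)

    def subscribe(self, name: str, handler: Callable[[dict], None]) -> None:
        self.handlers[name] = handler

    def publish(self, event: dict) -> None:
        # The publisher only appends; it has no idea who is listening.
        self.events.append(event)

    def deliver(self, name: str) -> None:
        """Feed a subscriber every event it has not yet processed."""
        handler = self.handlers[name]
        while self.offsets[name] < len(self.events):
            handler(self.events[self.offsets[name]])
            self.offsets[name] += 1


log = EventLog()
emails: List[int] = []
audit: List[int] = []

log.subscribe("email", lambda e: emails.append(e["order_id"]))
log.publish({"type": "order_placed", "order_id": 1})
log.deliver("email")

# A new subscriber is added later; the publishing code is untouched.
log.subscribe("audit", lambda e: audit.append(e["order_id"]))
log.publish({"type": "order_placed", "order_id": 2})
log.deliver("audit")  # catches up from the start of the log
log.deliver("email")  # resumes from where it left off
```

Adding the `audit` subscriber needed no change to whatever publishes `order_placed`, which is the whole point of choreography.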
#16 Code for Resilience!
Coding for resilience means coding in a way that expects and anticipates failure.
In practice this looks something like this: if microservice A relies on microservice B and B goes down...A will not fail.
For example, ‘likes’ on Facebook are coded for resilience via an ‘optimistic user interface’. This means that when you ‘like’ a photo on Facebook, it will be registered on the front-end (by giving the little ‘thumbs up’) before it goes through to the back-end.
This means that even if the ‘like’ fails to register on the back-end, the system can keep trying until it goes through.
Critically, for the user it appears that the ‘like’ went through instantly regardless of whatever shenanigans need to go on in the back-end to eventually register the ‘like’.
The user’s experience is uninterrupted, despite a failure in the system.
And, ultimately, that’s all that matters: how your user feels when using your product. They want things to flow. So that engagement is maintained.
The frustration with systems is when users press a button and nothing happens!
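The optimistic pattern described above can be sketched like this. It is an illustration of the general idea, not how Facebook actually implements it: `OptimisticLikes` and the `send_to_backend` callable are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Set


class OptimisticLikes:
    """Show a 'like' immediately and keep retrying the back-end call."""

    def __init__(self, send_to_backend: Callable[[str], bool]) -> None:
        # send_to_backend is a hypothetical callable returning True on success.
        self.send_to_backend = send_to_backend
        self.shown: Set[str] = set()        # what the user sees
        self.pending: Deque[str] = deque()  # what still needs to be saved

    def like(self, photo_id: str) -> None:
        self.shown.add(photo_id)  # update the UI first
        self.pending.append(photo_id)
        self.flush()

    def flush(self) -> None:
        """Try to save pending likes; stop at the first failure and retry later."""
        while self.pending:
            if not self.send_to_backend(self.pending[0]):
                return
            self.pending.popleft()
```

The user sees the thumbs-up as soon as `like` is called. If the back-end is down, the like simply waits in `pending` until a later `flush` succeeds.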
#17 Code for Longevity!
Look at some code that you wrote five months ago.
If your immediate reaction is “Who the hell wrote this and what were they smoking?!”...you didn’t code for longevity.
There is a massive difference in your ability to understand the nuances of your code between when you’re in the thick of it and when you’re looking at it in a fresh state.
Coding for longevity means coding so that anybody can go back and understand it. This ensures that your code will sustain its usefulness even as personnel and systems change over time.
Peer reviews are an excellent way of ensuring that your code is legible.
#18 You’re Only ‘Done’ When It’s In the User’s Hands!
The definition of ‘done’, for almost everything, is when it is deployed and released to everyone in production.
Don’t settle for anything less!
Don’t spend forever pushing code to a trunk that has no prod environment behind it with at least synthetic traffic running through it.
Even if your product isn’t released yet, you can have a prod environment with life-like traffic running through it.
The moment you decide that prod isn’t the definition of done is exactly when all your processes will start slowing down as conflicting incentives start to arise.
Your dev teams will report really successful sprints with loads of tasks completed, but your users will see no change to anything.
Remember our first principle: providing a service to the user is everyone’s ultimate goal!
#19 Practice Breaking Things on Purpose
Become the chaos developer.
Purposely code a (small and non-impactful) bug and push it to master…see how far it gets before it gets stopped.
(Make sure you have a good reversion strategy!)
Did it get to production? Did it affect end users?
Analyse why it happened, figure out how to prevent it from happening again. Document your findings and keep them for other teams to read and learn from.
#20 Ensure You Have an Exquisite Requirements Elicitation Process
There is no point in building something using feature flags and continuous testing in a fully cloud-native serverless environment with maximum observability using cross-functional teams…
...IF IT DOESN’T GIVE THE USER WHAT THEY WANT!
Finding out the real, nitty-gritty requirements is a difficult task. Often people go straight in for the design, imagining that they know what the user wants and needs.
Designs are not requirements!
A requirement might be that a user must be able to find a product on an e-commerce website within ten seconds (say).
A search bar would be a design to meet that requirement.
The search bar is not the requirement! Finding the product is.
All functionality must be able to be traced back to the atomic requirements that sparked its design and creation.
#21 Don’t Fear Failure; Or, Develop Effective Innovation Cycles
Effective innovation necessarily involves a few critical things.
One of those things is not fearing failure! Failure does not need to have an adverse effect on users (especially if you’ve followed #20!)
Another is to feel empowered to try new things. Then learn from your mistakes.
Finally (and most importantly), SPEAK TO YOUR USERS!
Invite them into your office. Develop with them sitting next to you.
Or even become your own user!
With these pieces in place, you can develop an effective innovation cycle that might look something like the below: cycling between experiments, user feedback and learning, and eventually ending up with a user-centric working product.
#22 Look After Your Team!
Your team has a range of needs: physiological, security, belonging, self-esteem and (if you’re lucky) self-actualisation. (Check out my earlier blog post The Developer’s Hierarchy of Needs! for a more in-depth examination).
Sadly, fancy offices with foosball tables and fun things on the wall, while all well and good, don’t really contribute to the complex needs essential for an engineer to really thrive.
Often, they just end up perpetuating an endless string of distractions that limits the attention span of developers, for whom attention to detail is of the utmost importance.
Here are a few ideas.
- Ditch the open office/hot desking! Give developers their own comfortable personal space to work in. Let them pick their own laptop/keyboard/mouse/monitor/mug warmer/slippers etc.
- Create natural collaboration spaces: You can then encourage intra- and inter-team collaboration by creating natural meeting spaces along the main routes that people are ‘forced’ to take in their movements around the office (by the kitchen/toilets for example).
- Build team relationships: build those relationships beyond the day-to-day. Take your team out to an escape room or cookery classes or go-karting!
- Determine values and vision: create a vision for everyone to buy into. Co-ordinate this with values that your teams aim to uphold and use when hiring new members.
- Create a team manifesto: co-create “rules” with your teams as a “manifesto” or working practices, things everyone expects everyone else to abide by. Review and update them regularly.
- Document the journey: appreciate that building a system or a product is a journey and, whether successful or not, that journey can become a kind of folklore that people can feel a part of. This journey may never end: new chapters are always being written, and the members of your team are lead characters in it. Create a blog to document that journey, let everyone contribute and leave it accessible for those in years to come to read as part of their induction process.
- Appreciate that you don’t have all the answers!: Don’t be the alpha geek. You don’t know everything. Let your team adapt and find their own best practices. So long as everyone’s goals are aligned, your team will want to help make things better, but they won’t hang around long if you start barking orders at them. Coach them properly and show them the reasons behind the suggestions you are making.
#23 Take Your Time!
This is the final tip: take it easy! It’s a marathon, not a sprint.
The transition to cloud-native excellence is a long journey, not a sudden jump. Don’t expect your team to be able to do all of this from day one.
Slowly coach them through new processes.
Start by logging as much data as possible: lead times, Mean-Time-to-Recovery (MTTR), Mean-Time-to-Failure (MTTF), deployment frequency, customer satisfaction. Make it observable and watch it change over time based on the changes that you make.
Put these metrics on a screen on the wall of your office. Make sure it isn’t a snapshot of the current state but a view of improvements over time.
Once metrics are on the wall, attempt to change the parts that bring the most value first. Get everyone involved in improving these metrics. Make it part of people’s objectives but also explain WHY you’re doing it.
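If you are not sure where to start, two of these metrics are easy to compute from data you probably already have. The sketch below assumes you can export deployment timestamps and incident start/recovery times from your own tooling; the function names and input shapes are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def deployment_frequency_per_day(deploys: List[datetime], days: int) -> float:
    """Average number of deployments per day over a window of `days` days."""
    return len(deploys) / days


def mean_time_to_recovery(
    incidents: List[Tuple[datetime, datetime]]
) -> timedelta:
    """Average of (recovered - started) across incidents."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)
```

Run these over a rolling window (say, week by week) and plot the results, so that the screen on the wall shows the trend and not just today's number.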
Don’t just go around barking orders to do “trunk based development” to developers who have never done it before, using pipelines that are slow and insecure that deploy to multiple different environments before production.
And don’t delete all of your non-prod environments just because you’ve read this here. Try it out with your non-prod environment. Build synthetic traffic that best matches production and see how well you can deploy to it without causing failures.
Practice. Makes. Perfect.
Holy Cow. That’s a Lot of Stuff.
I think we’ll just leave it there. Good luck with your cloud-native adventures!