Marcus Maxwell

26 March 2019

8 Factors That Make for a Legendary Platform Team

In most enterprise transformation journeys, there comes a time when it makes sense to start centralizing services used by multiple teams within a single platform team. This can be driven by licensing constraints or by the wish to lower the operational overhead that results from every team having to maintain its own stack.

Either way, it is a big decision.

That is because, once you centralize, the SLA expected of the service becomes a lot stricter and any outage cascades to multiple teams. What’s more, it is important that the central offerings don’t become the de facto toolset. If a team wants to switch from your centrally managed Jenkins to a SaaS offering (CircleCI, Buildkite), then they should be allowed to do so.

So how can you ensure that your new platform team delivers?

This article is a high-level overview of a few of the key success factors that I have noticed in my experience working with many internal IT systems in the enterprise.

1) Make Sure You Centralize the Right Services!

The first step is to make sure that the services you want to centralize are suitable.

In my experience, the following services make sense to be owned by a central platform team. They are proven to scale well and provide a lot of collaboration opportunities.

SCM (GitHub Enterprise/GitLab)
Artifact storage (Artifactory, Nexus)
Knowledge and task management (Confluence/JIRA)
Code quality and security scanning tools (SonarQube, Veracode, Aquasec/Twistlock)
Management of central SaaS offerings (Slack, 1Password)

There are other services that are very challenging to centralize. The following are better served by providing a set of guides, best practices and deployment examples.

CI/CD systems (Jenkins, TeamCity): These are relatively easy to set up and are used daily by the dev teams, so even if you have a central one you will have to delegate most admin tasks to the team.
Platforms (Kubernetes, serverless offerings): Do provide guard rails for the platform, but don’t shoehorn a team into a specific way to use and grow it.
Most developer tooling: IDEs, text editors, terminals, linters, libraries, test frameworks, etc. should be chosen by the development team based on their needs.

2) Build the Platform with the Customer

One of the biggest mistakes organizations make is building a platform without talking to the prospective users and then suddenly announcing: “Here is a DevOps platform, go use it!”

This often leads to a lot of frustration and pain from the start. It shoehorns developers into the constraints of the platform and its various quirks. Instead of having a platform that works for them they end up having to consume a platform that requires advanced acrobatics to navigate.

Building the platform with the customers means having quick feedback loops on everything that your team is building. It is about being lean and not over-engineering.

3) Customers Will Require Support, Be Ready

Ensure your support teams (L1, L2, L3) are rock solid.

Your L1 and L2 are the first teams that will encounter the bulk of new users. This is where most of the issues with documentation and process can be found and where new requests come in.

What happens there reflects on your team as a whole, so a customer-facing mindset is critical.

Here are recommendations that can help you to achieve this:

Reduce the amount of time spent filling out forms: This applies to your team as well as your customers. Try to reduce the burden of filling out forms upon forms of data in the hope of doing some analytics later. Instead, try capturing most of the data as the request is received, by doing lookups on what is in the request and who it comes from. Particularly in the early days, you should lean more on triage than on hoping people will fill out forms correctly.
Make it clear when and how to escalate: Quite often a ticket will get stuck for days because someone is trying to fix it themselves without asking for help. This means you haven't explained to the team that they should escalate. Most (effective) ticketing systems support notifying L2/L3 if a ticket has been open for longer than a day.
Don't take your L1 for granted: In many cases L1/L2 are outsourced, but that doesn't mean it’s OK to treat the team as anything less or to give them the most mundane tasks to do over and over. What you will end up with is high turnover in L1/L2 and a lot of time spent training up new people, which is unfair to the team and your users. On the same note, you need to ensure that L3 is constantly getting feedback from L1/L2 and pairing with them to automate the procedures that occur most often, or to eliminate the need for them entirely.
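The escalation rule of thumb above can be sketched in a few lines. This is only an illustration: the Ticket record, the tier names as strings and the one-day threshold are assumptions made for the example, not the API of any particular ticketing system. In practice you would configure the equivalent rule in the tool you already use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Assumption: one day, matching the "open for longer than a day" rule of thumb.
ESCALATION_THRESHOLD = timedelta(days=1)

# Assumption: tickets move up one tier at a time; L3 has nowhere further to go.
NEXT_LEVEL = {"L1": "L2", "L2": "L3"}


@dataclass
class Ticket:
    """Hypothetical ticket record; real ticketing systems expose richer fields."""
    key: str
    opened_at: datetime
    level: str = "L1"       # support tier currently holding the ticket
    resolved: bool = False


def tickets_to_escalate(tickets: List[Ticket], now: datetime) -> List[Ticket]:
    """Return unresolved L1/L2 tickets that have been open longer than the threshold."""
    return [
        t for t in tickets
        if not t.resolved
        and t.level in NEXT_LEVEL
        and now - t.opened_at > ESCALATION_THRESHOLD
    ]


def escalate(ticket: Ticket) -> str:
    """Move a ticket up one tier and return the notification text to send."""
    ticket.level = NEXT_LEVEL.get(ticket.level, ticket.level)
    return f"{ticket.key} escalated to {ticket.level}: open for more than a day"


if __name__ == "__main__":
    now = datetime(2019, 3, 26, 12, 0)
    queue = [
        Ticket("PLAT-1", opened_at=now - timedelta(hours=30)),
        Ticket("PLAT-2", opened_at=now - timedelta(hours=2)),
    ]
    for ticket in tickets_to_escalate(queue, now):
        print(escalate(ticket))
```

The point of making the rule mechanical is that nobody has to decide to ask for help: a ticket that sits too long is surfaced to the next tier automatically.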

4) Make Life Easier For Yourself: Adopt Self-Service

Try to ensure everything is self-service.

In an organisation of 200 application teams (or more!) there is no way you can provide a quick and effective service for everyone if they need to raise a ServiceNow ticket or email your team every time they request something.

Self-service doesn't mean you need to have everything available to request from an API (or at least not from day 1). It might just be a Confluence page with the precise steps to follow to order a database, for example. You can then iterate on that and over time automate more and more of the documented steps. The biggest issue is that a lot of people in the organisation don't know how to get something or what is even available. Here are some practical tips:

Have a high-level wiki page for each product: This should provide an overview of everything that is available for that product. For example, if GitHub Enterprise were being provided centrally, you would set up a wiki page with details on how to get access, create a team, set permissions, share a repository with the wider organisation, set up hooks into other tools like Jenkins, etc.
Constantly get feedback from your customers: Where are they struggling? What takes them the longest? What would they like to see improved? Don't send out a questionnaire; go and see them in person.
Avoid creating an API of APIs: Quite often there is an urge to abstract a difficult-to-use API with another high-level one (for example, creating an API for ServiceNow that then calls another API to add a user to an AD group). This often leads to reduced functionality and reliance on yet another API with its own specification. Try to ensure native APIs are as usable as possible; otherwise, revisit the product you are using underneath.
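One way to picture the path from a documented page to gradual automation is a runbook in which every step starts out as a manual instruction and gets a script attached only once someone has written one. The sketch below is a hypothetical illustration: the "order a database" steps, the Step type and the reserve_database_name function are all invented for the example and do not describe any real tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One documented step; `automation` stays None until someone scripts it."""
    instruction: str
    automation: Optional[Callable[[], str]] = None


def run(steps: List[Step]) -> List[str]:
    """Walk the runbook and return one log line per step.

    Automated steps are executed; the rest are returned as manual instructions.
    """
    log = []
    for number, step in enumerate(steps, start=1):
        if step.automation is not None:
            log.append(f"{number}. [automated] {step.automation()}")
        else:
            log.append(f"{number}. [manual] {step.instruction}")
    return log


def reserve_database_name() -> str:
    # Hypothetical automation; a real version would call your provisioning tooling.
    return "reserved database name 'orders-db'"


# Hypothetical "order a database" runbook, mirroring what a wiki page would list.
ORDER_DATABASE = [
    Step("Pick a database name and check it is not already taken", reserve_database_name),
    Step("Ask the platform team for credentials in the support channel"),
    Step("Add the connection string to your application's configuration"),
]

if __name__ == "__main__":
    for line in run(ORDER_DATABASE):
        print(line)
```

Users get a single entry point from day one, and each step you automate later disappears from their to-do list without the process itself changing.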

5) Keep L3 Product Owners Happy and Productive

Make sure your L3 product owners are happy and are constantly being challenged.

Once you have been running your stack for a year or two, you will probably find that it is a lot less work to maintain the platform, particularly if you have nailed down the documentation and automation and have spread the knowledge about the tool to others in your team.

This is where you have to step in to ensure that the product owner for that tool has other avenues for growth. So if they maintain your CI/CD tool, maybe it’s time for them to get more familiar with the SCM. Or if everyone on the team knows the tools pretty well, then start exploring migrating to containers or reconsidering the tool altogether and migrating to a different one.

It is incredibly challenging to find someone who knows your systems inside out, has the personal relationships inside the organisation to debug across the stack and is an expert in their particular tool. There is no handover you can do that will resolve this. The only way to mitigate this problem is to have more than two people at all times engaged in maintaining a particular product.

6) Communicate! PR! Marketing!

Don't forget marketing!

Internal platform teams usually forget to market what they are doing, or give it too little attention. Quite often they will update a tool and won't even send out a changelog of the new features that are available. Any time you upgrade a tool, release new functionality or patch some systems, make sure you send out comms to users that explain what they stand to gain, e.g. “You can now use containers for your builds!” or “GitHub Enterprise now integrates with our internal JIRA! Huzzah.” Make sure to also include the how-to with your newsletter, so that users don't have to email you to ask how to use the new functionality.

"If a tree falls in a forest and no one is around to hear it, does it make a sound?". If nobody hears what you are doing, then you will get your funding cut and nobody will understand why your team exists.

7) Use Retrospectives and Feedback

Re-evaluate your decisions on a regular basis.

Quite often a team will keep on using the same tool for years even though the industry has moved on and the users are demanding other options.

Do not succumb to the ‘sunk cost fallacy’: if it’s not fit for purpose, move on. What worked a year or two ago might not work any more. Our industry moves at quite a rapid pace, vendors come and go, and if your users have new use cases that are not catered for by your tool then you need to evaluate new options.

It doesn't matter how fast your car is if everyone has moved on to flying cars!

8) When Problems Arise, Stay Open

If there is an outage, be proactive and open.

An outage is very stressful to your team and to the users. Nothing works and everyone starts to complain and panic. At this time the most important thing is not solving the issue. The most important thing is how you communicate.

Are you regularly giving updates and communicating when another update will go out? If not, then users will feel ignored and will get progressively more angry and impatient. Once you have found the issue and resolved it, it is equally important to have a post-mortem that is honest and includes what actions you can take to prevent it from happening again.

Get in Touch!

These are only a few of the things I have observed that need to be taken into account when running an internal platform team, but of course there are many others that I haven't covered: How do you deal with scaling to terabytes of data? How do you deal with the vendors? How do you handle upgrades? How do you handle backups and DR scenarios?

If you would like to explore this topic further or have any questions, please let me know!
