Chris Young

16 June 2022

5 Important Lessons On Data Migration You Need To Know

Migrating data from a legacy, on-premise estate into the cloud is one of the biggest challenges of enterprise data.

Data is entrenched in multiple places, it’s rarely clean, and it has become business critical in its known silos. Just like that key we don’t recognise in the back of the kitchen drawer, there are also sets of data we are sure must be important but just don’t know where they fit.

Data cleansing, domain mapping, and migration projects are long, dirty, and expensive. So how do we solve these problems regularly at Contino for our customers? Enter the key concept of handling big data itself: we break the problem down into tiny pieces.

In this blog, we will cover the process of managing a data migration project, what to look out for, and what you can and should discard in the interests of achieving your overall goals.

What Is Data Migration?

A data migration comes in a few different forms: moving an on-premise data warehouse to the cloud, moving applications and their associated data, or bringing data together in a data lake or data mesh.

An enterprise data migration project is often triggered by one or more of these three pain points:

Speed: The business feels it is slipping—or at risk of slipping—behind its competitors as its key data is buried in legacy systems, making it slow or impossible to perform analytics or to offer new data-centric products to its customers at pace.
Cost: The IT infrastructure is housed in an on-premise data centre, which has its own roadmap to reduce its footprint, meaning the applications need to be removed.
Access: Data is siloed in so many places that it is simply not practical to join it together for analytics, or to identify a single source of truth for key data. Controlling access to sensitive data is time-consuming.

You might recognise some of the following typical solutions to these problems that I regularly see when talking about data migration with clients: data warehousing, data marts, replatforming, APIs, cloud migrations, master data management (MDM), application rewrites, change data capture (CDC), data mesh, domain modelling…

Each of these has its place, but always remember it’s not what you implement as a solution that counts—it’s how you go about it that makes your data migration a success or a failure.

4 Benefits of Migrating Data

Some obvious and some less obvious benefits that I see as a result of clients’ data migrations:

Speed: cleaning out data, software, and processes makes you leaner and faster to react.
Reducing fixed costs: moving out of data centres and removing licences changes how you pay for your data and technology, which results in savings from that particular fixed cost budget (CapEx).
Talent: it can be easier to find people who know, or can pick up, the modern technologies common in data migration stacks, such as Python, Node.js, Spark, Scala, and cloud platforms, than to scale up teams supporting legacy code.
Answering new questions: traditional row-based relational data is really fast for serving the applications it sits behind, but is much less ready to support complex analytics in combination with other data.

Wait, that’s a pretty short list isn’t it? Well, yes it is. Don’t underestimate the value of these four benefits, but equally don’t overestimate what a data migration can do for your business.

Take a look at some of the reasons I have heard clients give for their data migrations, and note the absence of any mention of business value. I often see these kinds of reasons for embarking on a data migration project, but they’re simply not quantified sufficiently to be real benefits:

Our legacy systems are too slow to provide new features
The existing systems are full of bugs
The data centre / central infrastructure team is too unresponsive to fast moving business needs
We need to standardise our data across n legacy systems
We need a single source of truth for a customer / brand / asset
This company needs to use ML and AI
Our data is all in silos
I want us to handle streaming data
We have no concept of a domain in our data

Some of these are noble aims in their own right, but try to get to the root business value of these initiatives before spending time and effort on them (the five whys technique is great for this).

5 Lessons Learned From Over 20 Years of Migrating Data

1. Legacy Data Is Always Dirty. You Don’t Need to Clean it All!

I can predict, with pretty high certainty, that most of us will have performed a data migration in the last few years. I’m sure of this not just because you’re reading this article, but because you have probably changed your mobile phone in that time period.

What’s the first thing that happens after unboxing a new phone? At the very least, we sync our contacts from the old phone as well as photos, email settings, browser bookmarks, apps, and so on. This is straightforward today but do you remember when it wasn’t like this?

A few years ago, moving to a new phone meant importing contacts from the SIM, and also a lot of manual copying of other contacts and things like settings. Time consuming and annoying? Yes, but also a little cathartic as that naturally involved cleaning the data along the way—that phone number for Dave Plumber or the IT helpdesk at that place I used to work? No need to waste time copying those over! And so the data in my migration to a shiny new Nokia 7650 (Google it, possibly still my best ever phone) was cleansed along the way.

Today, I have contacts I will never make use of again, but guess what? It doesn’t matter. Data migrations to a new phone these days are better, even though I haven’t cleaned the data. They are better because they are fast and frictionless, and that is more valuable to me than a perfect data set.

Now, it’s easy to wrap this up in an everyday anecdote that proves my point, but let me put this in an enterprise context too…

Data created by and held in legacy systems—sometimes the result of M&A activity over the years—incrementally added to by countless application updates and often tightly coupled with business logic and application logic, is never clean.

This is important to understand: legacy data is always dirty.

In my 20+ years of shovelling data from one place to another, I’ve worked on a wide range of data migrations with varying levels of data cleanliness.

Examples of “Dirty Data”

A proprietary database where tables have rows only: The schema to parse the row contents was built into the application, all over it, including the data entry screens, and different modules within the application would use a different schema, meaning data in adjacent rows would follow different layouts. How easy is it to read data out of tables like that?
A database shared by different products on the same application: Changing the table signature, i.e. the schema, meant recompiling anything using it. The key tables for data such as customer and transactions simply couldn’t be changed without major disruption, so guess what the workaround was? That’s right, lots of spare fields (literally Spare Field 1, Spare Int 26, Spare Flag 6, etc)! But the different products using the database had different uses for those spare fields. The only way to tell what they were would be to trawl through the application code.
An instance per customer of an application, with configurable use of the database: each customer effectively had a different database schema within the same application.
A vendor-supplied daily data extract which would change schema retrospectively: data loaded yesterday could be re-sent with different columns.
Not to mention the plain old data quality issues: the customer who dared to have a three-digit age in a table that only allowed two digits; the customer addresses containing backslashes, commas, any special character you can think of; the vain attempts to list all possible titles a person might expect to see on their correspondence; the columns filled with invalid dates; US vs European date formats…

Each of these has presented considerable challenges when it comes to accessing or migrating that data from outside the original application.

Cleaning up most of these data sets at source would break those systems, given the very understandable fact that the applications using them were built around the state of the data. So don’t worry about cleaning all your data: sometimes it’s just not possible.
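
As a rough illustration of leaving the source alone, here is a minimal Python sketch that tidies dates on the way out and sets aside the rows it cannot parse, rather than fixing anything in the legacy system. The format list, the function names, and the choice to try European day-first dates before US month-first dates are all assumptions made for this example; an ambiguous value such as 03/04/2021 will be read as 3 April here, so the right order depends on your data.

```python
from datetime import datetime

# Hypothetical list of formats seen in the legacy data.
# Day-first is tried before month-first: an assumption, not a recommendation.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")


def parse_date(raw):
    """Return a date for the first format that matches, or None if none do."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def split_rows(rows, date_field):
    """Split rows into (clean, quarantined) without modifying the source rows."""
    clean, quarantined = [], []
    for row in rows:
        parsed = parse_date(row.get(date_field) or "")
        if parsed is None:
            # Unparseable or missing: keep the row as-is for a later decision.
            quarantined.append(row)
        else:
            clean.append({**row, date_field: parsed})
    return clean, quarantined
```

The quarantined rows are still available if a later use case needs them, but they no longer block the migration of everything else.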

2. You Don’t Need All Your Data

So what do we do with the data in these kinds of states? The answer is to focus on what is meaningful data. By all means take “Spare Field 6” and its value of ‘3’ or ‘S’ with you when you migrate the data but if you don’t need it to meet your goals, it’s ok to leave it behind.

The easiest way to reach your goals faster is to do less on the way there: prioritise the data you need, and you will spend less time migrating.

The breakthrough on getting data out of the system where each customer had a different database schema was to simply focus on the bare minimum of important data fields needed. We didn’t need the customer’s transaction and product history. Sure, it would have been nice, but the benefits couldn’t justify the cost of making it work.

What was critical though, was the customer’s current contact information and their current warranty state. Accepting the cut-down data set meant the problem could be simplified. This meant fewer fields to worry about and as they were mostly central parts of the database, the proportion of fields that had been customised by the customer was lower.
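
To make the cut-down approach concrete, here is a minimal Python sketch of extracting only the required fields, with a small override table for the customers whose schema differs. The field names, the customer identifier, and the idea that one customer keeps warranty state in a spare field are hypothetical, chosen only to illustrate the shape of the solution.

```python
# Fields the migration goal needs: contact details and warranty state.
# The exact names are hypothetical.
REQUIRED_FIELDS = ("contact_email", "contact_phone", "warranty_state")

# Hypothetical per-customer overrides: target field -> source column.
# Only customers who customised a required field need an entry.
CUSTOMER_OVERRIDES = {
    "customer_b": {"warranty_state": "spare_field_6"},
}


def extract_minimal(customer_id, source_row):
    """Return only the required fields, ignoring everything else in the row."""
    overrides = CUSTOMER_OVERRIDES.get(customer_id, {})
    result = {}
    for field in REQUIRED_FIELDS:
        source_column = overrides.get(field, field)
        # A missing column becomes None rather than stopping the migration.
        result[field] = source_row.get(source_column)
    return result
```

Because most required fields sit in the central, uncustomised part of the database, the override table stays small, which is the simplification described above.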

Key questions to answer before you start:

What data do you need to meet your goals?
What data can you leave behind (this could be for now or forever)?
And crucially, what are the goals of moving this data? Remember this has a business driver—it is not a technology decision

3. Not All Data Is Equal

I keep my passwords secure with a password manager and I keep my passport in a safe place…but anyone can see my employment history on LinkedIn. These things are all my data, but I treat them differently depending on the type of the data. Likewise, successful enterprise data migrations come from understanding how to treat different types of data.

Information surfaced for customers of your business, maybe ingredients in the food you sell, is vital to get right. The internal job history of your employees might be less so. So focus effort appropriately: make sure those ingredients listings are clean and up to date and spend time verifying this, but perhaps you can copy the employee history in bulk as unstructured data. It’s available for analytics at some point in the future but you haven’t used valuable time making it look perfect at the migration point.
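
One way to picture this split is a migration step that chooses its treatment by data set. The Python sketch below is a minimal example under assumed names: the data set labels, the two validation rules, and the JSON-lines output for bulk copies are all hypothetical.

```python
import json

# Hypothetical classification of data sets by how much care they need.
TREATMENT = {
    "ingredients": "validate",
    "employee_history": "bulk_copy",
}


def validate_ingredient(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not record.get("product_id"):
        problems.append("missing product_id")
    if not record.get("ingredients"):
        problems.append("empty ingredients list")
    return problems


def migrate(dataset, records):
    """Validate customer-facing data; copy everything else as raw JSON lines."""
    if TREATMENT.get(dataset) == "validate":
        accepted, rejected = [], []
        for record in records:
            (rejected if validate_ingredient(record) else accepted).append(record)
        return {"accepted": accepted, "rejected": rejected}
    # No cleaning or modelling: the data is simply kept for future analytics.
    return {"raw": [json.dumps(record, sort_keys=True) for record in records]}
```

The effort goes into the rules for the data customers see; everything else travels as-is.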

4. Take an Iterative Approach

Data migration projects attract a waterfall delivery mentality, perhaps because they can have relatively clear start and end points, and are often about solving a known set of problems. This leads to large programmes of work, following the well-trodden path of data ownership > cleansing > domain modelling > MDM > replatforming. But it’s these traditional steps that lead into a quagmire of never-ending data analysis.

How to turn a waterfall project into an Agile workstream:

Focus on the value the project wants to deliver
Understand the use cases involved and pick one to work on
Introduce a culture of experimentation
Make this one use case your Lighthouse project

Here at Contino we make a lot of use of Lighthouse projects. They are short-term, high-value projects that crucially show the value of what you're doing by producing results rather than telling via PowerPoint.

Example Client A

One client I worked with was involved in a multi-year project to move to the cloud, with an up-front architecture for data governance, tracking lineage, and data science segregation of activities between departments.

They were very thorough in the design, but deeply mired in analysis and concerns about cleaning data. Twelve months later, this enterprise was still designing and had no actual data insights ready for business consumption.

Example Client B

Another client set out a high-level cloud-first strategy and began onboarding data use cases from the start. They used some Jupyter notebooks and developed more and more features over time, as more use cases were brought on.

Twelve months later, and this enterprise had the beta of a cloud data science and analytics platform already turning out results for its internal customers.

Time and time again I see that the iterative approach gets value in front of your business sponsors and users faster, and allows you to drop ideas and features that don’t add value. It still lets you follow an overall strategy and build fantastic quality and security into your solutions, and this is what we at Contino are famous for.

5. Breaking Up Is Hard To Do

I’m going to mention understanding the business value of the project again. It’s that important. If your migration has a replatforming element to it, and if the value to the business is in part the cost savings in moving away from legacy systems, then there’s one important thing that must happen to realise that value.

Be brave.

Switch off the old system, switch off those servers, don’t renew the licences. These are factored in as significant benefits and drivers, or at the very least are mitigations for the cost of migration. And yet, with the exception of one investment bank, enterprises I have worked with find this the hardest part.

But there’s a question to answer here: how do you recreate a legacy system elsewhere?

Legacy is an easy term to use, but we’re not talking about an old pair of jeans we can replace immediately. Legacy software and databases actually mean years—likely decades—of feature-rich development, of bespoke design, of customer tweaks and targeted BI, of integrations to other systems, suppliers, clients etc. So how do you migrate all of this?

The answer of course is that you can’t, and you can’t expect that your project will be the exception. To drive the efficiencies of redesigning how you work with data, tough decisions on what comes with you, and what you say goodbye to, need to be made up front.

Clients I talk to about this find stakeholder buy-in hard in a typical enterprise landscape as the concept of a single, wise data owner for a domain is rarely seen in the real world.

Getting agreement to a grand switch off, where some parties are naturally going to be disappointed, is a painstaking exercise. This is why your complex data migration project needs an executive sponsorship that will drive those conversations to conclusion.

In Summary

A data migration on its own will not solve your business’s problems. Focusing on these key takeaways will help you make the best use of a migration, and give the best value to your business:

Truly understand the business drivers
Make sure you have genuine executive sponsorship to smooth those enterprise politics
Leave data behind if it suits your use case
Clean the data you absolutely have to, leave the rest as smelly as it comes
Beware of unquantified logic driving decisions such as data standardisation programmes
Lean towards answering new questions rather than rebuilding what you already have
Start small and scale up gradually
Be brave in switching off what you have replaced
And remember that speed wins

What’s your opinion? Do you see successful data migrations work with some of the elements I suggest removing, or are you struggling in the middle of a stagnating migration right now? Get in touch!