25 Data Science Best Practices to Deliver Ongoing, Reliable Business Value
Data science is an ever-changing beast! New packages, models, tools and platforms are constantly being released. It is easy to get caught up in these developments and neglect the stable best practices that endure independent of these changes.
In this article, I’ll cover some useful tips and tricks on how to efficiently derive ongoing and reliable value from data science across three key themes:
- How to ensure that data science projects deliver actual business value
- How to build and deploy robust machine learning models
- How to deliver a project efficiently, without wasting time or accumulating excessive technical debt
1. Deliver Actual Business Value By Being Outcome-Focused
Even if you’ve built a great machine learning model, if the output from that model is never used by anyone then the model is not delivering value. There are three crucial steps to ensuring a data science project delivers value:
- Foster stakeholder engagement
- Align with the stakeholders’ aims
- Consider how the output from modelling efforts will be used
At Contino, that means working with the stakeholders who will benefit from the outcomes and who have the influence to drive broader adoption and transformation.
Foster Stakeholder Engagement
Getting stakeholders engaged is crucial so that they can ensure the project delivers what they want and can then drive broader adoption of the outcomes. A simple step towards fostering this engagement is to identify your stakeholders and then run frequent (e.g. weekly) update sessions with them. These sessions should be transparent and offer:
- New insights to the stakeholders throughout the project (e.g. into their customers’ behaviour)
- An opportunity for the data scientists to better understand the industry and the data
- An opportunity for subject matter experts to sanity check findings
- Constant feedback to adapt the project, drive new ideas and ensure value is being delivered
- Greater data science understanding for stakeholders, enabling them to interpret and disseminate findings
Align with the Stakeholders’ Aims By Using Business Metrics
To ensure a project is aligned with stakeholders’ needs, we should try to understand the problems/opportunities a business/organisation is facing and the metrics they are trying to improve. These metrics should form the backbone of the project.
Measure key metrics early on
Before beginning any machine learning work, measure the core business metrics so that they can then be tracked following the project to see whether they improve. For example, if engaging in a churn prediction project to improve customer retention, you would want to measure the current retention rates, the cost and success rate of previous retention efforts, etc. This will then allow you to measure whether the project has helped improve these metrics, guiding whether the project should be scaled out.
Test for improvements in these metrics using experiments
Tracking whether these metrics improve over time should not be done passively. Metrics change naturally, for example due to seasonality. Experimentation is needed to establish that changes are due to your initiatives rather than these other factors. For example, if deploying a model to recommend new products to customers, only expose this model to a subset of customers and see whether their purchasing patterns change compared to a control group of customers who were not exposed to the new recommendation engine. This allows you to accurately estimate the impact of the initiative and the likely return on investment.
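As a rough illustration of the comparison described above, here is a minimal sketch of a two-proportion z-test between an exposed group and a control group. The function name and the counts are hypothetical, and it assumes the outcome is a simple yes/no per customer (e.g. purchased or not) with customers assigned to groups at random.

```python
import math

def two_proportion_z_test(conv_treat, n_treat, conv_ctrl, n_ctrl):
    """Return (uplift, z, two-sided p-value) for treatment vs control rates."""
    p_treat = conv_treat / n_treat
    p_ctrl = conv_ctrl / n_ctrl
    # Pooled rate under the null hypothesis that both groups behave the same
    pooled = (conv_treat + conv_ctrl) / (n_treat + n_ctrl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_ctrl))
    z = (p_treat - p_ctrl) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_treat - p_ctrl, z, p_value

# Hypothetical counts: 260 of 2000 exposed customers purchased vs 200 of 2000 controls
uplift, z, p_value = two_proportion_z_test(260, 2000, 200, 2000)
print(f"uplift={uplift:.3f}, z={z:.2f}, p={p_value:.4f}")
```

The uplift, multiplied by the value of a purchase and the size of the customer base, gives a first estimate of return on investment.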
Incorporate business-specific metrics into the objective function when modelling
Typically, supervised learning models are set to optimise business-agnostic metrics, such as accuracy (what proportion of cases are predicted correctly) or loss (how far the predicted outcome was from the true outcome). The downside is that you may then be training the model to perform well at predicting something the business does not particularly care about. For example, if missing a case of fraud costs 10x more than incorrectly labelling a case as fraudulent, this should be specified in the objective metric during model training. Depending on business needs, other factors such as model robustness or interpretability can also be considered when deciding on an objective metric.
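To make the fraud example concrete, here is a minimal sketch that scores predictions by business cost instead of accuracy. Note that it applies the 10x cost when choosing a decision threshold on held-out data, which is a simpler stand-in for building the cost into the training loss itself. The function names, cost values and data are all hypothetical.

```python
def total_cost(y_true, y_prob, threshold, fn_cost=10.0, fp_cost=1.0):
    """Business cost of classifying with the given threshold (1 = fraud)."""
    cost = 0.0
    for actual, prob in zip(y_true, y_prob):
        predicted_fraud = prob >= threshold
        if actual == 1 and not predicted_fraud:
            cost += fn_cost  # missed fraud
        elif actual == 0 and predicted_fraud:
            cost += fp_cost  # false alarm
    return cost

def best_threshold(y_true, y_prob, candidates, **costs):
    """Pick the candidate threshold with the lowest total business cost."""
    return min(candidates, key=lambda t: total_cost(y_true, y_prob, t, **costs))

# Hypothetical held-out labels and predicted fraud probabilities
y_true = [1, 1, 0, 0, 0, 1]
y_prob = [0.9, 0.4, 0.35, 0.1, 0.6, 0.3]
print(best_threshold(y_true, y_prob, [0.3, 0.5, 0.7]))
```

With missed fraud costed at 10x, the lowest threshold wins here even though it raises more false alarms.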
Use the Output to Create Business Value
As a final point on business value, a common failure point for projects comes once the data science work is done. While standalone deliverables are great for the stakeholders directly engaged in the project, they often never progress to other users. By integrating the deliverables into existing processes early on, adoption of the deliverables can be made frictionless, enabling far greater value to be derived from the project through broader adoption.
2. Build and Deliver Robust Machine Learning Models
There’s a large difference between building a machine learning model that performs well at predicting data in a test set during development and building a model pipeline that will continue to perform well at predicting data 12 months from now. How do you ensure your model is robust even to unpredictable changes in the data over time?
This can partly be achieved through focusing on model robustness in addition to model accuracy. Simpler models, for example, may be more robust due to fewer moving parts. But even the simplest models can drift and become less relevant over time. This is why it is also important to have tests (ideally automated tests) that monitor your models, identifying when issues occur that could degrade performance, enabling corrective action to be taken. Here’s an overview of some of the many things data scientists should be testing for.
Test for data quality
I’d consider data quality to be the most important thing to test for; models trained on unreliable data will produce unreliable results. Automated tests can be set up for common data quality issues such as incomplete data or anomalous discontinuities in the distribution of data. Nevertheless, data quality issues can be highly situation-specific. The best way to check for these rarer data quality issues is to inspect visualisations (not just summary stats) and present them to subject matter experts to catch issues that the data scientists might miss.
Test for data freshness
Data freshness refers to how recently your data has been updated. What happens to your model when it is deployed and it turns out that some data is only updated every week or month? When you trained your model, you assumed it would have up-to-date data but now it is relying on outdated information to make predictions, potentially degrading performance. For example, a model to predict customer churn might perform poorly if transaction data is a month old because it is no longer able to see which customers are actively transacting and which have lapsed.
During development, you should test the data freshness requirements of your model by explicitly cutting out the last day, week, month etc. from the test data and seeing how performance degrades. Once the model is deployed, have automated tests check that the data has been updated sufficiently recently.
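The post-deployment check could be as simple as the sketch below. The function name and the two-day limit in the usage note are hypothetical, and it assumes timestamps are timezone-aware.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(latest_record_time, max_age, now=None):
    """Raise if the newest record is older than max_age; return its age otherwise."""
    # Timestamps are assumed to be timezone-aware (UTC here)
    now = now or datetime.now(timezone.utc)
    age = now - latest_record_time
    if age > max_age:
        raise ValueError(
            f"Data is stale: latest record is {age} old, limit is {max_age}"
        )
    return age
```

In a scheduled pipeline you might call this with the maximum timestamp of each input table and something like `timedelta(days=2)`, failing the run before any predictions are made.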
You can also test for whether your data is partially lagged (e.g. sales staff backdating sales that they only enter at the end of the week) by recording snapshot summaries of your data and seeing if these summaries change at later dates. This can help ensure all of your data is up-to-date, not just some of it.
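A minimal sketch of the snapshot comparison, assuming each snapshot is a dictionary mapping a period (e.g. a date string) to a summary value such as total sales. The function name and data layout are hypothetical.

```python
def find_backdated_periods(earlier_snapshot, later_snapshot, tolerance=0.0):
    """Return periods whose summary changed between two snapshots.

    Periods that only appear in the later snapshot are new data, not
    backdating, so they are ignored.
    """
    changed = {}
    for period, old_value in earlier_snapshot.items():
        new_value = later_snapshot.get(period)
        if new_value is None or abs(new_value - old_value) > tolerance:
            changed[period] = (old_value, new_value)
    return changed

# Hypothetical daily sales totals recorded on two different days
monday = {"2024-01-01": 100.0, "2024-01-02": 80.0}
friday = {"2024-01-01": 100.0, "2024-01-02": 95.0, "2024-01-03": 60.0}
print(find_backdated_periods(monday, friday))
```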
Use data schemas
When writing your data cleaning pipeline during development you probably made some implicit assumptions about what the data would look like; sales values can’t be negative, age was recorded as an integer (no decimals), you assumed no missing values etc. You should explicitly state these assumptions in a schema and then validate that the data follows that schema in the production model pipeline.
If you believe your model should be robust to certain assumptions being violated (e.g. it should be able to cope with ages that have decimals) then explicitly test this during development by feeding your model fake data that violates these assumptions.
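Here is a minimal sketch of what stating those assumptions explicitly could look like, using a plain dictionary as the schema. The schema format and function name are hypothetical; dedicated schema validation libraries offer far more than this.

```python
def validate_row(row, schema):
    """Return a list of human-readable schema violations for one record."""
    errors = []
    for field, rules in schema.items():
        value = row.get(field)
        if value is None:
            if not rules.get("nullable", False):
                errors.append(f"{field}: missing value")
            continue
        if not isinstance(value, rules["type"]):
            expected, actual = rules["type"].__name__, type(value).__name__
            errors.append(f"{field}: expected {expected}, got {actual}")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: {value} is below minimum {rules['min']}")
    return errors

# The assumptions from the text: sales are non-negative, age is an integer,
# and nothing is missing
SCHEMA = {
    "sales": {"type": float, "min": 0.0},
    "age": {"type": int, "min": 0},
}

print(validate_row({"sales": -1.0, "age": 34.5}, SCHEMA))
```

The same function can be reused during development to confirm that deliberately broken fake data is caught.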
Test for data skew / concept drift
Even if your data follows the schema, data will inevitably drift over time. Patterns that your model learnt during training may no longer be relevant. For example, customer behaviour during an economic boom may look completely different to their behaviour during a recession. Test for data skew by storing a set of summary statistics (e.g. mean and standard deviation) about the training data each time the model is trained and then intermittently checking that the data being fed to the model in production shows similar summary statistics. It can also be worth testing whether the distribution of predicted values changes drastically, which can indicate model issues or genuinely important changes in the likely distribution of outcomes.
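A minimal sketch of that check for a single numeric feature. The function names and the threshold of half a standard deviation are hypothetical choices, and comparing means is only one of many possible drift tests.

```python
import statistics

def summarise(values):
    """Summary statistics to store alongside the trained model."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}

def has_drifted(training_summary, production_values, max_shift_in_std=0.5):
    """Flag drift when the production mean moves too far from the training mean,
    measured in units of the training standard deviation."""
    prod_mean = statistics.fmean(production_values)
    std = training_summary["std"]
    if std == 0:
        return prod_mean != training_summary["mean"]
    return abs(prod_mean - training_summary["mean"]) / std > max_shift_in_std

# Hypothetical feature values at training time and later in production
summary = summarise([10, 12, 11, 9, 13, 10, 11, 12])
print(has_drifted(summary, [11, 10, 12, 11]))
print(has_drifted(summary, [14, 15, 13, 16]))
```

The same function can be pointed at the model's predicted values to watch for shifts in the output distribution.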
Monitor your models to evaluate performance
As data drifts, model accuracy will tend to degrade. This is why it is important to reevaluate the performance of models on recent test data at frequent intervals. Trigger retraining of the model when performance drops. If retraining does not improve model performance sufficiently, this is a trigger for a data scientist to intervene.
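The decision logic described above can be written down in a few lines. This is only a sketch: the function name, the action labels and the allowed drop of 0.05 are hypothetical, and it assumes a metric where higher is better.

```python
def next_action(baseline_score, recent_score, retrained_score=None, max_drop=0.05):
    """Decide what to do after scoring the model on recent labelled data."""
    floor = baseline_score - max_drop
    if recent_score >= floor:
        return "ok"  # performance has held up
    if retrained_score is None:
        return "retrain"  # performance dropped, retraining not yet attempted
    # Retraining was attempted: redeploy if it recovered, otherwise call a human
    return "redeploy" if retrained_score >= floor else "escalate"
```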
Use train-validation-test splits and cross-validation
This is the test that data scientists will be most familiar with. If you haven’t come across these concepts, basically when training a model we want to hold out some data so that we can then evaluate the model on data it has not seen during training. This is a simple yet powerful way of checking that the model’s predictions generalise to other data rather than learning the patterns of a specific data set (overfitting).
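For anyone new to the idea, a minimal sketch of a three-way split is below. The function name and the 60/20/20 proportions are hypothetical, and a random shuffle like this assumes the rows are independent; time-ordered data usually needs a split by date instead.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve off test and validation sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(10))
print(len(train), len(val), len(test))
```

Tune hyperparameters against the validation set and touch the test set only for the final evaluation.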
Identify data leakage
Another concept I would hope all data scientists are familiar with is data leakage: information leaking into the training data that should not be there. If the model is accessing data during training that it will not have access to in practice, performance metrics will be misleading. Data leakage can occur in a lot of different ways, so testing will need to be context-specific. Any likely sources of information leakage identified during model development (e.g. duplicated cases in test and training set) should ideally be coded into automated tests when the model is in deployment.
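The duplicated-cases example lends itself to a very small automated test. This sketch assumes rows are dictionaries and that a combination of key fields identifies a case; the function and field names are hypothetical.

```python
def find_overlap(train_rows, test_rows, key_fields):
    """Return test rows whose key also appears in the training set."""
    train_keys = {tuple(row[f] for f in key_fields) for row in train_rows}
    return [
        row for row in test_rows
        if tuple(row[f] for f in key_fields) in train_keys
    ]

# Hypothetical rows: customer 2 appears in both sets for the same month
train_rows = [{"customer_id": 1, "month": "2024-01"}, {"customer_id": 2, "month": "2024-01"}]
test_rows = [{"customer_id": 2, "month": "2024-01"}, {"customer_id": 3, "month": "2024-01"}]
print(find_overlap(train_rows, test_rows, ["customer_id", "month"]))
```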
Don’t overlook other requirements
If the project has other requirements placed on it, ideally you will also develop automated tests to ensure those requirements are being met by the model. For example, if your model could negatively affect certain demographics through biased predictions, you should include tests for this, such as testing whether the correlation between these demographics and the predicted outcome is above a certain bound. Of course, algorithmic bias cannot be resolved through automated tests alone and is a topic worthy of its own articles, books and organisations.
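A minimal sketch of the correlation test mentioned above, assuming group membership is coded as 0/1. The function names and the bound of 0.2 are hypothetical, and a low correlation on its own does not show that a model is fair.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def check_bias(group_flags, predictions, max_abs_corr=0.2):
    """Return (passed, correlation) for group membership vs predictions."""
    corr = pearson(group_flags, predictions)
    return abs(corr) <= max_abs_corr, corr

# Hypothetical data: predictions track group membership almost perfectly
print(check_bias([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```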
3. Develop Models NOT Technical Debt
Efficient model development involves delivering outcomes within reasonable time frames without also developing excessive technical debt. To run truly efficient and impactful data science projects, businesses and organisations should be adopting DataOps and MLOps practices. Nevertheless, there are steps data scientists can take to make their projects more efficient.
Set up rapid, trackable experimentation
Before you start comparing any models, set up a pipeline that automates the data prep and model evaluation process. You should then be able to specify any model within this pipeline and immediately evaluate it. Every time you evaluate a model you should automatically track (e.g. using MLflow, Kubeflow or SageMaker Experiments) information about what the model was, what data it was fed, the hyperparameters and the performance. This allows you and others to later look back at what has been tested and which changes improved performance.
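To show the kind of information worth tracking, here is a deliberately tiny stand-in for a tracking tool that appends one JSON record per run to a file. It is not how MLflow, Kubeflow or SageMaker Experiments work internally; the function names and record fields are hypothetical.

```python
import json
import time

def log_run(path, model_name, params, data_version, metrics):
    """Append one experiment record to a JSON-lines file."""
    record = {
        "timestamp": time.time(),
        "model": model_name,
        "params": params,
        "data_version": data_version,
        "metrics": metrics,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

def best_run(path, metric):
    """Return the logged run with the highest value of the given metric."""
    with open(path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])
```

Calling `log_run` from inside the evaluation step of the pipeline means no run goes unrecorded, and `best_run("runs.jsonl", "auc")` answers the question of what has worked best so far.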
Store model pipeline artifacts
If your model pipeline takes a long time to run, you won’t want to run the whole pipeline again if there is an error. Store artifacts produced during the pipeline (e.g. cleaned data, model weights etc.) so that you can resume the pipeline from failure points or roll back to previous pipelines when needed. You should also record metadata about these artifacts so that you know when they are outdated and can version control your data. Storing cleaned data can also reduce repetition of work by making the cleaned data available for other projects.
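A minimal sketch of resuming from stored artifacts: each step's result is saved to disk and reloaded if it already exists. The function name is hypothetical, and this version deliberately omits the metadata needed to detect outdated artifacts.

```python
import pickle
from pathlib import Path

def cached_step(path, compute):
    """Load a step's artifact if it exists, otherwise compute and store it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

If the pipeline fails at the modelling stage, rerunning it picks up the cleaned data from disk instead of repeating the cleaning step.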
Ensure that development pipeline = production pipeline
To the maximum extent possible, you should work to have the pipeline used during development exactly mirrored in the pipeline used in production. This means that when a new model is developed it can be immediately deployed by pushing it to the production pipeline.
Plan for handover
If the data science project is put in production, chances are that further development or adaptation will be involved at some point in the future. If your code is not human-readable and well documented, the person responsible for this later work will read the existing code, fail to make sense of it and begin again from scratch or misuse the code due to misunderstanding.
Be wary of silent failures
Data analysis code frequently fails silently (it appears to run without any errors when in reality, some calculation is being performed incorrectly). This is why it is so important to keep code simple and interpretable. Additionally, it is why code reviews are so important as a means of identifying when code is not doing what it should be.
You might think that the categories of a variable are not going to change and so figure it is safe to hardcode that assumption into your code. Or you might figure a visualisation is just needed once-off so you might as well specify the values plotted in that visualisation rather than writing code to derive those values. Murphy’s law states that the inputs you decide to hardcode are exactly those that will change and need updating.
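One way to hedge against both problems is to make the assumption explicit so that it fails loudly when it stops being true. The sketch below encodes a categorical variable and raises on any category it has not seen, rather than silently dropping or mis-coding it. The function name and categories are hypothetical.

```python
def encode_categories(values, known_categories):
    """Map categories to integer codes, raising on anything unexpected."""
    mapping = {category: code for code, category in enumerate(sorted(known_categories))}
    unknown = sorted(set(values) - set(mapping))
    if unknown:
        raise ValueError(f"Unseen categories: {unknown}")
    return [mapping[value] for value in values]

# Hypothetical customer tiers
print(encode_categories(["gold", "bronze", "gold"], {"bronze", "silver", "gold"}))
```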
Profile your code
If you notice your code is running slower than expected, it can be worth profiling your code to identify how much time is spent on each section. On the topic of efficient code, I’d strongly recommend avoiding looping over your data (yes, that includes apply functions); a vectorised method is almost always available and will be a much faster and more scalable way of preparing your data.
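In Python, the standard library's cProfile module is one way to do this. The wrapper below is a small sketch; the function name and the default of five report rows are hypothetical.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, top=5, **kwargs):
    """Run func under the profiler and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the slowest call chains appear first
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()

result, report = profile_call(sorted, [5, 3, 1, 4, 2])
print(report)
```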
Try an autoML offering
There is a range of packages (e.g. H2O, TPOT, SageMaker Autopilot) now focused on automating the model selection process by testing out many combinations of models and hyperparameters and trying to select the best performing one. In my experience these autoML packages don’t give top-level results but they’ll often get close enough. If you are on a really tight development timeline and small increments to model accuracy are not crucial, give an autoML package a shot.
The tips and tricks covered in this article should help bridge the gap between just completing a data science project and delivering ongoing value from one. While I’ve focused on the tips and tricks I consider generally important, this is not a complete, definitive set of best practices. Different data science projects will have different needs and different aims. So my broader piece of advice, which underlies all of the tips listed here, is to focus on how your data science project will deliver value and work backwards from there.