How to Use AWS Glue Catalog to Empower Your Modern Data Governance

As a product owner, you want to determine the feasibility of a potential key new feature of your application by browsing its existing metadata. As a software engineer, part of your user story requires querying and then persisting data to a SQL database. As a data engineer, you have been tasked with creating an Extract, Transform and Load (ETL) process to transfer information from CSV files to a new NoSQL data store.

You fire up the cloud console of choice, read through various team and company wikis and then…the enthusiasm starts to drain. There is plenty of documentation, good code quality and data stores available, so what is the problem?

Data governance is often essential to support the data management strategy of an organisation. At its most fundamental, governance does not just restrict access to data, but provides consistent guidance on its original intent, nuances and considerations for use. However, such an important function is often hampered by a proliferation of differing, isolated processes and reporting, which becomes increasingly problematic as data collection grows. We often see people in the aforementioned roles, and others, struggle to answer even the most basic data questions, in organisations of all sizes across all sectors. They usually resort to time-consuming quests to find and consult various domain experts, and occasionally receive differing, outdated and conflicting advice.

It is not always obvious to the people closest to the detail that data discovery in particular is a problem, however. Other tell-tale indicators include:

Lengthy lead times to delivering data-related application features
The perception of Information Governance (IG) as a blocker, rather than an enabler
Expensive data onboarding processes
Difficulty in quickly and confidently answering data-driven business questions

AWS provides various services that can facilitate and automate consistent data governance functions and scale with demand. In this blog, we will focus on two of these functions: data discovery and data quality.

What is Data Governance?

The Data Governance Institute defines data governance as “a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.” It is traditionally applied to well-defined datasets but, as noted in Forbes, will increasingly need to apply to semi-structured and unstructured data. Stakeholders also now expect new products and services to use modern cloud-based features, such as serverless and Infrastructure as Code (IaC), to set up and scale on demand and add value quickly.

There are excellent enterprise-level products that can facilitate data governance, such as Collibra and Atlan, but they often require dedicated groups and a concerted effort to adopt consistent usage, and are often added later, rather than adopted from the start. Rather than delay data governance efforts to prepare for enterprise-level adoption, start small with tools provided by cloud vendors: we will focus on AWS here, but try Azure Data Catalog and Google Cloud Dataplex if your data resides there. With the right access, you can create a catalogue in minutes. You will often be shocked at what data crawlers find, and will struggle to resist the urge to immediately iterate, improve and reap the rewards. Cloud-native data catalogues can often be integrated directly into cloud-agnostic, enterprise catalogues, building on your great start and accelerating best-practice data governance, particularly in a multi-cloud organisation.

How Can AWS Glue Improve Data Governance?

AWS Glue is a “serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.” It is simple to set up from the AWS Management Console or manage in an IaC tool such as Terraform, and provides multiple features that can empower various data governance stakeholders. Let’s use this simple architecture to show how healthcare data could be managed:

The initial challenge with data governance is understanding what data is collected. AWS Glue Catalog can capture the metadata of multiple different data sources (SQL, NoSQL and unstructured data) and keep it up to date with changes using crawlers. Custom database connections can also be used, so you can build up a complete picture of your data portfolio in the Catalog. The metadata for each table can be supplemented, enabling Information Asset Owners (IAOs) to provide extra detail to consumers, such as the lag time of the data, its update frequency and known omissions.
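To make this concrete, here is a minimal sketch in Python of how a crawler and some supplementary table metadata might be set up with boto3. The crawler name, role ARN, database, paths and the metadata keys (such as update_frequency) are illustrative assumptions rather than values from our architecture, and the list of TableInput keys is only a subset, so check it against the Glue API reference before relying on it.

```python
from __future__ import annotations

from typing import Any

# A subset of the keys that Glue's TableInput structure accepts. Read-only
# fields returned by get_table (for example CreateTime) must not be sent back.
TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "Retention",
    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters",
)


def build_crawler_request(name: str, role_arn: str, database: str,
                          s3_paths: list[str], dynamodb_tables: list[str],
                          schedule: str = "cron(0 2 * * ? *)") -> dict[str, Any]:
    """Build the keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": path} for path in s3_paths],
            "DynamoDBTargets": [{"Path": table} for table in dynamodb_tables],
        },
        "Schedule": schedule,  # re-crawl nightly so the Catalog tracks changes
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


def with_governance_metadata(table: dict[str, Any],
                             extra: dict[str, str]) -> dict[str, Any]:
    """Turn a get_table result into a TableInput with extra parameters merged in."""
    table_input = {key: table[key] for key in TABLE_INPUT_KEYS if key in table}
    table_input["Parameters"] = {**table.get("Parameters", {}), **extra}
    return table_input


def apply(database: str, table_name: str, crawler_request: dict[str, Any],
          extra: dict[str, str]) -> None:
    """Create the crawler and annotate one table (needs AWS credentials)."""
    import boto3  # imported lazily so the builders above work without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**crawler_request)
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database,
                      TableInput=with_governance_metadata(table, extra))
```

Keeping the request builders separate from the boto3 calls means an IAO can review the crawler targets and the metadata they are adding before anything touches the account.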

With data available for browsing in one location, we can standardise other features of its management. Understanding the Data Quality (DQ) of a dataset enables potential consumers to make informed decisions as to whether it is fit for their use case. Tracking changes to that DQ over time enables IAOs to work with the providers and collectors to ensure it continues to meet agreed Service Level Agreement (SLA) thresholds. Without a common tool, the way DQ rules are applied and reported often varies per dataset, leading to inconsistent approaches. AWS Glue Data Quality is currently in Preview, but this feature enables DQ rules to be created and run per data table, using the same serverless approach for every table in one product.

One of the most helpful features is that Glue can create an initial set of rules for you by crawling the table and applying sensible presets to what it finds. The data in our example architecture has few, well-understood fields, but there are plenty of instances in the past where I have inherited datasets, much like code, with little documentation and vaguely named and inconsistent fields and processes. Quality Assurance teams could use Glue Data Quality to create a baseline of rules against the data at the time, to ensure updates to it at least meet those standards, then seek to improve it over time as understanding of it grows.
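As a rough sketch of what such a baseline could look like, the Python below assembles a small ruleset in Glue's Data Quality Definition Language (DQDL) and registers it against a catalogued table with boto3. The column names, ruleset name, database and table are hypothetical, and the rules are ones I chose for illustration, not what Glue recommended for our example data.

```python
from __future__ import annotations

# Hypothetical baseline rules for a patients table, written in DQDL.
BASELINE_RULES = [
    'IsComplete "patient_id"',
    'IsUnique "patient_id"',
    'IsComplete "date_of_birth"',
    'ColumnValues "height_cm" > 0',
]


def build_dqdl(rules: list[str]) -> str:
    """Join individual DQDL rules into a single ruleset string."""
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


def create_ruleset(name: str, database: str, table: str, rules: list[str]) -> None:
    """Register the ruleset against a Catalog table (needs AWS credentials)."""
    import boto3  # imported lazily so build_dqdl can be used without AWS access

    boto3.client("glue").create_data_quality_ruleset(
        Name=name,
        Ruleset=build_dqdl(rules),
        TargetTable={"DatabaseName": database, "TableName": table},
    )


def recommend_rules(database: str, table: str, role_arn: str) -> str:
    """Ask Glue to propose rules for a table and return the run identifier."""
    import boto3

    response = boto3.client("glue").start_data_quality_rule_recommendation_run(
        DataSource={"GlueTable": {"DatabaseName": database, "TableName": table}},
        Role=role_arn,
    )
    return response["RunId"]
```

Calling print(build_dqdl(BASELINE_RULES)) shows the ruleset text, which can also be pasted into the Console editor while you are still experimenting.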

As you can see from our example, quality has decreased on a subsequent run of the rules. This is because rules are versioned and, in this instance, we tightened the criteria further to warn of potentially invalid values for date of birth and height. This audit trail would be helpful in limiting concerns over changes to data quality over time, as new rules agreed with IAOs and providers can appear to reduce quality when they are in fact improving it. It would be great to see rulesets composed of other rulesets, to enable sharing of common quality checks, with changes to those checks automatically propagating to all use cases.
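The effect is easy to reproduce with a toy model. The Python below is not how Glue evaluates rules internally; it simply assumes a score equal to the share of rules that pass, and uses invented rows and thresholds to show the same data scoring lower once stricter rules are added.

```python
from __future__ import annotations

from datetime import date
from typing import Any, Callable

Row = dict[str, Any]
Rule = Callable[[list[Row]], bool]

# Invented sample rows: one future date of birth, one height recorded in mm.
ROWS: list[Row] = [
    {"patient_id": 1, "date_of_birth": date(1980, 5, 17), "height_cm": 172},
    {"patient_id": 2, "date_of_birth": date(2090, 1, 1), "height_cm": 165},
    {"patient_id": 3, "date_of_birth": date(1975, 11, 2), "height_cm": 1750},
]

REFERENCE_DATE = date(2024, 1, 1)  # fixed so the example is repeatable


def all_rows(check: Callable[[Row], bool]) -> Rule:
    """Lift a per-row check into a rule that passes only if every row passes."""
    return lambda rows: all(check(row) for row in rows)


# Version 1: completeness only.
RULESET_V1: dict[str, Rule] = {
    "dob_complete": all_rows(lambda r: r["date_of_birth"] is not None),
    "height_complete": all_rows(lambda r: r["height_cm"] is not None),
}

# Version 2: the same rules plus tighter validity checks.
RULESET_V2: dict[str, Rule] = {
    **RULESET_V1,
    "dob_not_in_future": all_rows(lambda r: r["date_of_birth"] <= REFERENCE_DATE),
    "height_plausible": all_rows(lambda r: 30 <= r["height_cm"] <= 250),
}


def score(ruleset: dict[str, Rule], rows: list[Row]) -> float:
    """Return the fraction of rules in the ruleset that pass."""
    passed = sum(1 for rule in ruleset.values() if rule(rows))
    return passed / len(ruleset)


print(score(RULESET_V1, ROWS))  # 1.0
print(score(RULESET_V2, ROWS))  # 0.5
```

The data has not changed between the two runs; only the rules have. That is exactly why recording which ruleset version produced each score matters.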

The same Data Quality rule process can also be used in AWS Glue ETL processes. The same helpful editor and code editor are available, although again, rule sharing would enable reuse of well-defined, tested data checks.

How Can Compliance be Incorporated?

Consistency is key to simpler data governance, so let’s ensure that every instance of the specified AWS database types has an associated crawler. This can be achieved with an AWS Config custom rule, which can also be triggered when relevant changes are made to AWS services, such as when a new DynamoDB table is created or an existing one modified.
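A minimal sketch of the Lambda function behind such a custom rule might look like the Python below. It assumes the rule is scoped to DynamoDB tables, that the configuration item's resourceId is the table name, and that a table counts as covered when any crawler lists it as a DynamoDB target; the function names are my own.

```python
from __future__ import annotations

import json
from typing import Any


def crawled_dynamodb_tables(crawlers: list[dict[str, Any]]) -> set[str]:
    """Collect every DynamoDB table name targeted by at least one crawler."""
    return {
        target["Path"]
        for crawler in crawlers
        for target in crawler.get("Targets", {}).get("DynamoDBTargets", [])
    }


def evaluate(item: dict[str, Any], crawled: set[str]) -> dict[str, Any]:
    """Build an AWS Config evaluation for one configuration item."""
    if (item["resourceType"] != "AWS::DynamoDB::Table"
            or item.get("configurationItemStatus") == "ResourceDeleted"):
        compliance = "NOT_APPLICABLE"
    elif item["resourceId"] in crawled:
        compliance = "COMPLIANT"
    else:
        compliance = "NON_COMPLIANT"
    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": compliance,
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }


def lambda_handler(event: dict[str, Any], context: Any) -> None:
    """Entry point invoked by AWS Config when a tracked resource changes."""
    import boto3  # imported lazily so the pure functions above are testable offline

    item = json.loads(event["invokingEvent"])["configurationItem"]
    glue = boto3.client("glue")
    crawlers: list[dict[str, Any]] = []
    for page in glue.get_paginator("get_crawlers").paginate():
        crawlers.extend(page["Crawlers"])
    boto3.client("config").put_evaluations(
        Evaluations=[evaluate(item, crawled_dynamodb_tables(crawlers))],
        ResultToken=event["resultToken"],
    )
```

Listing every crawler on each invocation is fine for a small estate; with many crawlers you would probably want to cache the result or narrow the lookup.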

The same approach could be applied to verify Data Quality rules are present for every table catalogued in Glue. The audit trail enables compliance teams to ensure that historically non-compliant resources have been resolved.
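For the Data Quality variant, the core check could be as simple as the Python sketch below, which compares the tables in a Catalog database with the tables that existing rulesets target. The helper names are hypothetical, and the response shapes are my reading of the Glue API, so verify them before building a rule on top.

```python
from __future__ import annotations

from typing import Any


def tables_without_rulesets(tables: list[dict[str, Any]],
                            rulesets: list[dict[str, Any]]) -> list[tuple[str, str]]:
    """Return (database, table) pairs that no ruleset targets."""
    covered = {
        (r["TargetTable"]["DatabaseName"], r["TargetTable"]["TableName"])
        for r in rulesets
        if "TargetTable" in r
    }
    return sorted(
        (t["DatabaseName"], t["Name"])
        for t in tables
        if (t["DatabaseName"], t["Name"]) not in covered
    )


def find_gaps(database: str) -> list[tuple[str, str]]:
    """Fetch tables and rulesets from Glue and report gaps (needs AWS credentials)."""
    import boto3  # imported lazily so the comparison above is testable offline

    glue = boto3.client("glue")
    tables: list[dict[str, Any]] = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        tables.extend(page["TableList"])

    rulesets: list[dict[str, Any]] = []
    token = None
    while True:
        kwargs = {"NextToken": token} if token else {}
        page = glue.list_data_quality_rulesets(**kwargs)
        rulesets.extend(page["Rulesets"])
        token = page.get("NextToken")
        if not token:
            break
    return tables_without_rulesets(tables, rulesets)
```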

How is Access Secured?

The same access patterns and controls that are used to collect and store your data can also be applied to AWS Glue. AWS Identity and Access Management (IAM) policies can restrict any aspect of Glue, from enabling read-only access to data from specific service instances to controlling which user roles can browse the Catalog. The metadata in the Catalog can be encrypted at rest using AWS Key Management Service (KMS).
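As an illustration, the Python below builds a read-only IAM policy document for browsing a single Catalog database and shows the boto3 call that turns on KMS encryption for Catalog metadata. The region, account ID, database name and key ID are placeholders, and the list of actions is a starting point I have assumed, not a complete least-privilege policy.

```python
from __future__ import annotations

import json
from typing import Any


def catalog_read_only_policy(region: str, account_id: str,
                             database: str) -> dict[str, Any]:
    """Build an IAM policy allowing read-only browsing of one Catalog database."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "BrowseCatalogReadOnly",
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartitions",
                ],
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/*",
                ],
            }
        ],
    }


def encrypt_catalog(kms_key_id: str) -> None:
    """Enable KMS encryption at rest for Catalog metadata (needs AWS credentials)."""
    import boto3  # imported lazily so the policy builder works without AWS access

    boto3.client("glue").put_data_catalog_encryption_settings(
        DataCatalogEncryptionSettings={
            "EncryptionAtRest": {
                "CatalogEncryptionMode": "SSE-KMS",
                "SseAwsKmsKeyId": kms_key_id,
            }
        }
    )


if __name__ == "__main__":
    # Placeholder values; print the JSON to paste into IAM or an IaC template.
    policy = catalog_read_only_policy("eu-west-2", "123456789012", "healthcare")
    print(json.dumps(policy, indent=2))
```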

What Next?

Although Data Quality for AWS Glue is in Preview, it is already available in several regions globally, so if you believe it could add value to your organisation, I highly recommend evaluating it. As with other AWS services like Glue Catalog, you can start by using the Console, and import instances into an IaC tool later as the solution matures.

Depending on your challenges, there are plenty of other AWS services available to facilitate continuous, consistent data governance across your cloud estate. Regardless of your governance maturity, it is always worth reviewing it against the AWS Cloud Adoption Framework. Look out for follow-up articles on the following related areas and applicable AWS services in the future: