Amazon Redshift: Overview, Compatible Services and Use Cases
This is part 3 of 3 in our series on Amazon Big Data Tools. See also Part 1 on Amazon Athena and Part 2 on Amazon EMR.
What is Amazon Redshift?
Redshift is Amazon’s relational, OLAP-style cloud data warehouse product, built for scale. It is a fully managed service designed to query huge datasets faster than the average warehousing product by using Massively Parallel Processing (MPP). The product is best suited to complex big data analysis applications that need insights quickly.
Redshift allows you to combine and structure data from disparate sources, such as financial, customer, and sales data, and then analyze it as a whole. While it’s not always a necessary tool in your pipeline, if you’re serious about big data analytics then it might just be what you’re looking for.
Under the hood, Redshift uses a cluster structure: a leader node, which communicates with your application, manages a set of compute nodes, each of which is divided into slices. Data is stored in a columnar format, and the whole structure has been built specifically to handle massively parallel processing. Amazon can recommend a cluster configuration from the Redshift dashboard, or you can dig in and set parameters yourself.
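To make the division of labour between the leader node and the slices more concrete, here is a toy Python sketch of the general idea. It is not Redshift code: the slice count, the hash-based row distribution, and every function name are assumptions made purely for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SLICES = 4  # hypothetical; real slice counts depend on the node type


def distribute(rows, key, num_slices=NUM_SLICES):
    """Assign each row to a slice by hashing its distribution key."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[crc32(str(row[key]).encode()) % num_slices].append(row)
    return slices


def slice_partial_sum(slice_rows, group_col, value_col):
    """Work done independently on one slice: a partial SUM per group."""
    partial = defaultdict(float)
    for row in slice_rows:
        partial[row[group_col]] += row[value_col]
    return dict(partial)


def leader_query(rows, group_col, value_col):
    """Leader role: fan the work out to the slices, then merge their results."""
    slices = distribute(rows, key=group_col)
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(
            pool.map(lambda s: slice_partial_sum(s, group_col, value_col), slices)
        )
    totals = defaultdict(float)
    for partial in partials:
        for group, subtotal in partial.items():
            totals[group] += subtotal
    return dict(totals)


sales = [
    {"region": "APAC", "amount": 120.0},
    {"region": "EMEA", "amount": 80.0},
    {"region": "APAC", "amount": 30.0},
    {"region": "AMER", "amount": 50.0},
]
totals = leader_query(sales, "region", "amount")
```

Each slice only ever touches its own rows, and the leader only sees small partial results. That is the property that lets this style of query spread across many nodes.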
It’s important to follow Redshift best practices to ensure correct data loading, optimal cluster configuration, fast query times, and minimal spend on the platform. Redshift is only performant given the right set of circumstances and configuration. Storing all your historical data in Redshift, for instance, would be a waste of resources. That data is best left in a data lake like S3, from which it can be moved to Redshift as needed (or queried in place via Redshift Spectrum).
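Moving data from S3 into Redshift "as needed" is typically done with Redshift's COPY command. The following is a minimal Python sketch that only builds the SQL text; it does not connect to a cluster. The table name, bucket path, and IAM role ARN are placeholders, and the helper function itself is hypothetical.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, file_format="CSV", ignore_header=1):
    """Return a Redshift COPY statement that bulk-loads the files under s3_uri.

    Values are interpolated directly, so pass trusted identifiers only.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    lines = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"FORMAT AS {file_format}",
    ]
    # IGNOREHEADER skips header rows and only makes sense for text formats.
    if file_format.upper() == "CSV" and ignore_header:
        lines.append(f"IGNOREHEADER {ignore_header}")
    return "\n".join(lines) + ";"


# Placeholder values: substitute your own table, bucket, and role.
copy_sql = build_copy_statement(
    "sales_2024",
    "s3://example-bucket/sales/2024/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole",
)
```

The resulting string can be run through whichever SQL client you already use with Redshift. Loading one slice of history at a time, such as a single year's prefix, keeps the warehouse limited to the data you are actively querying.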
Benefits of using Amazon Redshift
NEAR REAL-TIME COMPLEX QUERYING ON MASSIVE DATA SETS
This is the clearest benefit of using Redshift as your data warehouse. Redshift is specifically designed to handle massive datasets (think: a petabyte plus), which is why data-centric organisations like Dow Jones, Yelp, and Guardian News and Media all use the service as part of their tech stack.
HIGHLY SCALABLE CONCURRENCY
Redshift prides itself on its massive concurrent workload scalability. The relatively new Concurrency Scaling feature means that clusters can add capacity automatically when demand spikes. You no longer have to choose between over-provisioning in anticipation of high workloads and best-fit provisioning, which can slow processing when workloads rise.
FAULT TOLERANT AND BACKED UP
Built-in monitoring and data replication mean that your data remains available even if cluster drives or nodes fail. Data is also backed up automatically to S3 for disaster recovery.
SUPPORTS STANDARD SQL
Create queries easily without the need to learn another query language. Note that some functions run exclusively on the leader node.
ON-DEMAND CLOUD-BASED ARCHITECTURE
Redshift does away with the onsite server infrastructure otherwise required to query petabytes of data quickly. It’s not always practical to have this capacity onsite, and even when you do, query times may be slower due to infrastructure limitations.
INTEGRATE WITH EXISTING DATA SOURCES AND BI TOOLS USING AWS PARTNERS
Yes, it’s possible to automate the Extract, Transform, Load (ETL) process using Redshift with your existing data sources and BI products. For instance, if you want to run your own analytics on your Salesforce data, you can use Blendo as your connector and then use Tableau for your BI. Check out the list of Redshift Partners here.
Complementary AWS services
As with most AWS products, there are a variety of helper products that can give you a more streamlined end-to-end solution.
AMAZON S3 + REDSHIFT SPECTRUM
Amazon’s foremost data lake product, Amazon S3, can be used to store all applicable business data in the cloud. Instead of loading this data into Redshift, you can query it directly from S3 when you use the Redshift-included service Spectrum. This eliminates a fair part of the work involved in manually configuring the ETL process. Added bonus: you can even join data directly from S3 with data already in Redshift.
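As a rough illustration of the Spectrum workflow, the Python sketch below builds two SQL statements: one that registers an external schema backed by the data catalog, and one that joins an S3-backed external table with a table stored in Redshift. The schema, database, table, and column names and the IAM role ARN are all hypothetical, and the code only produces SQL text rather than running it.

```python
def external_schema_sql(schema, catalog_db, iam_role_arn):
    """Build a CREATE EXTERNAL SCHEMA statement for Redshift Spectrum."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


# Join data that lives on S3 (external table) with data already in Redshift.
# Table and column names are placeholders for illustration.
JOIN_SQL = """
SELECT c.customer_id, c.segment, COUNT(*) AS page_views
FROM spectrum_lake.clickstream AS e  -- external table, files stay on S3
JOIN public.customers AS c           -- regular table stored in Redshift
  ON c.customer_id = e.customer_id
GROUP BY c.customer_id, c.segment;
""".strip()

schema_sql = external_schema_sql(
    "spectrum_lake",
    "example_lake_db",
    "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
```

The external table itself still has to be defined, either with CREATE EXTERNAL TABLE or through a catalog crawler, before the join can run.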
AWS GLUE
Instead of developing your own ETL pipeline for use with Redshift, you can use Amazon’s managed ETL service, AWS Glue, which runs these jobs in a serverless Apache Spark environment. AWS Glue is a data model organization, schema discovery, and cataloguing product that can help sort data across your multiple data source inputs. It is also an easy way to load your data to and from Redshift.
AMAZON KINESIS DATA FIREHOSE
Use Kinesis Data Firehose to capture, transform, and load streaming data into Redshift in near real-time. You can check out a simple temperature sensor example here.
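For a sense of what the producer side might look like, here is a small Python sketch that sends one temperature reading to a delivery stream. It assumes the client passed in is a `boto3.client("firehose")` object and uses its `put_record` call; the stream name, field names, and the `RecordingClient` stand-in (which lets the sketch run without AWS access) are all hypothetical.

```python
import json


def encode_reading(sensor_id, celsius, timestamp):
    """Serialise one reading as newline-terminated JSON bytes."""
    payload = {"sensor_id": sensor_id, "celsius": celsius, "ts": timestamp}
    return (json.dumps(payload, sort_keys=True) + "\n").encode("utf-8")


def send_reading(firehose_client, stream_name, sensor_id, celsius, timestamp):
    """Send one reading; firehose_client is assumed to be boto3.client('firehose')."""
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_reading(sensor_id, celsius, timestamp)},
    )


class RecordingClient:
    """Local stand-in for the boto3 client so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def put_record(self, DeliveryStreamName, Record):
        self.calls.append((DeliveryStreamName, Record))
        return {"RecordId": f"local-{len(self.calls)}"}


client = RecordingClient()
response = send_reading(
    client, "example-temperature-stream", "sensor-1", 21.5, "2024-01-01T00:00:00Z"
)
```

Newline-delimited JSON is used here so that individual records stay separable once they are batched into files on the way to Redshift. The delivery stream itself, including its Redshift destination, is configured separately.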
AMAZON QUICKSIGHT
Connect with QuickSight for Business Intelligence analytics, creating dashboards and charts for users across your organisation.
“Big data is the ‘New Oil’: the black gold of the 21st century.”
The insights you can mine from your data are extremely valuable.
With good data analysis, big data can help you understand your business and your customers in a way that was previously impossible.
Use Cases for Amazon Redshift
MISSION CRITICAL WORKLOADS
As mentioned, the Concurrency Scaling feature makes Redshift highly scalable, and its built-in fault tolerance means it can be used for mission-critical workloads.
(NEAR) REAL-TIME STREAMING DATA ANALYSIS
Redshift’s performance makes it a good fit for solutions that require near real-time analytics of large-scale streaming data. This could be, for instance, gleaning insights from a global, enterprise-wide set of IoT devices. Here’s an example of how to do this in practice with Heroku.
WEB LOG AND CLICKSTREAM USER ENGAGEMENT ANALYSIS
Tech-first enterprises accumulate large amounts of data generated by user interaction with websites and web and mobile apps. Pair both historic and incoming data with Redshift as part of an analysis stack to gain previously undiscovered user engagement insights.
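The kind of question such an analysis asks can be expressed in ordinary SQL. The sketch below runs a daily engagement query against an in-memory SQLite database as a local stand-in, so it can be tried without a cluster. The table, columns, and sample rows are invented for illustration, and a production query on Redshift would run against far larger tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in, not a Redshift connection
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, viewed_on TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("u1", "/home", "2024-01-01"),
        ("u1", "/pricing", "2024-01-01"),
        ("u2", "/home", "2024-01-01"),
        ("u1", "/home", "2024-01-02"),
    ],
)

# Daily active users and total page views, using only standard SQL.
ENGAGEMENT_SQL = """
SELECT viewed_on,
       COUNT(DISTINCT user_id) AS active_users,
       COUNT(*)                AS page_views
FROM page_views
GROUP BY viewed_on
ORDER BY viewed_on
"""
daily = conn.execute(ENGAGEMENT_SQL).fetchall()
conn.close()
```

With the sample rows above, `daily` holds one row per day containing the date, the number of distinct users, and the number of page views.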
Case Studies for Amazon Redshift
To find out how we used Amazon Redshift at Jetstar to deliver a highly-scalable and resilient Advanced Data Analytics platform, providing insights and predictions to support business decisions, read Jetstar: Moving to a Fully-Enabled Data Insights Platform.
To find out how we used Amazon Redshift at Medibank to build a centralised data platform which allows internal Medibank business units to easily analyse and access large volumes of data, read Medibank: Transforming the Customer Experience with Centralised Big Data Platform.
For an in-depth overview of Redshift from an insider’s perspective (including systems rollover), take a look at Amazon Redshift - Fundamentals by Jef Claes.