Amazon Athena: Overview, Compatible Services and Use Cases
What is Amazon Athena?
Athena is a serverless query service that allows you to run SQL queries on your data stored in S3. It can effortlessly query large datasets (or Data Lakes), whether they are structured, unstructured, or semi-structured data. Data formats supported include “CSV, TSV, custom-delimited, and JSON formats; data from Hadoop-related formats: ORC, Apache Avro and Parquet; logs from Logstash, AWS CloudTrail logs, and Apache WebServer logs”.
With Athena, you can run queries directly on your data objects without the overhead of first loading them into an expensive database or learning a complex new query language.
This service is part of Amazon’s OLAP (Online Analytical Processing) toolset, alongside others such as Kinesis Analytics (for real-time data analysis), Elasticsearch Service (for the ELK stack without the operational overheads) and Redshift (for data warehousing and analysis). Athena is based on Presto, an open-source distributed SQL query engine.
Benefits of using Amazon Athena
Perfect for ad-hoc queries
Athena is a great tool to use if you quickly need a specific answer to a data question that can be framed in SQL. You can run a query and get an answer straight away. It may not be ideal in terms of cost if you’re looking for a tool to run the same queries repeatedly over the same datasets, because every run is billed for the data it scans.
Save money / developer time
Because Athena is serverless, you pay only per TB of data scanned by your queries: $5/TB at the time of writing. So long as your SQL queries are selective enough not to scan huge datasets every time, this can be a more cost- and time-effective solution than, say, implementing a Presto solution yourself with server provisioning and configuration, or loading the data into a data warehouse. Efficiencies are gained by using a columnar format (Parquet, ORC) and by compressing the objects, which often reduces the amount of data scanned, and therefore the cost, by a factor of 5-10.
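To make the pricing arithmetic concrete, here is a minimal sketch in Python. It assumes the $5/TB figure quoted above, treats 1 TB as 1024^4 bytes, and ignores any per-query minimum or rounding that a real bill may apply. The 8x reduction is an illustrative value inside the 5-10 range, and the function name is hypothetical.

```python
# Rough Athena cost estimate based on bytes scanned (a sketch, not a billing tool).
PRICE_PER_TB_USD = 5.0       # figure quoted in this article; check current pricing
BYTES_PER_TB = 1024 ** 4     # assumption: binary terabyte


def estimate_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Return the approximate cost in USD of scanning `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Assumed scenario: a query that scans 2 TB of raw CSV, and the same data
# stored in a compressed columnar format that cuts the scan by a factor of 8.
raw_bytes = 2 * BYTES_PER_TB
columnar_bytes = raw_bytes // 8

raw_cost = estimate_query_cost(raw_bytes)
columnar_cost = estimate_query_cost(columnar_bytes)
print(f"raw: ${raw_cost:.2f}, columnar: ${columnar_cost:.2f}")
```

Under these assumptions the same question costs $10.00 against the raw files and $1.25 against the columnar copy, which is why file format matters more than query count for occasional use.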
Access data from different S3 objects at once
If you have a range of different objects (files) storing your data, it’s simple to run a query across all of them at once. The flexibility of S3 as a storage medium, and as a source or target for a large range of AWS services, increases the scope of data available to Athena. S3 is central to the majority of Data Lakes implemented on AWS.
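As a rough illustration, an Athena table is defined over an S3 location, so one query reads every object stored under it. The sketch below submits such a query from Python using boto3's `start_query_execution` call. The database `sales_lake`, the table `orders`, the results bucket and the helper names are all hypothetical, and actually submitting the query needs boto3 installed and valid AWS credentials.

```python
def build_query_request(sql, database, output_location):
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit_query(request):
    """Submit the query and return its execution ID (needs AWS credentials)."""
    import boto3  # imported lazily so the request can be built without boto3

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names throughout: database, table, columns and bucket.
request = build_query_request(
    sql="SELECT region, COUNT(*) AS order_count FROM orders GROUP BY region",
    database="sales_lake",
    output_location="s3://example-athena-results/adhoc/",
)
# query_id = submit_query(request)  # uncomment once credentials are configured
```

Keeping the request builder separate from the submission step makes it easy to inspect or log exactly what will be sent before any data is scanned.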
Easy to create queries
It uses SQL, which is pretty much a native language to everyone working with data.
Security and Access integration
As a native AWS service, Athena integrates directly with your IAM users, roles and policies, with your S3 bucket policies, and with the encryption keys used to secure your valuable data.
Athena is fast, secure, highly available, and can be accessed directly from the AWS CLI or Console.
Big data is the ‘New Oil’: the black gold of the 21st century.
The insights you can mine from your data are extremely valuable.
With good data analysis, big data can help you understand your business and your customers in a way that was previously impossible.
Complementary AWS services
Beyond S3, there are a number of services that pair perfectly with Athena for more complex or automated enterprise applications.
AWS Glue is a service to help organise your data. It crawls your data, determines data formats, then suggests schemas and transformations (ETL - Extract, Transform, Load - via editable scripts). This allows you to create a Data Catalogue from disparate data sources that have been combined, increasing the usefulness of your data. The organisation given by the Data Catalogue is available to Athena out-of-the-box.
QuickSight is Amazon’s pay-per-use Business Intelligence service, which allows you to create and publish interactive dashboards and charts based on analytics, including machine learning insights. QuickSight can query data with Athena; it’s a convenient way to present the results of your queries as easy-to-understand insights.
Use Cases for Amazon Athena
Archival log analysis
If you’ve been doing due diligence within your organisation, you’ll have plenty of logs available that may be perfect candidates for further analysis. Run your query, gather your results, then analyse from there.
Examples may be:
- System error logs
- AWS service logs
- Load balancing logs
- Web traffic logs
Here’s a quick example of how to run Athena over Amazon Application Load Balancer logs.
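As a sketch of the kind of question such logs can answer, the Python snippet below builds a query that lists the URLs returning the most 5xx responses. It assumes a table named `alb_logs` already exists over the log bucket, with `request_url` and an integer `elb_status_code` column as in AWS's documented table definition for Application Load Balancer logs; the function name is hypothetical, and your own table and column names may differ.

```python
def top_failing_urls_query(table="alb_logs", limit=10):
    """Build SQL listing the URLs with the most 5xx load balancer responses.

    `table` is interpolated directly, so only pass trusted table names.
    """
    limit = int(limit)  # guard against non-numeric input
    return (
        "SELECT request_url, COUNT(*) AS error_count\n"
        f"FROM {table}\n"
        "WHERE elb_status_code >= 500\n"
        "GROUP BY request_url\n"
        "ORDER BY error_count DESC\n"
        f"LIMIT {limit}"
    )


print(top_failing_urls_query())
```

The resulting text can be pasted into the Athena console or submitted through the CLI or an SDK.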
Quickly check new datasets for validity
If you’ve suddenly come across a dataset and you want to see if the data contained within it is actually useful or full of errors, then you can run a quick query to view the results and see if they look logical, need fixing first, or are full of dirty data.
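One possible first check is to count missing values per column. The sketch below generates that query in Python; the table and column names are hypothetical, and it relies only on the standard SQL behaviour that `COUNT(column)` skips NULLs while `COUNT(*)` counts every row.

```python
def null_profile_query(table, columns):
    """Build a query reporting the row count and the null count per column."""
    parts = ["COUNT(*) AS total_rows"]
    for col in columns:
        # COUNT(*) counts all rows, COUNT("col") only non-null values,
        # so the difference is the number of NULLs in that column.
        parts.append(f'COUNT(*) - COUNT("{col}") AS "{col}_nulls"')
    select_list = ",\n  ".join(parts)
    return f'SELECT\n  {select_list}\nFROM "{table}"'


# Hypothetical dataset and columns, for illustration only.
print(null_profile_query("new_dataset", ["customer_id", "order_date"]))
```

A column whose null count is close to `total_rows` is a quick signal that the dataset needs cleaning before it is worth deeper analysis.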
Time-critical ad-hoc data queries
If you have one or a few questions that need resolution quickly, or data reports that need doing “yesterday!”, and you don’t have the time to set up traditional query servers and import the data into them, then Athena is the way to go.
For Data Scientists
Data Scientists can spend more time gathering and pre-processing data than training models or otherwise gaining value from the data. Athena is a great tool for those aspects of their work, and the ability to run queries directly from notebooks in Amazon SageMaker brings large quantities of data directly to their fingertips.
For a comprehensive overview of Athena, check out AWS’s Amazon Athena Capabilities and Use Cases Overview presentation on SlideShare.