Modern cloud-based data services have revolutionized the way companies manage their data. Tools such as Amazon Athena and Amazon Redshift have changed data warehouse technology, catering for a move towards interactive, real-time, analytical solutions.
Both Amazon Athena and Redshift offer their own unique benefits and use cases. Athena provides a cheaper and more portable way to query data while Redshift offers unrivalled performance and scalability.
The following article provides a brief comparison of the Amazon Athena and Redshift data services. By understanding the main uses of each and comparing them under key headings, you can come to a more informed decision in choosing the right tools for your company’s data needs.
Amazon Redshift is a data warehouse best suited to pulling together data from a disparate variety of sources. For example, you can combine data from inventory systems, financial systems, and retail sales systems into a common format for business intelligence and reporting.
Redshift is optimized for high-performance, allowing you to quickly run complex queries that join large numbers of huge database tables.
Redshift requires you to setup and manage servers in collections known as clusters. Nodes are servers that can communicate with each other inside a Redshift cluster.
Amazon Athena is a query service that doesn’t require you to setup or manage any infrastructure.
Athena lets you easily query data stored in the cloud on Amazon’s S3 data storage service. A query service like Athena is a faster option for getting information from your data. You simply define a table in Athena and use standard SQL to query data without worrying about the format of the data.
Note that because of their different use cases, it can be good idea to use both Amazon Athena and Redshift together.
You can use Amazon S3 to create a staging database before loading the data into Redshift.
You can register the staging database with Amazon Athena to query the same data. The benefits of using both services together are:
You can get Athena up and running in minutes. Athena requires zero infrastructure—it directly queries data already stored on Amazon S3.
Redshift takes much longer to set up. You need to prepare a cluster, chose the right settings for it, and load data into tables.
Partitioning improves performance by ensuring queries only run on relevant data grouped into smaller tables.
In Athena, the price you get charged for the service depends on the bytes scanned. So, intelligently partitioning data leads to cost benefits in Athena. You can use any key to partition data with Athena—the maximum partitions per table is 20,000.
Redshift does not support table partitioning by default. Rather, Redshift uses defined distribution styles to optimize tables for parallel processing. It’s vital to choose the right keys for each table to ensure the best performance in Redshift.
Amazon has recently added the ability to perform table partitioning using Amazon Spectrum. It is possible to partition external tables on one or more columns. Partitioning the external tables improves performance, because the Amazon Redshift query optimizer eliminates partitions that don’t contain data.
Redshift doesn’t enforce a Primary Key constraint for the data you load, meaning inaccurate results due to duplication are a possibility. Primary keys and foreign keys are used as planning hints. Therefore, primary keys should be declared if your ETL process or some other process in your application enforces their integrity.
For Amazon Athena, duplication can occur where the underlying Amazon S3 datasets contain duplicate values.
Amazon Athena supports many data formats, including CSV, TSV, and JSON. Athena also supports open source columnar formats and compressed data formats, such as Snappy, Zlib, LZO, and GZIP. Redshift supports JSON (simple, nested), CSV, TSV, and Apache logs. Data Types Redshift does not support some complex data types such as arrays, geometric types, and XML.
Athena supports arrays, maps, and structs.
With Amazon Athena, you pay only for the queries you run. Athena costs $5 per terabyte of data scanned during a query execution. The minimum data that gets scanned in one execution is 10 megabytes. You don’t get charged for failed queries.
Redshift pricing is based on the type and number of nodes in your cluster. You pay an hourly rate of $0.250 to $4.800 for Dense Compute nodes, or $0.850 to $6.800 hourly for a Dense Storage node.
Choosing between Redshift and Athena depends on what you need from your data. Both of these data services provide quite different functions for organizations, so it’s not usually an either or case.
Saying that, the below criteria give a quick way to decide between the two:
Skeddly is the leading managed scheduling service for your AWS account. Using Skeddly, you can: