Amazon Athena is an interactive query service that lets you query your data in Amazon S3 using standard SQL statements. Amazon Athena only reads your data; it will not add to or modify it. So you can think of it as only being able to execute SELECT statements.
Today, we’re going to take a closer look at Amazon Athena pricing and how you can reduce your Athena costs.
According to the Amazon Athena Pricing page, Athena is priced at $5 per TB (terabyte) scanned per query execution. There is a 10 MB data scanning minimum per execution. You are not charged for failed queries. If you cancel a query, you are charged for the data scanned up to the point of cancelling the query.
Doing that math for smaller queries:
$5 / 1024 / 1024 ≈ $4.768e-6
So you will be charged $0.000004768 per MB scanned with a $0.00004768 minimum charge (for the 10 MB scanning minimum). So be careful of those 200 KB queries. You will still be charged for a full 10 MB.
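That per-query charge, including the 10 MB minimum, can be sketched in a few lines of Python. This is a back-of-the-envelope helper based on the published $5-per-TB rate, not an official AWS calculator:

```python
# Estimate the Athena cost of a single successful query from MB scanned.
# Assumes the published rate of $5 per TB and the 10 MB scanning minimum.
PRICE_PER_TB = 5.00
MB_PER_TB = 1024 * 1024
MIN_MB_SCANNED = 10

def athena_query_cost(mb_scanned: float) -> float:
    """Return the charge in dollars for one successful query."""
    billable_mb = max(mb_scanned, MIN_MB_SCANNED)
    return billable_mb * PRICE_PER_TB / MB_PER_TB

# A 200 KB scan still bills the 10 MB minimum,
# so it costs the same as a 10 MB scan:
print(f"{athena_query_cost(0.2):.10f}")
```

A full terabyte scanned works out to exactly $5.00, and anything at or below 10 MB costs the minimum.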
Database, table, schema, and DDL-related executions are all free. For example, there is no charge for any of the following statements:
CREATE EXTERNAL TABLE
ALTER TABLE
MSCK REPAIR TABLE
Amazon Athena reads your data stored in Amazon S3. There will be normal S3 data charges for the storage of that data, depending on how it’s stored.
Amazon Athena stores query history and results in a secondary S3 bucket (the query result location you configure), so you will pay normal S3 storage charges for that data as well. To reduce the cost of keeping historical results, you can use S3 lifecycle rules to delete old results.
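As an example, an S3 lifecycle configuration that deletes query results after 30 days might look like the following (the rule ID and the `athena-results/` prefix are placeholders; adjust them to your own result location):

```json
{
  "Rules": [
    {
      "ID": "expire-athena-results",
      "Filter": { "Prefix": "athena-results/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```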
Amazon Athena pricing is based on the bytes read out of S3. It’s not based on the bytes of record data read into Athena. So, if your data is compressed in S3, then that will help reduce the Athena costs.
By simply using GZIP on your input files before they are placed into S3, you can reduce your costs.
For example, suppose your log file is 20 MB uncompressed, and GZIP compresses it down to 10 MB before you place it in S3. When Athena scans that log file, you will pay for only 10 MB.
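You can see the effect with Python’s built-in gzip module. This is a rough sketch using synthetic, repetitive log-like data; real-world compression ratios depend entirely on your content:

```python
import gzip

# Synthetic, repetitive log data compresses very well;
# real log files will compress less dramatically.
log_data = b"2017-01-01 INFO request handled in 12ms\n" * 50_000

compressed = gzip.compress(log_data)
ratio = len(compressed) / len(log_data)
print(f"{len(log_data)} bytes -> {len(compressed)} bytes "
      f"({ratio:.0%} of original)")
```

Since Athena bills on bytes read out of S3, the compressed size is what you pay to scan.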
Without partitions, all of your data must be scanned just so your WHERE clause can eliminate rows from the results.
By structuring your data in S3 using prefixes, you can use partitions to eliminate large amounts of data from being read from S3.
For example, if your data had a column such as CustomerId and your queries resembled the following:
SELECT * FROM table1
WHERE CustomerId = 'cus_1'
Then you can structure your data in S3 using prefix folders, such as:
cus_1/DataFile1.json.gz
if you manually add your partitions using ALTER TABLE, or
CustomerId=cus_1/DataFile1.json.gz
if you want to use MSCK REPAIR TABLE to automatically load your partitions. And yes, that is a key=value pair in the S3 object’s key name. That is how Athena knows the partition information.
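Putting those pieces together, a partitioned table might look like the following sketch. The table name, columns, bucket name, and the choice of JSON SerDe are placeholders for illustration:

```sql
CREATE EXTERNAL TABLE table1 (
  EventTime string,
  Amount double
)
PARTITIONED BY (CustomerId string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/data/';

-- Load partitions automatically from "CustomerId=..." prefixes:
MSCK REPAIR TABLE table1;

-- Or register a single partition manually:
ALTER TABLE table1 ADD PARTITION (CustomerId = 'cus_1')
  LOCATION 's3://my-bucket/data/cus_1/';
```

Note that the partition column (CustomerId) is declared in PARTITIONED BY, not in the regular column list.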
By storing your data in S3 in a columnar format, you can reduce the amount of data read from S3. Athena works well with the Apache Parquet and ORC formats.
Note: if your queries are SELECT * FROM ..., then you are reading all columns, and you won’t benefit from columnar storage. To take advantage of columnar storage, explicitly specify the columns you want:
SELECT `Col1`, `Col2` FROM tableName
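If your data is not already columnar, Athena itself can convert it using a CREATE TABLE AS SELECT (CTAS) statement. A sketch, with placeholder table, column, and bucket names:

```sql
-- Write a Parquet copy of table1; queries against
-- table1_parquet will scan far fewer bytes per column.
CREATE TABLE table1_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/parquet-data/'
) AS
SELECT Col1, Col2, CustomerId
FROM table1;
```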
Amazon Athena is a very exciting new service. Be aware of the pricing structure, structure your data and queries to reduce your costs as much as possible, and you’ll have a fantastic and powerful new addition to your serverless arsenal.
Skeddly is the leading scheduling service for your AWS account.
Sign up for our 30-day free trial or sign in to your Skeddly account to get started.