Amazon Athena pricing is based on the bytes scanned. Anything you can do to reduce the amount of data that’s being scanned will help reduce your Amazon Athena query costs.
In our previous article, Partitioning Your Data With Amazon Athena, we partitioned our data into folders to reduce the amount of data scanned. But those partitions were being loaded into our Athena table manually.
In this article, we will show how to load the partitions automatically.
Like the previous articles, our data is JSON data. One record per file. One record per line:
{ "name": "Albert", "numPets": 1, "birthDate": "1970-01-01 07:20:00", "LargeDataItem": "..." }
Previously, we partitioned our data into folders by the numPets
property. Here is a listing of that data in S3:
$ aws s3 ls s3://athena-testing-1/Partitioned/ --recursive
2016-12-19 20:52:52 2908 Partitioned/0/Data3.json
2016-12-19 20:52:52 2905 Partitioned/0/Data5.json
2016-12-19 20:52:52 2907 Partitioned/1/Data1.json
2016-12-19 20:52:52 2906 Partitioned/2/Data2.json
2016-12-19 20:52:52 2907 Partitioned/3/Data4.json
With the above structure, we must use ALTER TABLE
statements in order to load each partition one-by-one into our Athena table. However, by ammending the folder name, we can have Athena load the partitions automatically.
In order to load the partitions automatically, we need to put the column name and value in the object key name, using a column=value
format. For example, we would use numpets=1
for the folder, instead of just 1
.
Important: The column name in the object key name must be all lower-case. So, our column in our JSON data is numPets
. But in the object key name in S3, we must use numpets
.
Our new data listing becomes:
$ aws s3 ls s3://athena-testing-1/AutoPartitioned/ --recursive
2016-12-21 14:47:29 2908 AutoPartitioned/numpets=0/Data3.json
2016-12-21 14:47:29 2905 AutoPartitioned/numpets=0/Data5.json
2016-12-21 14:47:29 2907 AutoPartitioned/numpets=1/Data1.json
2016-12-21 14:47:29 2906 AutoPartitioned/numpets=2/Data2.json
2016-12-21 14:47:29 2907 AutoPartitioned/numpets=3/Data4.json
We only have a single partition using the numPets
column. If we wanted to add a second partition, we could do so, like AutoPartitioned/numpets=1/name=Albert/...
. The second partition would need to be reflected in the PARTITIONED BY
clause of the CREATE EXTERNAL TABLE
statement.
Creating the table for our auto-partitioned data is identical to our previous SQL script:
CREATE EXTERNAL TABLE IF NOT EXISTS TestDb.AutoPartitionedTable1 (
`name` string,
`birthDate` timestamp
)
PARTITIONED BY (numPets int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://athena-testing-1/AutoPartitioned/'
The only difference from before is the table name and the S3 location.
Previously, we added partitions manually using individual ALTER TABLE
statements.
This time, we’ll issue a single MSCK REPAIR TABLE
statement. This statement will (among other things), instruct Athena to automatically load all the partitions from the S3 data.
So we issue the following command to load the partitions:
MSCK REPAIR TABLE TestDb.AutoPartitionedTable1
and we get the following as the output:
Partitions not in metastore: autopartitionedtable1:numpets=0 autopartitionedtable1:numpets=1 autopartitionedtable1:numpets=2 autopartitionedtable1:numpets=3
Repair: Added partition to metastore TestDb.AutoPartitionedTable1:numpets=0
Repair: Added partition to metastore TestDb.AutoPartitionedTable1:numpets=1
Repair: Added partition to metastore TestDb.AutoPartitionedTable1:numpets=2
Repair: Added partition to metastore TestDb.AutoPartitionedTable1:numpets=3
If you don’t see the “Added partition” lines in your output, then your partitions did not get added.
After you load the partitions, you can view your table’s partitions using the SHOW PARTITIONS
statement:
SHOW PARTITIONS TestDb.AutoPartitionedTable1
and we’ll see:
numpets=0
numpets=2
numpets=3
numpets=1
Just like before, we can query our data using standard SQL:
SELECT * FROM AutoPartitionedTable1
WHERE numPets > 1
Partitioning your data can dramatically reduce the amount of data scanned during your Athena queries. If you can control the format of the object key names in S3, you can take advantage of Athena’s ability to automatically load the partitions for you. This eliminates the need to manually issue ALTER TABLE
statements for each partition, one-by-one.
Please note, by default Athena has a limit of 20,000 partitions per table. This limit can be raised by contacting AWS Support. Whatever limit you have, ensure your data stays below that limit.
Skeddly is the leading scheduling service for your AWS account. Using Skeddly, you can:
Sign-up for our 30 day free trial or sign-in to your Skeddly account to get started.