Automatic Partitioning With Amazon Athena

Amazon Athena pricing is based on the bytes scanned. Anything you can do to reduce the amount of data that’s being scanned will help reduce your Amazon Athena query costs.

In our previous article, Partitioning Your Data With Amazon Athena, we partitioned our data into folders to reduce the amount of data scanned. But those partitions were being loaded into our Athena table manually.

In this article, we will show how to load the partitions automatically.

Our Input Data

Like the previous articles, our data is JSON data. One record per file. One record per line:

{ "name": "Albert", "numPets": 1, "birthDate":  "1970-01-01 07:20:00", "LargeDataItem": "..." }

Previously, we partitioned our data into folders by the numPets property. Here is a listing of that data in S3:

$ aws s3 ls s3://athena-testing-1/Partitioned/ --recursive
2016-12-19 20:52:52       2908 Partitioned/0/Data3.json
2016-12-19 20:52:52       2905 Partitioned/0/Data5.json
2016-12-19 20:52:52       2907 Partitioned/1/Data1.json
2016-12-19 20:52:52       2906 Partitioned/2/Data2.json
2016-12-19 20:52:52       2907 Partitioned/3/Data4.json

With the above structure, we must use ALTER TABLE statements in order to load each partition one-by-one into our Athena table. However, by ammending the folder name, we can have Athena load the partitions automatically.

In order to load the partitions automatically, we need to put the column name and value in the object key name, using a column=value format. For example, we would use numpets=1 for the folder, instead of just 1.

Important: The column name in the object key name must be all lower-case. So, our column in our JSON data is numPets. But in the object key name in S3, we must use numpets.

Our new data listing becomes:

$ aws s3 ls s3://athena-testing-1/AutoPartitioned/ --recursive
2016-12-21 14:47:29       2908 AutoPartitioned/numpets=0/Data3.json
2016-12-21 14:47:29       2905 AutoPartitioned/numpets=0/Data5.json
2016-12-21 14:47:29       2907 AutoPartitioned/numpets=1/Data1.json
2016-12-21 14:47:29       2906 AutoPartitioned/numpets=2/Data2.json
2016-12-21 14:47:29       2907 AutoPartitioned/numpets=3/Data4.json

We only have a single partition using the numPets column. If we wanted to add a second partition, we could do so, like AutoPartitioned/numpets=1/name=Albert/.... The second partition would need to be reflected in the PARTITIONED BY clause of the CREATE EXTERNAL TABLE statement.

Creating the Tables

Creating the table for our auto-partitioned data is identical to our previous SQL script:

CREATE EXTERNAL TABLE IF NOT EXISTS TestDb.AutoPartitionedTable1 (
  `name` string,
  `birthDate` timestamp
)
PARTITIONED BY (numPets int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://athena-testing-1/AutoPartitioned/'

The only difference from before is the table name and the S3 location.

Adding Partitions

Previously, we added partitions manually using individual ALTER TABLE statements.

This time, we’ll issue a single MSCK REPAIR TABLE statement. This statement will (among other things), instruct Athena to automatically load all the partitions from the S3 data.

So we issue the following command to load the partitions:

MSCK REPAIR TABLE TestDb.AutoPartitionedTable1

and we get the following as the output:

Partitions not in metastore:	autopartitionedtable1:numpets=0	autopartitionedtable1:numpets=1	autopartitionedtable1:numpets=2	autopartitionedtable1:numpets=3
Repair: Added partition to metastore TestDb.AutoPartitionedTable1:numpets=0
Repair: Added partition to metastore TestDb.AutoPartitionedTable1:numpets=1
Repair: Added partition to metastore TestDb.AutoPartitionedTable1:numpets=2
Repair: Added partition to metastore TestDb.AutoPartitionedTable1:numpets=3

If you don’t see the “Added partition” lines in your output, then your partitions did not get added.

After you load the partitions, you can view your table’s partitions using the SHOW PARTITIONS statement:

SHOW PARTITIONS TestDb.AutoPartitionedTable1

and we’ll see:

numpets=0
numpets=2
numpets=3
numpets=1

Querying the Data

Just like before, we can query our data using standard SQL:

SELECT * FROM AutoPartitionedTable1
WHERE numPets > 1

TL;DR

Partitioning your data can dramatically reduce the amount of data scanned during your Athena queries. If you can control the format of the object key names in S3, you can take advantage of Athena’s ability to automatically load the partitions for you. This eliminates the need to manually issue ALTER TABLE statements for each partition, one-by-one.

Please note, by default Athena has a limit of 20,000 partitions per table. This limit can be raised by contacting AWS Support. Whatever limit you have, ensure your data stays below that limit.

Articles In This Series

About Skeddly

Skeddly is the leading scheduling service for your AWS account. Using Skeddly, you can:

Reduce your AWS costs,
Schedule snapshots and images, and
Automate many DevOps and IT tasks.

Sign-up for our 30 day free trial or sign-in to your Skeddly account to get started.

By Matt Houser on Dec 22, 2016 in Amazon Athena | Permalink

Subscribe via RSS

Skeddly Blog