Partitioning Your Data With Amazon Athena

Amazon Athena pricing is based on the bytes scanned. Anything you can do to reduce the amount of data that’s being scanned will help reduce your Amazon Athena query costs.

In our previous article, Getting Started with Amazon Athena, JSON Edition, we stored JSON data in Amazon S3, then used Athena to query that data.

In this article, we will partition the data, and compare the results.

Partitioning data means that we are splitting the data up into related groups of data. When querying data, if the query is constrained to a particular partition, then you can quickly eliminate all other partitions.

Our Input Data

Like the previous articles, our data is JSON data. One record per file. One record per line:

{ "name": "Albert", "numPets": 1, "birthDate":  "1970-01-01 07:20:00", "LargeDataItem": "..." }

For our unpartitioned data, we placed the data files in our S3 bucket in a flat list of objects without any hierarchy. Here are our unpartitioned files:

$ aws s3 ls s3://athena-testing-1/Unpartitioned/ --recursive
2016-12-19 20:53:14       2907 Unpartitioned/Data1.json
2016-12-19 20:53:14       2906 Unpartitioned/Data2.json
2016-12-19 20:53:14       2908 Unpartitioned/Data3.json
2016-12-19 20:53:14       2907 Unpartitioned/Data4.json
2016-12-19 20:53:14       2905 Unpartitioned/Data5.json

Here are our partitioned files:

$ aws s3 ls s3://athena-testing-1/Partitioned/ --recursive
2016-12-19 20:52:52       2908 Partitioned/0/Data3.json
2016-12-19 20:52:52       2905 Partitioned/0/Data5.json
2016-12-19 20:52:52       2907 Partitioned/1/Data1.json
2016-12-19 20:52:52       2906 Partitioned/2/Data2.json
2016-12-19 20:52:52       2907 Partitioned/3/Data4.json

You’ll notice that the partitioned data is grouped into “folders”. In this example, the partitions are the value from the numPets property of the JSON data.

Creating the Tables

We are going to create 2 tables: one for the unpartitioned data, and one for the partitioned data. We used the following SQL script to create our unpartitioned data table:

CREATE EXTERNAL TABLE IF NOT EXISTS TestDb.UnpartitionedTable1 (
  `name` string,
  `numPets` int,
  `birthDate` timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://athena-testing-1/Unpartitioned/'

And we used the following SQL script to create our partitioned data table:

CREATE EXTERNAL TABLE IF NOT EXISTS TestDb.PartitionedTable1 (
  `name` string,
  `birthDate` timestamp
)
PARTITIONED BY (numPets int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://athena-testing-1/Partitioned/'

Notice that the numPets column was removed from the list of columns. Instead, we define it in the PARTITIONED BY clause of the SQL statement.

Adding Partitions

When partitioning your data, you need to load the partitions into the table before you can start querying the data. There are two ways to load your partitions.

Manually add each partition using an ALTER TABLE statement. This needs to be explicitly done for each partition.
Automatically add your partitions using a single MSCK REPAIR TABLE statement. In order to do this, your object key names must conform to a specific pattern.

In this example, we will manually add our partitions. So we executed the following statements:

ALTER TABLE TestDb.PartitionedTable1 ADD PARTITION (numPets=0) location 's3://athena-testing-1/Partitioned/0/'
ALTER TABLE TestDb.PartitionedTable1 ADD PARTITION (numPets=1) location 's3://athena-testing-1/Partitioned/1/'
ALTER TABLE TestDb.PartitionedTable1 ADD PARTITION (numPets=2) location 's3://athena-testing-1/Partitioned/2/'
ALTER TABLE TestDb.PartitionedTable1 ADD PARTITION (numPets=3) location 's3://athena-testing-1/Partitioned/3/'

Each statement needs to be executed on it’s own.

Querying the Data

With our tables in place, we can execute a SELECT query against the unpartitioned data:

SELECT * FROM UnpartitionedTable1
WHERE numPets > 1

Now, executing the same query against the partitioned table:

SELECT * FROM PartitionedTable1
WHERE numPets > 1

Notice the results from the queries are the same. But the data scanned when querying the partitioned data is significantly less than the data scanned when querying the unpartitioned data.

Table	Rows Returned	Data Scanned
UnpartitionedTable1	2	14.2 KB
PartitionedTable1	2	5.68 KB

TL;DR

If your queries are going to be commonly constrained by a particular column, partitioning your data on that column can be an effective method of reducing the amount of data scanned during your queries. This will reduce your Athena query costs dramatically.

Next, we’ll take a look at automatically partitioning your data so you don’t need to manually add each partition.

Articles In This Series

About Skeddly

Skeddly is the leading scheduling service for your AWS account. Using Skeddly, you can:

Reduce your AWS costs,
Schedule snapshots and images, and
Automate many DevOps and IT tasks.

Sign-up for our 30 day free trial or sign-in to your Skeddly account to get started.

By Matt Houser on Dec 21, 2016 in Amazon Athena | Permalink

Subscribe via RSS

Skeddly Blog