In our last article, we took a closer look at how to start your first project in Amazon EMR. As you probably remember, it is a very powerful tool, but it can be tricky for inexperienced users. To analyze data with EMR, you need to provide a script to map and reduce your data, and often a third-party framework such as Spark, Presto or Hive. As a result, using Amazon EMR can be demanding, time-consuming and requires some extra knowledge.
So, what should you do if you just need to quickly analyze some data, such as your network devices' log files for troubleshooting, and have neither the time nor the desire to learn everything hiding behind EMR? The answer is Amazon Athena, which is a perfect choice for such cases. It makes it easy to run interactive queries against data directly in Amazon S3 using standard SQL. You don't need to prepare the data with EMR first. All you need to do is point to where your data is, choose its format and describe its structure. Within moments, you can run SQL queries against this data and get results.
Although Athena is a great tool, it will not solve all your problems. For more sophisticated data analysis you should choose more advanced tools like EMR or Redshift, but remember that you can still use Athena against data processed by EMR, for instance. Athena's data catalog is Hive-compatible, as Athena uses Hive for DDL (Data Definition Language). If you already have a Hive metastore on EMR, you can simplify your DDL statements on Amazon Athena and start querying your data without impacting your EMR jobs.
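To give you a feel for what "Hive-compatible DDL" means, here is a minimal sketch of an Athena table definition; the table, column and bucket names are made up for illustration:

```sql
-- Hive-style DDL as accepted by Athena; names and bucket are illustrative
CREATE EXTERNAL TABLE IF NOT EXISTS app_logs (
  request_time string,
  client_ip    string,
  status_code  int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/logs/';
```

The table only describes data that already sits in S3, so creating or dropping it never touches the underlying files.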
Going deeper – pricing
OK, you already know that Athena is best suited for quick, ad hoc data processing directly from S3. It's perfect for analyzing logs from ELBs or data delivered by Amazon Kinesis, for instance. But before I show you how to start your first project in Athena, let's discuss its pricing and a few related aspects.
Amazon Athena is priced per query. Charges are based on the amount of data scanned by Athena. You are not charged for failed queries, but you are charged for queries you start and then cancel. This means that when you analyze large amounts of data, you may pay a lot, so it's worth reducing the amount scanned. There are a few options.
First, you may compress your data. Second, you may partition it. Third, you may convert it to a columnar storage format. All of these operations reduce the amount of data that needs to be scanned to execute a query, and therefore you pay less.
Data compression is well known, so we won't dwell on it; just note that Athena accepts data compressed with Snappy, Zlib, LZO and GZIP. What about partitioning and converting to a columnar format?
Well, partitioning means splitting data into partitions. If your data is in a time-series format, you can split it across multiple dimensions like year/month/day or region/country/city to restrict Athena to querying only a specific partition. For instance, if you want to query a specific item within a specific day, you can query only the data from that particular day instead of scanning the entire data set, which reduces costs and improves performance.
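As a sketch, a time-series table partitioned by day could be declared and queried like this (table, column and bucket names are illustrative):

```sql
-- Partition columns are declared separately from the data columns
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  item  string,
  value double
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/events/';

-- Filtering on partition columns limits the data Athena has to scan
SELECT item, value
FROM events
WHERE year = '2017' AND month = '03' AND day = '15';
```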
Converting to a columnar format is another option to cut costs and improve performance. Athena can work on data converted to open-source columnar formats such as Apache Parquet or ORC.
To convert data, you can use an Amazon EMR cluster with Hive. Once it's done, you can create an Athena table that uses the data produced by the cluster.
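On the EMR side, the conversion itself can be as simple as a Hive CREATE TABLE ... AS SELECT into a Parquet-backed table. A sketch, assuming an existing text-format table named events and an illustrative bucket:

```sql
-- Hive on EMR: rewrite an existing text-format table as Parquet on S3
CREATE TABLE events_parquet
STORED AS PARQUET
LOCATION 's3://my-example-bucket/events-parquet/'
AS SELECT * FROM events;
```

Afterwards you point an Athena table at the new S3 location and query the Parquet files directly.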
And deeper – starting a project
You know the basics, so let's start our first project in Athena. We will work on data freely available for download from https://catalog.data.gov. It is a file with the total population (in millions) by country, 1980 to 2010, compiled by the Energy Information Administration.
The data format looks like the sample below.
We will remove the very first row, which is a header with column names. It will not be needed, as we will create our own column headers.
To begin, the data has to be stored in an S3 folder, so create one and upload the data file. Of course, you can use an existing folder with existing data, such as ELB logs.
Once the data is on S3, it's time to create our first table. Go to Athena, open the Catalog Manager and click Add Table.
Some basic information has to be provided. You can create a new database or add the table to an existing one, provide a table name and point to the data location, which is an S3 bucket. Click Next.
Then you have to specify the input data format. We will use CSV, as it matches our data file. For other input formats, use the corresponding settings. Click Next.
In the third step, we need to create the table structure, that is, columns with names and types. Try to avoid column names made of digits only, as this will ease future data processing.
For the data types per column, you can choose from the options below.
You can add columns one by one or use the bulk add function. As our data file contains many columns (it is one column in fact, but it will be divided into many), one per year, we will use bulk add.
Use the name and type pairs separated by commas, then click Add and Next.
In the last step, you have to decide whether your data is partitioned. As ours is not, we will skip this step.
Once the setup is completed, click Create table and that's it. The query below is executed and the new table is created.
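The generated DDL follows roughly the pattern below; this is a sketch of what the wizard produces, with an illustrative database name, bucket and year columns, and the exact SerDe clauses may differ:

```sql
-- Sketch of the wizard-generated table; names and bucket are illustrative
CREATE EXTERNAL TABLE IF NOT EXISTS mydatabase.populationByCountry (
  country string,
  y1980   double,
  -- ... one column per year ...
  y2010   double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/population/';
```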
You can now work on your data using SQL queries. Let's display all the data with a simple
SELECT * FROM populationByCountry;
statement. Try to avoid such queries on large amounts of data; limit the output instead.
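For instance, selecting only the columns you need and capping the row count keeps the scan small (the column names here assume the headers we created above):

```sql
-- Scan less data: pick specific columns and cap the number of rows
SELECT country, y2010
FROM populationByCountry
LIMIT 10;
```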
Notice that even for this relatively small amount of data, it took almost 3 seconds just to display all of it. Keep in mind that a query like this scans the whole data set, and the data scanned is what you are charged for.
Adding some partitioning
If you want to learn more, Athena provides a simple tutorial in which you create a table using ELB logs, which are provided publicly by Amazon. We will follow the same steps as the tutorial, but with a small addition: data partitioning.
Here are the properties of our new database:
Take a look at the Partitions tab. The data is stored in the S3 bucket using a /year/month/day structure, which has to be reflected in the partitions. You will see that the database uses a partitioned table, as it shows (Partitioned) next to the table name.
Before we can process partitioned data, one more step has to be done: we have to load the metadata for each partition. To load the partition for the /2015/01/01 folder, run the following:
ALTER TABLE elb ADD PARTITION (year='2015', month='01', day='01') LOCATION 's3://athena-examples-eu-west-1/elb/plaintext/2015/01/01/';
Remember to run ALTER TABLE whenever you want to start processing a new partition's data.
Once done, you can restrict queries to specific partitions using the WHERE clause. This makes Athena scans faster and cheaper.
You can also use more sophisticated queries. For instance, let's find all the 4xx and 5xx HTTP responses on 2015/01/02. How to do it? The query is quite simple: select the columns you need from the table with some conditions, and voilà! A sketch, with column names following the tutorial's table definition:

SELECT request_ip, backend_response_code
FROM elb
WHERE (backend_response_code LIKE '4%' OR
       backend_response_code LIKE '5%') AND
      year = '2015' AND month = '01' AND day = '02';
Once you finish, you can save results in CSV format for further processing.
This is just the beginning of what you can do with Athena. The possibilities are almost infinite. It's a perfect tool for ad hoc analysis of log files, for instance. It's fast, easy to use and does not require any complicated prerequisite tasks like EMR does. All you need is your data on S3, and you can start querying your logs with SQL. And you pay only for the processing you use. That's amazing. This is what cloud services are for.
You now have a few days to play with Athena, and then we will see you in the next part of our article series. Have a good time!