So, you’ve read our post about Big Data on AWS and decided that this is the ideal solution for analyzing and processing your data. Now you know the basics and the tools, and so you’re determined to start your first Big Data project. However, you’re still feeling overwhelmed with all the choices, and don’t know where to begin.
Sounds familiar? Take it easy, we’re here to help and to offer you advice on how to start your first Big Data project on AWS. This is the first of a series of articles.
As you probably remember, AWS offers a variety of tools that let you handle large amounts of data in the cloud. You can use various cloud Amazon services to collect, store, transform, process, analyze and visualize data. In Part I of our manual we will deal with data processing. Let’s assume that you had already collected your data that now needs to be processed.
Ready to get started?
Amazon Elastic Map Reduce
The AWS service that you need to process your Big Data is Amazon Elastic MapReduce (Amazon EMR). It is a managed cluster platform that simplifies running Big Data frameworks on AWS. Amazon EMR provides a managed Hadoop framework and related open-source projects to enable processing and transforming data for analytics and business intelligence purposes in an easy, fast and cost-effective way.
But before we delve into Amazon EMR, let’s learn a bit more about MapReduce.
What is MapReduce?
Imagine that you’re running a global online store selling luxury goods. It’s profitable, but you’d like to get closer to your best customers, so you decide to open a few retail stores in the locations with the highest number of online shoppers. Fortunately, you keep all your order history in log files. Less luckily, you’ve just realized that your business does better than you expected, and so now you need to analyze hundreds of gigabytes of logs.
How can you take a look inside such massive data collections? How do you retrieve specific information from large data repositories? You could achieve this using a processing system that divides datasets into smaller jobs, distributes them to computing nodes, processes these input data sections in parallel, and finally combines them into the final output.
Here comes MapReduce.
MapReduce is a programming model with a distributed algorithm that can be running in parallel on a cluster. Basically, it performs two functions: map and reduce.
Usually, log files have a complicated structure and contain massive amounts of data. Your logs may include time and date of an order, customer name, order number, destination address, and much more. For simplicity’s sake, let’s assume that each line of your log file contains only timestamp for the order and a destination city.
The data to process may look as follows:
To process that input data with MapReduce, you first need a map function. Map function takes a key, value pair as an input and returns one or more key, value pairs as an output.
In your case, it will take one line from the input data as the input value and returns the output as <City, 1>.
For the above input data, it will look like:
The map function ends up here, but before the reduce function can start, MapReduce framework groups all the map output as a key, value. The key is the same as for the output of the map function, but the value is now the list of all values having the same key.
It will look as follows:
Each of these pairs will be an input for the reduce function. The function only has to sum up the number of elements on the input list and return the output key, value pair in the form of <City, number of occurrences>.
For your data it will look as follows:
And there we go! With these results, you may decide to open your retail store in City1.
Would you like to know to which country you sell most? Not a problem. Change the map function to map city to country and return key, value as follows: <Country, 1>. Easy-breezy.
AWS EMR basics
Super! You’ve learned the basics of the MapReduce framework. Let’s move to AWS. You already know that for data processing you’ll need the Amazon EMR service, which uses Hadoop.
Apache Hadoop is a platform for Big Data storage and processing, and it is the most popular open-source implementation of the MapReduce framework. It uses Hadoop Distributed File System (HDFS), which provides availability, scalability and fault tolerance, but also enables Hadoop to process data in a distributed manner.
Apart from Hadoop, EMR uses other open-source projects, such as:
- Apache Spark
- Apache HBase
Availability of the above applications depends on EMR release. You’ll find applications supported by a particular release of EMR here: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html.
You can choose the EMR version in the Advanced Options when creating an EMR cluster:
MapReduce requires functions to process data. So does EMR. You can use several different programming languages, for instance, Java, Python, .NET or Scala, and open-source applications like Hive or Pig to execute that task. The code can be automatically launched from S3, or you can use an interactive mode to start it manually.
Spinning up the first cluster
Let’s start with our first cluster. We can spin it up with Quick Options, which is the default way, or by using the Advanced Options.
Quick Options are much faster but have some restrictions like lack of debugging or automatic choice of VPC. With Quick Options, you also have to install specific, preconfigured sets of applications.
Another constraint of the Quick Options setting is that you can only use one instance type for all nodes. Only the latest EMR releases are available, and only the default security groups can be used (you can change it after the cluster is launched).
Only large and extra-large instances can be used for the EMR cluster. You can use spot instances, but you’ll need to launch the cluster with the Advanced Options in that case.
Advanced Options are advised if you require more cluster customization. For example, you may personalize software to launch, design instance types, choose VPC and security groups, etc.
With the Advanced Options, you can also configure bootstrap actions to install additional software and customize your applications. AWS maintains a repository of open-source bootstrap actions you can use, available here: https://github.com/awslabs/emr-bootstrap-actions.
When running an EMR cluster, you can choose between long-running and transient clusters. As the names suggest, long-running clusters stay up and running for a long time and are dedicated to different tasks than the transient ones.
If you want to run queries against a database or execute frequent jobs on a cluster, you should choose a long-running cluster. For clusters processing batch jobs that can be terminated after processing, use the transient setting. This will also help you reduce costs.
Launching EMR cluster
Launching a cluster is easy. First, from the AWS Services menu, the Analytics section, choose EMR. You will then get to the Amazon EMR service configuration.
Then click the [Create cluster] button.
You will see the configuration options. Set the name for your cluster, enable or disable logging, and choose if you want to run the cluster interactively or using the step execution.
Let’s select the “Cluster” mode. This will create a long-running cluster.
For transient clusters, you should check “Step execution” or use the Advanced Options and choose the step to execute with termination after the last step is completed.
Now, you need to choose EMR release and a preconfigured set of applications to run on the cluster. You also have to select instance type, number of instances and configure basic security settings. Remember that if you want to be able to log in to a cluster, you need to provide the EC2 key pair.
Finally, click [Create cluster] and your cluster will launch.
When it’s done, you can move on to your Big Data project. Congratulations!
You have just completed the first step towards your first Big Data project on AWS. Your cluster is up and running; the project has kicked off. Now it’s time to start compiling programs and using open-source applications that are available with EMR to process your data. EMR offers unlimited possibilities of data processing. You can process and analyze logs, Extract, Transform and Load (ETL), analyze clickstreams or even play with machine learning.
In the next part of the series, we will introduce more services that you can use for Big Data processing. For now, review what you’ve learned about EMR, and try to apply that in practice.
More information on the topic can be found in the project’s documentation: Hadoop MapReduce tutorial http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html or Hive documentation https://cwiki.apache.org/confluence/display/Hive/Home#Home-UserDocumentation.
Stay tuned for Part II!