Big Data describes complex datasets that are too large for traditional data-processing software to handle, owing to the ever-increasing volume, velocity, and variety of data. These three terms – volume, velocity, and variety – are commonly known as the “three V’s” of Big Data:
- Volume – a Big Data set ranges from terabytes to petabytes of data.
- Variety – the data comes from various sources and in different formats; for instance, server logs or online financial transactions.
- Velocity – in Big Data, data needs to be collected, stored, processed, and analyzed within short timeframes (from real time to once per day).
Analyzing extremely large datasets requires significant computing capacity, which can be difficult and expensive to provide. This is where the cloud comes in: Big Data workloads are ideally suited to the cloud computing model. You can choose as robust an environment as you need and resize it on demand. You no longer have to wait for new hardware to ship or keep redundant resources on hand.
AWS Tools and Services for Big Data
Amazon Web Services (AWS) offers a broad range of services to help you build and deploy Big Data analytics quickly. The company also gives you many options for sending data to the cloud. The large variety of tools and services may seem overwhelming, but it’s just a toolset. You don’t have to use them all. Pick the right tools for the job, start small with one or two services, and when you feel comfortable, add new ones.
In any case, it’s always good to know your options, so let’s take a look at the available services and find out how Big Data works on AWS.
Describing every possible use case for these services is impossible. Instead, we will focus on what they do and how you can use them. You will have to do your homework to decide how to adapt these tools to your own needs. Shall we begin?
Collect your data with Amazon Kinesis
You can think of Big Data as a kind of batch process in which data is collected, processed, and analyzed; at the end, the process produces some output and visualizes it.
When dealing with Big Data on AWS, you’ll most likely use S3 (Simple Storage Service) to store your data. There are different ways to transfer data into S3: you can upload it manually, use the AWS Import/Export service, or use other services that can write to S3, such as Amazon Kinesis.
Amazon Kinesis is a kind of gateway to Big Data solutions. The service lets you easily load streaming data into AWS with Kinesis Firehose, which can capture, transform, and load streaming data not only into S3 but also into other services such as Redshift or Elasticsearch.
If you need to do more than just load streaming data into other services, use Kinesis Streams, the more customizable option. It lets you run custom processing on streaming data, which is impossible with Firehose. Kinesis Streams is convenient when you need to move data rapidly off data sources and process it continuously.
Here are some typical scenarios for its usage:
- Real-time data analytics on streaming data
- Log and data feed intake and processing
- Real-time metrics and reporting
Keep in mind that Kinesis Streams retains data for 24 hours, which you can extend up to 7 days for an extra fee. If you need to store data for longer, consider moving it to S3, Glacier, Redshift, or DynamoDB.
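Conceptually, each record you send to a Kinesis stream is a data blob plus a partition key that decides which shard receives it. Here is a minimal, hedged sketch (pure Python, with a hypothetical `user_id` field and stream name) of how you might shape log events into Kinesis-style records; actual ingestion would go through an AWS SDK call such as boto3's `put_records`:

```python
import json

def build_kinesis_records(events, key_field="user_id"):
    """Shape raw events into Kinesis-style records: a data blob plus
    a partition key that determines which shard receives the record."""
    records = []
    for event in events:
        records.append({
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event[key_field]),
        })
    return records

events = [
    {"user_id": 42, "action": "login"},
    {"user_id": 7, "action": "purchase", "amount": 19.99},
]
records = build_kinesis_records(events)
# With boto3 you would then send the batch, e.g.:
#   boto3.client("kinesis").put_records(StreamName="my-stream", Records=records)
print(records[0]["PartitionKey"])  # "42"
```

Records that share a partition key land on the same shard, which preserves their ordering; spreading keys evenly spreads the load.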
Use Amazon EMR for processing
Once your data is collected, it’s time for further processing and analysis. Depending on your needs, different tools can be used.
When talking about Big Data, Hadoop is one of the first names that comes to mind. It is an open-source framework for distributed storage and processing of large datasets. Amazon provides the Hadoop framework as a managed service called Amazon EMR, which makes processing vast amounts of data across dynamically scalable EC2 instances easy, fast, and cost-effective.
Hadoop processing can be combined with several AWS products to enable such tasks as web indexing, data mining, log file analysis, machine learning, scientific simulation, and data warehousing.
If you prefer to use frameworks other than Hadoop, you can choose from HBase, Presto, and Apache Spark. (In the EMR console, you can also select the software configuration manually using the advanced options.)
These frameworks, combined with related software such as Hive or Pig, enable data processing for analytics and business intelligence workloads. With Amazon EMR you can also transform and move large amounts of data into and out of other AWS stores and databases.
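To make the processing model concrete, here is a toy, single-machine sketch of the MapReduce pattern that Hadoop distributes across a cluster (and that EMR manages for you). The word-count example and input lines are illustrative only, not EMR code:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one line of input.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each key, as a Hadoop reducer would.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data on aws", "big data in the cloud"]
counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(counts["big"])  # 2
```

On a real cluster, the map calls run in parallel on many nodes and the framework shuffles each key's pairs to a reducer; the logic you write stays this simple.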
Leverage Redshift for easy analytics
Once your data has been transformed with Amazon EMR (formatted and cleaned), it can be moved to S3, where other tools, such as Amazon Redshift, can consume it.
Redshift is a fast, fully managed data warehouse that lets you analyze data using standard SQL and business intelligence tools. You can efficiently run complex queries against petabytes of structured data. Redshift uses query optimization, columnar storage on high-performance local disks, and parallel query execution, which makes it very fast; most results come back in seconds. Data can be loaded into Redshift from Kinesis Firehose, for example, or copied from S3.
You can also run queries directly against data in S3 with Amazon Redshift Spectrum. This requires no ETL or loading, and you can use exactly the same SQL as you do with Amazon Redshift.
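Why does columnar storage make aggregate queries so fast? A rough sketch with made-up order data: in a row layout, summing one field means touching every record, while in a column layout (the approach Redshift uses) the same aggregate scans a single contiguous array and skips the other columns entirely:

```python
# Row-oriented layout: each record is stored together, so an aggregate
# over one field still has to walk through every full record.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

# Column-oriented layout: each column is a contiguous array, so
# SUM(amount) reads one compact array and ignores the rest.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 80.0, 50.0],
}

total = sum(columns["amount"])
print(total)  # 250.0
```

Contiguous same-typed values also compress far better, which is part of why a warehouse can scan petabytes in seconds.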
Visualize data with Amazon QuickSight
For analytics visualization, you can use Amazon QuickSight. It is a business analytics service that lets you build visualizations, perform ad-hoc analysis and gain business insights from your data.
QuickSight uses SPICE (the “Super-fast, Parallel, In-memory Calculation Engine”), a combination of columnar storage, in-memory technologies, machine-code generation, and data compression. It allows you to run interactive queries on large datasets and get rapid responses.
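To give a feel for the columnar compression techniques an engine like SPICE combines, here is a tiny illustrative sketch of run-length encoding, one common way a low-cardinality column shrinks to a fraction of its size (this is a generic technique, not SPICE's actual internals):

```python
def run_length_encode(values):
    """Compress a column into (value, count) pairs; repeated adjacent
    values collapse into a single entry."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

column = ["EU", "EU", "EU", "US", "US", "EU"]
print(run_length_encode(column))  # [['EU', 3], ['US', 2], ['EU', 1]]
```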
As input, QuickSight can use not only Amazon Redshift but also other sources such as Amazon RDS, Amazon Aurora, Amazon Athena, Amazon S3, Amazon EMR (Presto and Spark), SQL Server, MySQL, PostgreSQL, or simple CSV and Excel files.
And now for something completely different…
So far in this post, we’ve followed the traditional path from data collection to visualization and suggested the applicable tools. However, AWS makes it possible for you to do much more with your data. Here are some other tools that you may want to know.
AWS Lambda can take part in processing data that is then fed to Amazon Machine Learning. AWS Data Pipeline can schedule regular data movement and processing within the AWS cloud. Based on that schedule, you can run regular processing activities such as distributed data copies, SQL transforms, MapReduce applications, or custom scripts against S3, RDS, or DynamoDB.
When an application uses large amounts of data, the search feature becomes critical. Among other analytical tools, Amazon provides two search engines – Elasticsearch and CloudSearch. Both are quite similar and built on proven technologies.
- Amazon Elasticsearch Service helps you deploy, operate, and scale Elasticsearch clusters on AWS. It uses the open-source Elasticsearch software developed by Elastic.
- Amazon CloudSearch is a fully managed service whose advantage over self-managed search lies in auto-scaling, self-healing clusters, and high availability with Multi-AZ.
Need more? Try AWS Glue and Amazon Athena
The most recent AWS Big Data services are AWS Glue and Amazon Athena. Both make Big Data in the cloud easier than ever.
With AWS Glue you can extract, transform, and load data for analytics. It is a pay-as-you-go service, so you pay only when you use it. Glue simplifies and automates the time-consuming steps of data preparation: the service can automatically discover and profile your data via the Glue Data Catalog, recommend and generate ETL code to transform it, and run ETL jobs on a fully managed Apache Spark environment to load the data into its destination. Glue can automatically discover structured and semi-structured data stored in Amazon S3, Redshift, and other databases running on AWS.
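Glue generates its ETL jobs as Apache Spark code, but the extract-transform-load pattern itself is easy to sketch in plain Python. The toy pipeline below (made-up CSV input and field names) shows the three steps a Glue job automates at scale: parse the source, normalize types and formats, and serialize to a destination-ready format:

```python
import csv
import io
import json

RAW = """order_id,amount,currency
1, 19.99 ,usd
2,5.00,EUR
"""

def extract(text):
    # Extract: parse the raw CSV source into dictionaries.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: trim whitespace, normalize types and casing.
    return [
        {"order_id": int(r["order_id"]),
         "amount": float(r["amount"].strip()),
         "currency": r["currency"].strip().upper()}
        for r in rows
    ]

def load(rows):
    # Load: serialize to JSON lines, ready to land in a store like S3.
    return "\n".join(json.dumps(r) for r in rows)

output = load(transform(extract(RAW)))
print(output.splitlines()[0])
```

In a real Glue job, the Data Catalog supplies the schema and Spark runs these steps in parallel across the dataset.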
We’ve briefly discussed almost all the Big Data tools provided by AWS, but the best is yet to come. Meet Amazon Athena – a simple, interactive query service that allows you to analyze data in Amazon S3 using standard SQL. There’s no need for a data warehouse or Amazon EMR: you just point to your data, run a SQL query, and get the results. It’s an ideal solution for ad-hoc analyses.
As input for Athena, you can use a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet, and Avro. Athena integrates out of the box with the AWS Glue Data Catalog and Amazon QuickSight.
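The appeal of Athena is schema-on-read: you run standard SQL directly over flat files without a load step. The sketch below uses Python's built-in sqlite3 as a stand-in (it is not Athena, and the CSV data is made up) purely to illustrate that idea of pointing SQL at raw CSV rows:

```python
import csv
import io
import sqlite3

CSV_DATA = """region,amount
EU,120.0
US,80.0
EU,50.0
"""

# Read the flat file as-is; the schema is applied only at query time.
rows = list(csv.reader(io.StringIO(CSV_DATA)))[1:]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# Standard SQL over the raw data, no warehouse loading pipeline:
result = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(result)  # [('EU', 170.0), ('US', 80.0)]
```

With Athena, the table definition lives in the Glue Data Catalog and the files stay in S3; you pay per query rather than for an always-on cluster.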
If you want to run ad-hoc queries without aggregating and loading data, Athena is a perfect choice. For long-term processing and analysis, however, it’s worth considering Redshift or EMR, as they may be cheaper for that particular use.
Gotta get them all? Better customize
After this overview of Big Data tools and services in AWS, you should already see that Amazon offers a wide range of products for processing large datasets in the cloud.
As people continue to generate and collect more and more data, AWS offers a way to process, analyze, and visualize it without on-premises resources. Services in AWS are easy to deploy and manage, and they can help you optimize costs.
However, as mentioned above, with so many tools in the AWS portfolio, it’s easy to fall victim to choice overload. If you don’t know what to choose or how to start, especially when some services overlap with others, pick two or three tools and start playing with them. Once you’re familiar with those, add another, and then the next, to your toolbox; eventually you’ll have built your tailor-made Big Data solution.