

Apache Hadoop

Apache Hadoop is an open-source platform for distributed storage and processing of large data sets.

The main advantages of Apache Hadoop are:

scalability

as a free platform based on a cluster architecture, a Hadoop cluster can easily be expanded with additional servers, transparently to already stored data and defined processes

flexibility

the multitude of tools in the Hadoop ecosystem makes it possible to process both structured and unstructured data (the latter being most common in Big Data)

fault tolerance

thanks to data replication and tools that let the cluster run in High Availability mode, it offers consistent, continuous access to stored data even when individual servers fail

data processing speed

distributed data processing makes handling huge data volumes much faster than standard ETL and batch-processing mechanisms

efficient resource management

tasks are allocated among the machines so as to fully utilize the cluster's capacity.

These features make Apache Hadoop one of the most commonly chosen backbones for complex Big Data solutions.

Apache Hadoop is used by such companies as Adobe, eBay, Facebook, IBM, Spotify, Twitter, Yahoo and many other leading IT companies.

The main components

Hadoop consists of four core modules:

Hadoop Common

a set of shared libraries and utilities supporting the other modules

Hadoop Distributed File System (HDFS)

a distributed file system that splits data into blocks and spreads them evenly across cluster nodes with a configurable replication level
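To illustrate the block splitting and replication described above, the short sketch below (plain Python, not a Hadoop API) computes how a file is divided into blocks and how much raw cluster storage it occupies. The 128 MB block size and replication factor of 3 are assumed defaults; both are configurable per cluster and per file.

```python
# Illustrative sketch of HDFS storage arithmetic.
# Assumptions (configurable in a real cluster, not fixed):
#   dfs.blocksize   = 128 MB
#   dfs.replication = 3
import math

BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb: float):
    """Return (number of blocks, total raw storage in MB) for one file."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored on REPLICATION different nodes, so the raw
    # storage consumed is the logical file size times the replication factor.
    raw_storage = file_size_mb * REPLICATION
    return blocks, raw_storage

blocks, raw = hdfs_footprint(1000)  # a 1000 MB file
print(blocks, raw)                  # → 8 3000.0
```

The point of the example is the trade-off behind fault tolerance: replication triples the storage cost, but any single block remains readable after the loss of up to two of its host nodes.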

MapReduce

a programming paradigm implementation which makes it possible to process large amounts of data in a distributed way
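To make the paradigm concrete, here is a minimal word-count sketch in plain Python that simulates the map, shuffle and reduce phases in a single process. The function names are illustrative; in a real Hadoop job these phases run distributed across the cluster, typically as Java Mapper/Reducer classes or Hadoop Streaming scripts.

```python
# In-memory simulation of the MapReduce paradigm (word count).
from collections import defaultdict

def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word: str, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

def word_count(lines):
    # Shuffle: group intermediate pairs by key. In real Hadoop the
    # framework performs this step between the map and reduce phases.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_count(["big data big cluster", "data data"]))
# → {'big': 2, 'data': 3, 'cluster': 1}
```

Because each map call depends only on its own input line and each reduce call only on one key's values, the framework can spread both phases across many nodes, which is what makes the model scale.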

YARN (Yet Another Resource Negotiator)

a platform for managing cluster resources.

Apart from the core modules referred to above, the Hadoop ecosystem comprises a wide selection of applications that facilitate access to cluster data, data processing, service monitoring, cluster administration and access management. The most popular tools include:

• Hive
• HBase
• Pig
• Ambari
• Ranger
• Hue
• Spark
• Oozie
• Sqoop
• ZooKeeper
• Flume

The most popular distributions

Hadoop in its open-source form is developed by the Apache Software Foundation. Apart from this standard solution, however, many companies offer their own distributions based on Apache Hadoop, extended with additional tools to form a ready-to-use Big Data ecosystem. A further advantage of such distributions is vendor support for the entire ecosystem rather than for individual modules only.

The most popular Hadoop platform distributions are:

Cloudera Distribution Including Apache Hadoop (CDH)

Hortonworks Data Platform (HDP)

MapR Converged Data Platform

BlueSoft’s experience with the Hadoop platform

BlueSoft is constantly building its Big Data competence portfolio and actively participates in Hadoop-based projects. We have ample experience in designing and implementing Hadoop clusters, creating applications for data aggregation and processing, and building comprehensive analytical models.

See other technologies we use in this area

NoSQL / BigData