The main advantages of Apache Hadoop are:
- scalability: because Hadoop is a free platform built on a cluster architecture, a cluster can easily be expanded with additional servers in a way that is transparent to the data already stored and the processes already defined
- flexibility: the multitude of tools in the Hadoop ecosystem makes it possible to process both structured and unstructured data (the latter being the more common case in Big Data)
- fault tolerance: thanks to data replication and tools that let the cluster run in High Availability mode, Hadoop offers consistent, continuous access to stored data even when individual servers fail
- data processing speed: distributed processing handles huge data volumes much faster than standard ETL mechanisms and conventional batch processing
- efficient resource management: tasks are allocated among the machines so as to make full use of the cluster's computing power.
These features make Apache Hadoop one of the most commonly chosen platforms for building the backbone of complex Big Data solutions.
Apache Hadoop is used by companies such as Adobe, eBay, Facebook, Google, IBM, Spotify, Twitter and Yahoo, as well as many other leading IT companies.
Hadoop consists of four core modules:
- Hadoop Common: a set of libraries and tools that support the other modules
- Hadoop Distributed File System (HDFS): a distributed file system that breaks data into smaller blocks and distributes them evenly across the cluster nodes at the appropriate replication level
- MapReduce: an implementation of a programming paradigm that makes it possible to process large amounts of data in a distributed way
- YARN (Yet Another Resource Negotiator): a platform for managing cluster resources.
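The MapReduce paradigm mentioned above can be illustrated with a minimal, single-machine sketch in plain Python. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical and chosen only for illustration. On a real cluster the map and reduce steps would run in parallel on many nodes, and the framework would perform the grouping step between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input standing in for data stored on the cluster.
    sample = ["big data on hadoop", "hadoop stores big files"]
    print(word_count(sample))
```

Because each call to map_phase depends only on its own input line, and each call to reduce_phase only on the values for its own key, both steps can be spread across machines without coordination, which is what makes the paradigm suitable for cluster processing.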
Apart from the core modules listed above, the Hadoop ecosystem comprises a wide selection of applications that facilitate access to cluster data and its processing, service monitoring, cluster administration and access management.
The most popular tools include:
- Hive
- HBase
- Pig
- Ambari
- Ranger
- Hue
- Spark
- Oozie
- Sqoop
- ZooKeeper
- Flume
The most popular distributions
Hadoop in its open source form is developed by the Apache Software Foundation. Apart from this standard solution, however, many companies offer their own distributions based on Apache Hadoop, extended with additional tools that together form a ready-to-use Big Data ecosystem. A further advantage of such distributions is vendor support for the entire ecosystem and not just for its individual modules.
The most popular Hadoop platform distributions are:
- Cloudera Distribution Including Apache Hadoop (CDH)
- Hortonworks Data Platform (HDP)
- MapR Converged Data Platform
BlueSoft’s experience with the Hadoop platform
BlueSoft is constantly building its Big Data competence portfolio and actively participates in Hadoop-based projects. We have ample experience in designing and implementing Hadoop clusters, building applications for data aggregation and processing, and creating comprehensive analytical models.