Sep 26, 2018

Druid distributed data store – what is it?

Table of Contents

Druid – what is it?
Data storage and partitioning
Data aggregation and querying
Where to use?
Who uses?

Druid – what is it?

Druid is a distributed, column-oriented data store designed to support BI/OLAP-style queries on massive volumes of data. It is built for quick and efficient querying, aggregation and analysis of time series, that is, series of timestamped data points. Its architecture allows very low-latency queries to run against very large datasets.

Data storage and partitioning

Druid partitions data objects into segments based on the data timestamp. Sizing segment files is usually part of system tuning because it affects performance; the Druid documentation recommends segment file sizes between 300 and 700 MB. Druid allows multiple segments for the same time interval, in which case the segments form a block.
Each data object in Druid can be divided into three separate parts:
• Timestamp column – the time of the event, which drives partitioning into segments
• Dimension columns – attributes describing the context of data like country, product etc.
• Metric columns – numerical columns with quantitative assessment of an event being subject to analysis and aggregation
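
The time-based partitioning and the three-part data model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Druid's actual segment format: the event fields (country, product, revenue), the hourly bucket size and the helper names are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Hypothetical events: one timestamp, two dimensions, one metric each.
EVENTS = [
    {"timestamp": "2018-09-26T10:15:00Z", "country": "PL", "product": "A", "revenue": 12.5},
    {"timestamp": "2018-09-26T10:45:00Z", "country": "DE", "product": "B", "revenue": 7.0},
    {"timestamp": "2018-09-26T11:05:00Z", "country": "PL", "product": "A", "revenue": 3.5},
]


def hour_bucket(ts: str) -> str:
    """Truncate an ISO-8601 UTC timestamp to the start of its hour."""
    parsed = datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)
    return parsed.replace(minute=0, second=0, microsecond=0).strftime(TS_FORMAT)


def partition_by_hour(events):
    """Group events into per-hour buckets, mimicking time-based segments."""
    segments = defaultdict(list)
    for event in events:
        segments[hour_bucket(event["timestamp"])].append(event)
    return dict(segments)


def to_columns(rows):
    """Pivot row-oriented events into one list per column (columnar layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


SEGMENTS = partition_by_hour(EVENTS)
FIRST_SEGMENT_COLUMNS = to_columns(SEGMENTS["2018-09-26T10:00:00Z"])
```

Storing each column separately is what lets a query that touches only one dimension and one metric skip the rest of the data.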

Data aggregation and querying

Druid employs both exact and approximate calculation algorithms. The approximate ones include:

• HyperLogLog – distinct count approximation
• Theta sketches – approximating results of set operations (union, intersection etc.)
• TopN – fast ranking of the top values of a dimension

Approximate algorithms significantly reduce calculation time while maintaining good result quality (~98% accuracy), which is acceptable in many applications.
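
To give a feel for how an approximate distinct count trades accuracy for memory, here is a minimal sketch of a k-minimum-values (KMV) estimator, a simple technique related in spirit to theta sketches. It is not Druid's implementation and not HyperLogLog; the function names and the default k are assumptions chosen for illustration.

```python
import hashlib
import heapq


def unit_hash(value: str) -> float:
    """Map a string to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_distinct_estimate(values, k=1024):
    """Estimate the number of distinct values while keeping only k hashes."""
    kept = set()  # the k smallest distinct hashes seen so far
    heap = []     # the same hashes, negated, so heap[0] is the largest kept one
    for value in values:
        h = unit_hash(value)
        if h in kept:
            continue  # duplicate of a value already tracked
        if len(kept) < k:
            kept.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(kept) < k:
        return float(len(kept))  # fewer than k distinct values: count is exact
    # The k-th smallest hash of n uniform values sits near k / n.
    return (k - 1) / -heap[0]
```

Memory use is bounded by k regardless of how many events are scanned, and a larger k gives a tighter estimate at the cost of more memory.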

Druid uses JSON over HTTP as its query language, which makes it quite difficult for end users to query and analyze the data effectively. Using a third-party query tool, such as Apache Superset or Pivot, is highly recommended.
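
As an illustration of what a JSON-over-HTTP query looks like, the sketch below builds a native timeseries query and wraps it in a POST request without sending it. The data source name (sales_events), the metric column (revenue) and the broker address are assumptions made for the example; adjust them to your own cluster.

```python
import json
import urllib.request


def build_timeseries_query(data_source, interval, metric):
    """Build a native Druid timeseries query summing one metric per hour."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "hour",
        "intervals": [interval],
        "aggregations": [
            {"type": "doubleSum", "name": f"total_{metric}", "fieldName": metric}
        ],
    }


def build_request(broker_url, query):
    """Wrap the query in an HTTP POST request; nothing is sent here."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        broker_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical data source, metric and broker address.
QUERY = build_timeseries_query("sales_events", "2018-09-26/2018-09-27", "revenue")
REQUEST = build_request("http://localhost:8082/druid/v2/", QUERY)
```

Against a running broker, passing REQUEST to urllib.request.urlopen would return the results as JSON. Even this small query shows why hand-written JSON is awkward for ad hoc analysis.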

Where to use?

Druid is highly regarded in areas such as network activity analysis, cloud security and IoT sensor data analysis. Apache Druid is the tool of choice for:
• Highly efficient aggregation and analysis of time-series data
• Real-time data analytics
• Extremely large data volumes (hundreds of millions of events)
• Highly available deployments

Who uses?

Thanks to its performance, Druid was quickly adopted by many companies, including Netflix, Alibaba, Airbnb, eBay, Cisco, PayPal and Yahoo.