Druid distributed data store – what is it?

Druid – what is it?

Druid is a distributed, column-oriented data store designed for BI/OLAP-style queries on massive volumes of data. It is built for fast, efficient querying, aggregation and analysis of time series, that is, series of timestamped data points. Its architecture allows very low-latency queries to run against very large datasets.

Data storage and partitioning

Druid partitions data into segments based on the data timestamp. Segment sizing is usually part of system tuning because it affects performance; as a starting point, the Druid documentation recommends segment files of 300–700 MB. Druid allows multiple segments for the same time interval, in which case the segments form a block.
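
The time-based partitioning described above can be sketched roughly as follows. This is a minimal illustration, not Druid's implementation: the sample events, the `day_bucket` and `partition_by_day` helpers and the one-day interval are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical timestamped events (field names are assumptions).
events = [
    {"timestamp": "2024-01-01T10:15:00Z", "country": "PL", "bytes": 120},
    {"timestamp": "2024-01-01T18:40:00Z", "country": "FR", "bytes": 300},
    {"timestamp": "2024-01-02T09:05:00Z", "country": "PL", "bytes": 80},
]


def day_bucket(ts: str) -> str:
    """Return the UTC day (the assumed segment interval) an event falls into."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%d")


def partition_by_day(rows):
    """Group rows by day, mimicking how data is split by time interval."""
    segments = defaultdict(list)
    for row in rows:
        segments[day_bucket(row["timestamp"])].append(row)
    return dict(segments)


segments = partition_by_day(events)
for interval, rows in sorted(segments.items()):
    print(interval, len(rows), "rows")
```

A query restricted to a single day then only needs to look at the rows stored under that day's key, which is the intuition behind partitioning by time.
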
Each data object (row) in Druid consists of three kinds of columns:
• Timestamp column – the event time, which is also used for partitioning
• Dimension columns – attributes describing the context of the data, such as country or product
• Metric columns – numerical columns that quantify an event and are the subject of aggregation and analysis
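
As a rough sketch of how the three kinds of columns work together, the example below aggregates a metric over a dimension. The rows, the column names and the `sum_metric` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical rows: one timestamp, two dimensions, one metric.
rows = [
    {"timestamp": "2024-01-01T10:15:00Z", "country": "PL", "product": "router", "bytes": 120},
    {"timestamp": "2024-01-01T18:40:00Z", "country": "FR", "product": "modem", "bytes": 300},
    {"timestamp": "2024-01-02T09:05:00Z", "country": "PL", "product": "router", "bytes": 80},
]

# Assumed column roles for this example.
TIMESTAMP = "timestamp"
DIMENSIONS = ["country", "product"]
METRICS = ["bytes"]


def sum_metric(rows, dimension, metric):
    """Sum a metric column for each distinct value of a dimension column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[metric]
    return dict(totals)


bytes_by_country = sum_metric(rows, "country", "bytes")
print(bytes_by_country)
```

Dimensions answer "grouped by what?", metrics answer "how much?", and the timestamp answers "when?".
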

Data aggregation and querying

Druid supports both exact and approximate calculations. Its approximate algorithms include:

  • HyperLogLog – distinct count approximation
  • Theta sketches – approximating results of set operations (union, intersection etc.)
  • TopN – quick ranking algorithm

Approximate algorithms significantly reduce calculation time while keeping the quality of results good (around 98% accuracy), which is acceptable in many applications.
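
To give a feel for the trade-off, here is a minimal sketch of approximate distinct counting using a K-Minimum-Values (KMV) estimator, a simple relative of the Theta sketch family. It is not Druid's implementation; the function names, the choice of SHA-256 and the value of `k` are assumptions made for the example.

```python
import hashlib
import heapq


def hash_to_unit_interval(value: str) -> float:
    """Map a string to a pseudo-random number in [0, 1) using SHA-256."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_distinct_estimate(values, k=512):
    """Estimate the number of distinct values from the k smallest hashes.

    For simplicity this sketch collects every distinct hash in a set;
    a real sketch would retain only the k smallest hashes in memory.
    """
    unique_hashes = {hash_to_unit_interval(v) for v in values}
    if len(unique_hashes) < k:
        # Fewer than k distinct hashes: the count is exact.
        return float(len(unique_hashes))
    kth_smallest = heapq.nsmallest(k, unique_hashes)[-1]
    # If n hashes are spread uniformly over [0, 1), the k-th smallest
    # sits near k / n, which gives the estimate (k - 1) / kth_smallest.
    return (k - 1) / kth_smallest


# 50,000 events generated by 10,000 distinct (hypothetical) users.
user_ids = [f"user-{i % 10_000}" for i in range(50_000)]
estimate = kmv_distinct_estimate(user_ids)
print(f"estimated distinct users: {estimate:.0f} (true value: 10000)")
```

A larger `k` gives a more accurate estimate at the cost of more memory, which is the same kind of trade-off the approximate algorithms above make.
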

Druid's native queries are JSON objects sent over HTTP, which makes it quite difficult for end users to query and analyze the data effectively. Using a third-party query tool, such as Apache Superset or Pivot, is highly recommended.
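
For illustration, a native JSON query might be built and sent along these lines. This is a sketch under stated assumptions: the Broker address, the `network_events` datasource and the `bytes` metric are hypothetical, and the `build_request` and `run_query` helpers are written only for this example.

```python
import json
import urllib.request

# Assumed Broker address; adjust host and port for a real cluster.
BROKER_URL = "http://localhost:8082/druid/v2/"

# A native timeseries query: hourly sum of a (hypothetical) "bytes" metric.
query = {
    "queryType": "timeseries",
    "dataSource": "network_events",  # hypothetical datasource name
    "granularity": "hour",
    "intervals": ["2024-01-01T00:00:00Z/2024-01-02T00:00:00Z"],
    "aggregations": [
        {"type": "longSum", "name": "total_bytes", "fieldName": "bytes"}
    ],
}


def build_request(query: dict, url: str = BROKER_URL) -> urllib.request.Request:
    """Wrap the query in an HTTP POST request with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: dict, url: str = BROKER_URL):
    """Send the query and decode the JSON response (needs a running cluster)."""
    with urllib.request.urlopen(build_request(query, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Build the request without sending it, so the example runs offline.
request = build_request(query)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Even this small query shows why hand-writing JSON is awkward for analysts and why a visual query tool is usually placed in front of Druid.
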

Where to use it?

Druid is well regarded in areas such as network activity analysis, cloud security and IoT sensor data analysis. Apache Druid is a good choice when you need:
• Highly efficient aggregation and analysis of time-series data
• Real-time data analytics
• Support for extremely large data volumes (hundreds of millions of events)
• A highly available solution

Who uses it?

Thanks to its performance, Druid was quickly adopted by many companies, including Netflix, Alibaba, Airbnb, eBay, Cisco, PayPal and Yahoo.
