Leveraging 24/7 Availability and Performance for Distributed Real-Time Data Warehouses

Authors

Ricardo Jorge Santos
Jorge Bernardino
Marco Vieira

Abstract

Most enterprises now require near real-time Data Warehouses (DWs) that can handle continuous updates while providing 24/7 availability. Distributing data across clusters of shared-nothing machines using round-robin algorithms is a common way to improve performance. In this paper, we propose a solution for distributed DW databases that ensures continuous availability and handles frequent data loading requirements while introducing only a small performance overhead. We use a data striping and replication architecture to distribute portions of each fact table among pairs of slave nodes, where each slave node is an exact replica of its partner in the pair. This allows query execution to be balanced and any defective node to be replaced, ensuring the system’s continuous availability. The size of the portion assigned to a given node depends on that node's individual features, namely its performance benchmark measures and dedicated database RAM. The estimated cost of executing each query workload on each slave node is also used to balance and optimize query performance. We include experiments using the TPC-H decision support benchmark to evaluate the scalability of our solution and show that it outperforms standard round-robin distributed DW setups.
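
To make the capacity-aware striping idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that each pair of replica nodes is summarized by a single weight, taken here as the product of a benchmark score and the dedicated database RAM. That weighting, along with the names `NodePair` and `allocate_rows`, is a hypothetical choice made for illustration, since the abstract does not state the actual formula. Rows of a fact table are then split among the pairs in proportion to those weights, in contrast to round-robin, which would give every pair an equal share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePair:
    """A pair of replica slave nodes (hypothetical model for illustration)."""
    name: str
    benchmark_score: float  # assumed: higher means a faster node
    db_ram_gb: float        # RAM dedicated to the database, in GB


def allocate_rows(total_rows, pairs):
    """Split total_rows among pairs in proportion to an assumed weight.

    The weight (benchmark_score * db_ram_gb) is an illustrative assumption,
    not the formula used in the paper.
    """
    weights = [p.benchmark_score * p.db_ram_gb for p in pairs]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one pair must have a positive weight")

    # Exact (fractional) share per pair, then its integer floor.
    exact = [total_rows * w / total_weight for w in weights]
    shares = [int(x) for x in exact]

    # Hand out rows lost to rounding, largest fractional remainder first,
    # so that the shares always sum to total_rows.
    leftover = total_rows - sum(shares)
    by_remainder = sorted(
        range(len(pairs)), key=lambda i: exact[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1

    return {p.name: s for p, s in zip(pairs, shares)}


if __name__ == "__main__":
    pairs = [
        NodePair("pair-A", benchmark_score=1.0, db_ram_gb=4.0),
        NodePair("pair-B", benchmark_score=2.0, db_ram_gb=4.0),
        NodePair("pair-C", benchmark_score=1.0, db_ram_gb=8.0),
    ]
    print(allocate_rows(1000, pairs))
```

Because both nodes in a pair hold the same portion, the allocation is computed once per pair. Choosing which replica serves a given query, for example by estimated workload cost, would be a separate step that this sketch does not cover.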

Keywords

Real-time data warehousing; availability; fault tolerance; data replication and redundancy; distributed and parallel databases; load balancing; performance optimization

Subject

Real-Time Data Warehousing

Conference

COMPSAC 2012 - IEEE Signature Conference on Computer Software & Applications, July 2012
