Leveraging 24/7 Availability and Performance for Distributed Real-Time Data Warehouses
Abstract
Nowadays, most enterprises require near real-time Data Warehouses (DWs) that can handle continuous updates while providing 24/7 availability. Distributing data with round-robin algorithms across clusters of shared-nothing machines is a common way to improve performance. In this paper, we propose a solution for distributed DW databases that ensures their continuous availability and handles frequent data loading requirements while introducing only a small performance overhead. We use a data striping and replication architecture to distribute portions of each fact table among pairs of slave nodes. Each slave node is an exact replica of its partner in the pair. This allows query execution to be balanced and any defective node to be replaced, ensuring the system’s continuous availability. The size of the portion assigned to a given node depends on that node's individual features, namely its performance benchmark measures and dedicated database RAM. The estimated cost of executing each query workload in each slave node is also used for balancing and optimizing query performance. We include experiments using the TPC-H decision support benchmark to evaluate the scalability of our solution and show that it outperforms standard round-robin distributed DW setups.
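The idea of sizing each portion according to node capability can be illustrated with a minimal sketch. It is not the paper's actual algorithm: the `NodePair` type, the `portion_sizes` function, and the equal weighting of benchmark score and RAM are all assumptions made for illustration, since the abstract does not give the sizing formula.

```python
from dataclasses import dataclass


@dataclass
class NodePair:
    """A pair of slave nodes holding identical replicas (hypothetical type)."""
    name: str
    benchmark_score: float  # assumed relative performance measure
    ram_gb: float           # RAM dedicated to the database


def portion_sizes(total_rows: int, pairs: list[NodePair]) -> dict[str, int]:
    """Split fact-table rows across node pairs in proportion to capability.

    Assumption: capability is the average of a pair's share of the total
    benchmark score and its share of the total RAM. The abstract only says
    that both measures influence the portion size.
    """
    total_score = sum(p.benchmark_score for p in pairs)
    total_ram = sum(p.ram_gb for p in pairs)
    weights = [
        0.5 * p.benchmark_score / total_score + 0.5 * p.ram_gb / total_ram
        for p in pairs
    ]

    # Largest-remainder rounding so the portions add up to total_rows exactly.
    exact = [total_rows * w for w in weights]
    sizes = [int(x) for x in exact]
    leftover = total_rows - sum(sizes)
    by_remainder = sorted(
        range(len(pairs)), key=lambda i: exact[i] - sizes[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return {p.name: s for p, s in zip(pairs, sizes)}


if __name__ == "__main__":
    pairs = [NodePair("pair_a", 100.0, 16.0), NodePair("pair_b", 300.0, 48.0)]
    print(portion_sizes(1000, pairs))
```

By contrast, a plain round-robin layout would give every pair the same number of rows regardless of hardware, which is the baseline the abstract compares against.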
Keywords
Real-time data warehousing; availability; fault tolerance; data replication and redundancy; distributed and parallel databases; load balancing; performance optimization
Subject
Real-Time Data Warehousing
Conference
COMPSAC 2012 - IEEE Signature Conference on Computer Software & Applications, July 2012