Time-Stratified Sampling for Approximate Answers to Aggregate Queries
Authors
Abstract
In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on samples of the full data. However, uniformly extracted samples often fail to guarantee enough sample points within each grouping interval for the estimates to reach acceptable accuracy. This is crucial in most practical less-aggregated analyses, which are mostly based on recent data (e.g., forecasting, performance analysis). We propose the use of time-interval stratified samples (TISS), a simple time-biased sampling strategy that produces summaries biased towards recency. This bias mitigates the representation problem and improves accuracy in important less-aggregated analyses without significantly degrading aggregated analyses over older data.
TISS achieves much better accuracy than either uniform samples or the recently proposed congressional samples (CS) for queries that analyze recent data, while maintaining full ad hoc usability. Furthermore, we show that TISS can be coupled with CS (TISS-CS) to combine the very accurate estimations of TISS with the minimal representation guarantees of CS on defined query patterns. Using CS in this context provides an additional minimal representation guarantee for the less well-represented older data.
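To make the idea of a recency-biased stratified sample concrete, the following is a minimal illustrative sketch, not the algorithm evaluated in this paper. It assumes rows of the form (timestamp, group, value), a hypothetical list of age strata with decreasing sampling rates, independent per-row (Bernoulli) sampling within each stratum, and inverse-rate weights for scaling sums back up. The function names, the strata boundaries, and the rates are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def tiss_sample(rows, now, strata, seed=0):
    """Draw a time-stratified sample (illustrative sketch only).

    rows   : iterable of (timestamp, group, value) tuples
    now    : reference time used to compute each row's age
    strata : list of (max_age, rate) pairs sorted by max_age ascending;
             the last max_age may be float("inf") to cover all older rows
    Returns a list of (timestamp, group, value, weight), where weight is
    the inverse of the sampling rate of the row's stratum.
    """
    rng = random.Random(seed)
    sample = []
    for ts, group, value in rows:
        age = now - ts
        for max_age, rate in strata:
            if age <= max_age:
                # Keep the row with probability equal to its stratum's rate.
                if rng.random() < rate:
                    sample.append((ts, group, value, 1.0 / rate))
                break  # each row belongs to exactly one stratum
    return sample


def estimate_sum_by_group(sample, min_ts):
    """Estimate SUM(value) per group over rows with timestamp >= min_ts."""
    totals = defaultdict(float)
    for ts, group, value, weight in sample:
        if ts >= min_ts:
            totals[group] += value * weight
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical data: 100 rows at timestamps 0..99, two groups.
    rows = [(t, "a" if t % 2 == 0 else "b", 1.0) for t in range(100)]
    # Recent rows (age <= 10) are kept in full; older rows at a 10% rate.
    strata = [(10, 1.0), (float("inf"), 0.1)]
    sample = tiss_sample(rows, now=100, strata=strata)
    print(estimate_sum_by_group(sample, min_ts=90))  # recent data only
    print(estimate_sum_by_group(sample, min_ts=0))   # whole history
```

Under these assumptions, a query restricted to the most recent stratum sees a dense sample, so its per-group estimates carry little sampling error, while a query over the whole history still receives an unbiased estimate of each sum because every sampled row is weighted by the inverse of its sampling rate.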