Accounting for Heteroscedasticity in Big Data
Authors
Abstract
For regression problems, the general practice is to assume a constant variance of the error term across all data. This simplifies an often complicated model and relies on the assumption that the error is independent of the input variables, a property known as homoscedasticity. In the real world, however, this is often a naive assumption, as we are rarely able to exhaustively include all true explanatory variables in a regression. While Big Data brings new opportunities for regression applications, ignoring this limitation may lead to biased estimators and inaccurate confidence and prediction intervals.
This paper studies the treatment of non-constant variance in regression models, also known as heteroscedasticity. We apply two methodologies: integrating the conditional variance within the regression model itself, and treating the regression model as a black box while using a meta-model that analyses the error separately. We compare the performance of both approaches on two heteroscedastic datasets.
Although accounting for heteroscedasticity increases the complexity of the models used, we show that it can greatly improve the quality of the predictions and, more importantly, provide a proper notion of uncertainty, or “confidence”, associated with those predictions. We also discuss the feasibility of both solutions in a Big Data context.
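To illustrate the meta-model idea described above, the following is a minimal two-stage sketch on synthetic data (the data-generating process here is an assumption for illustration, not one of the paper's datasets): a mean model is fit first, then a separate model of the error magnitude yields an input-dependent standard deviation used for prediction intervals.

```python
import numpy as np

# Synthetic heteroscedastic data: noise standard deviation grows with x.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 10.0, n)
sigma = 0.5 + 0.3 * x
y = 2.0 + 1.5 * x + rng.normal(0.0, sigma)

# Stage 1: ordinary least squares for the conditional mean (black-box model).
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Stage 2: meta-model of the error, regressing |residuals| on the inputs.
# Under normal errors, E|e| = sigma * sqrt(2/pi), hence the rescaling.
gamma, *_ = np.linalg.lstsq(X, np.abs(resid), rcond=None)
sigma_hat = (X @ gamma) * np.sqrt(np.pi / 2.0)

# Input-dependent 95% prediction interval and its empirical coverage.
lower = X @ beta - 1.96 * sigma_hat
upper = X @ beta + 1.96 * sigma_hat
coverage = np.mean((y >= lower) & (y <= upper))
```

With a constant-variance assumption the interval would be too wide for small `x` and too narrow for large `x`; the meta-model restores roughly nominal coverage across the input range.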