Accounting for Heteroscedasticity in Big Data
Authors
Abstract
For regression problems, the general practice is to assume a constant variance of the error term across all data. This simplifies an often complicated model and relies on the assumption that the error is independent of the input variables, a property known as homoscedasticity. In the real world, however, this is often a naive assumption, as we are rarely able to exhaustively include all true explanatory variables in a regression. While Big Data brings new opportunities for regression applications, ignoring this limitation may lead to biased estimators and inaccurate confidence and prediction intervals.
This paper studies the treatment of non-constant variance in regression models, also known as heteroscedasticity. We apply two methodologies: integrating the conditional variance within the regression model itself, and treating the regression model as a black box while using a meta-model that analyses the error separately. We compare the performance of both approaches on two heteroscedastic datasets.
Although accounting for heteroscedasticity increases the complexity of the models used, we show that it can greatly improve the quality of the predictions and, more importantly, provide a proper notion of uncertainty, or “confidence”, associated with those predictions. We also discuss the feasibility of both solutions in a Big Data context.
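To illustrate the meta-model idea described above, the following is a minimal two-stage sketch on synthetic data (the data-generating process here is an assumption for illustration, not one of the paper's datasets): a mean model is fit first, then a separate model of the error magnitude yields an input-dependent standard deviation used for prediction intervals.

```python
import numpy as np

# Synthetic heteroscedastic data: noise standard deviation grows with x.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 10.0, n)
sigma = 0.5 + 0.3 * x
y = 2.0 + 1.5 * x + rng.normal(0.0, sigma)

# Stage 1: ordinary least squares for the conditional mean (black-box model).
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Stage 2: meta-model of the error, regressing |residuals| on the inputs.
# Under normal errors, E|e| = sigma * sqrt(2/pi), hence the rescaling.
gamma, *_ = np.linalg.lstsq(X, np.abs(resid), rcond=None)
sigma_hat = (X @ gamma) * np.sqrt(np.pi / 2.0)

# Input-dependent 95% prediction interval and its empirical coverage.
lower = X @ beta - 1.96 * sigma_hat
upper = X @ beta + 1.96 * sigma_hat
coverage = np.mean((y >= lower) & (y <= upper))
```

With a constant-variance assumption the interval would be too wide for small `x` and too narrow for large `x`; the meta-model restores roughly nominal coverage across the input range.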