CISUC

The Effects of Soft Errors and Mitigation Strategies for Virtualization Servers

Authors

Abstract

Virtualized servers compose the majority of cloud computing environments, where these nodes are used to host multiple clients over the same hardware. Many organizations run online applications by hiring elastic computing resources in order to match demand while reducing fixed costs. However, such organizations are unlikely to take advantage of these benefits for critical applications, as it would expose them to several risks. Among other threats, soft errors are a concern in large-scale reliable servers and are expected to become more frequent as a consequence of smaller transistors and lower operating voltages of integrated circuits. This paper characterizes virtualized servers of cloud environments in presence of soft errors. Using fault injection, we collect experimental data to determine the failure modes of applications, operating systems, VMs and hypervisor. The analysis exposes distinct failure modes, ranging from crash failures of a single virtual machine to silent data corruption in permanent storage. The most frequent failure mode, observed in 10–30% of injected errors, consists of a hang affecting multiple virtual machines. Given that such failures are a primary cause of downtime, we develop and evaluate a recovery mechanism which uses online testing and recovers a server from all hangs by rebooting its hypervisor.

Keywords

Virtualization, fault injection, cloud computing, fault tolerance, dependability.

Subject

Cloud computing availability and dependability

Related Project

AESOP - Autonomic Service Operation

Journal

IEEE Transactions on Cloud Computing, February 2020

DOI


Cited by

No citations found