The Effects of Soft Errors and Mitigation Strategies for Virtualization Servers
Authors
Abstract
Virtualized servers make up the majority of cloud computing environments, where each node hosts multiple clients on the same hardware. Many organizations run online applications by renting elastic computing resources in order to match demand while reducing fixed costs. However, such organizations are unlikely to take advantage of these benefits for critical applications, as doing so would expose them to several risks. Among other threats, soft errors are a concern in large-scale reliable servers and are expected to become more frequent as a consequence of smaller transistors and lower operating voltages in integrated circuits. This paper characterizes the behavior of virtualized servers in cloud environments in the presence of soft errors. Using fault injection, we collect experimental data to determine the failure modes of applications, operating systems, virtual machines (VMs), and the hypervisor. The analysis exposes distinct failure modes, ranging from crash failures of a single virtual machine to silent data corruption in permanent storage. The most frequent failure mode, observed in 10–30% of injected errors, is a hang affecting multiple virtual machines. Given that such failures are a primary cause of downtime, we develop and evaluate a recovery mechanism that uses online testing and recovers a server from all hangs by rebooting its hypervisor.
Keywords
Virtualization, fault injection, cloud computing, fault tolerance, dependability.
Subject
Cloud computing availability and dependability
Related Project
AESOP - Autonomic Service Operation
Journal
IEEE Transactions on Cloud Computing, February 2020
DOI