CISUC

A Survey on Reliable Distributed Communication

Authors

Abstract

From entertainment to personal communication, and from business to safety-critical applications, the world increasingly relies on distributed systems. Despite looking simple, distributed systems hide a major source of complexity: tolerating faults and component crashes is very difficult, due to the incompleteness of (remote) knowledge. The need to overcome this problem, and provide different guarantees to applications, sparked a huge research effort and resulted in a large body of communication protocols, and middleware. Thus, it is worthwhile to survey the state of the art in distributed systems, with a particular emphasis on reliable communication. We discuss key concepts in reliable communication, such as interaction patterns (e.g., one-way vs. request-response, synchronous vs. asynchronous), reliability semantics (e.g., at-least-once, at-most-once), and reliability targets (e.g., message, conversation), and we analyze a wide set of current communication solutions, which map to the different concepts. Building on the concepts, we analyze applications that have different reliable communication needs. As a result, we observe that, in most cases, elaborate communication solutions offering superior guarantees are purely academic efforts that cannot compete with the popularity and maturity of established, albeit poorer solutions. Based on our analysis, we identify and discuss open research topics in this area.

Keywords

Distributed Systems, Reliable Communication, Fault-tolerance, Reliability Mechanisms

Subject

Reliable Distributed Communication

Journal

Elsevier Journal of Systems and Software, March 2018

DOI


Cited by

No citations found