TY - JOUR
T1 - Fatriot
T2 - Fault-tolerant MEC architecture for mission-critical systems using a SmartNIC
AU - Park, Taejune
AU - You, Myoungsung
AU - Kim, Jinwoo
AU - Lee, Seungsoo
N1 - Publisher Copyright: © 2024 Elsevier Ltd
PY - 2024/11
Y1 - 2024/11
N2 - Multi-access edge computing (MEC), which deploys cloud infrastructure close to end devices to reduce latency, plays a pivotal role in mission-critical services such as smart grids, self-driving cars, and healthcare. Ensuring fault tolerance is paramount for these services, as failures can lead to fatal accidents and blackouts. However, the distributed nature of MEC architectures makes them more susceptible to failures than traditional cloud systems. Existing research in this field has focused on enhancing robustness to prevent failures in MEC systems rather than on recovering from failures once they occur. To bridge this gap, we introduce Fatriot, a SmartNIC-based architecture designed to ensure fault tolerance in MEC systems. Fatriot actively monitors MEC hosts for anomalies and seamlessly redirects incoming service traffic to backup hosts upon detecting failures. Operating as a stand-alone solution on a SmartNIC, Fatriot guarantees the continuous operation of its fault-tolerance mechanism even during severe errors (e.g., kernel failure) on the MEC host, maintaining uninterrupted operation of mission-critical services. Our prototype of Fatriot, implemented on the NetFPGA-SUME, effectively mitigates various failure scenarios while imposing minimal overhead on services (less than 1%).
AB - Multi-access edge computing (MEC), which deploys cloud infrastructure close to end devices to reduce latency, plays a pivotal role in mission-critical services such as smart grids, self-driving cars, and healthcare. Ensuring fault tolerance is paramount for these services, as failures can lead to fatal accidents and blackouts. However, the distributed nature of MEC architectures makes them more susceptible to failures than traditional cloud systems. Existing research in this field has focused on enhancing robustness to prevent failures in MEC systems rather than on recovering from failures once they occur. To bridge this gap, we introduce Fatriot, a SmartNIC-based architecture designed to ensure fault tolerance in MEC systems. Fatriot actively monitors MEC hosts for anomalies and seamlessly redirects incoming service traffic to backup hosts upon detecting failures. Operating as a stand-alone solution on a SmartNIC, Fatriot guarantees the continuous operation of its fault-tolerance mechanism even during severe errors (e.g., kernel failure) on the MEC host, maintaining uninterrupted operation of mission-critical services. Our prototype of Fatriot, implemented on the NetFPGA-SUME, effectively mitigates various failure scenarios while imposing minimal overhead on services (less than 1%).
KW - Mission-critical system
KW - Multi-access Edge Computing (MEC)
KW - Programmable data plane
UR - https://www.scopus.com/pages/publications/85200737354
U2 - 10.1016/j.jnca.2024.103978
DO - 10.1016/j.jnca.2024.103978
M3 - Article
AN - SCOPUS:85200737354
SN - 1084-8045
VL - 231
JO - Journal of Network and Computer Applications
JF - Journal of Network and Computer Applications
M1 - 103978
ER -