Predictive maintenance is an invaluable tool for preserving the health of mission-critical assets while minimizing the operational costs of scheduled intervention. Artificial intelligence techniques have been shown to be effective at processing large volumes of data, such as those collected by the sensors typically present in equipment. In this work, we identify and summarize existing publications in the field of predictive maintenance that explore machine learning and deep learning algorithms to improve the performance of failure classification and detection. We show a significant upward trend in the use of deep learning methods applied to sensor data collected from mission-critical assets for early failure detection in support of predictive maintenance schedules. We also identify aspects that require further investigation in future work, namely the exploration of life support systems for supercomputing assets and the standardization of performance metrics.
Recommended citation: Lima ALCD, Aranha VM, Carvalho CJL, Nascimento EGS. (2021) "Smart predictive maintenance for high-performance computing systems: a literature review." Journal of Supercomputing 77, 13494–13513. https://doi.org/10.1007/s11227-021-03811-7