Reducing AutoML Pipeline Failures in Heterogeneous Cloud Environments via Retrieval-Augmented Fault Prediction

Zihan Liu; Ruihan Ma

doi:10.54097/ath09617

Authors

Zihan Liu, Department of Computer Science, George Mason University, United States
Ruihan Ma, Department of Computer Science, Portland State University, United States

Keywords:

Automated Machine Learning, Cloud Computing, Fault Prediction, Retrieval-Augmented Generation, Pipeline Failure, Heterogeneous Environments, Fault Tolerance

Abstract

Automated Machine Learning (AutoML) pipelines deployed in heterogeneous cloud environments are increasingly susceptible to runtime failures caused by resource contention, hardware diversity, and dynamic workload fluctuations. These failures impose substantial operational overhead and compromise the reliability of large-scale machine learning workflows. This paper introduces a Retrieval-Augmented Fault Prediction (RAFP) framework designed to anticipate AutoML pipeline failures by coupling a dense vector retrieval module over a structured historical failure knowledge base with a gradient-boosted fault classifier. The retrieval module encodes contextually similar past failure records as auxiliary features, enabling the prediction model to condition its output on domain-specific failure patterns rather than relying solely on real-time telemetry signals. The RAFP framework is evaluated against four baseline systems — logistic regression (LR), random forest (RF), gradient boosting (GB), and long short-term memory (LSTM) networks — using the Google Cluster Trace v3 and the Alibaba Cluster Trace 2018. Experimental results demonstrate that RAFP achieves a macro-averaged F1-score of 0.891 and reduces pipeline disruption events by 37.4% relative to the strongest baseline. Ablation studies confirm that the retrieval component yields consistent performance improvements across heterogeneous node configurations. These findings indicate that retrieval-augmented reasoning represents a practically effective complement to existing proactive fault prediction architectures for cloud-hosted AutoML systems.
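
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the encoder, the similarity measure, or the exact auxiliary features, so the cosine similarity, the two summary features (mean neighbor similarity and neighbor failure rate), and all names below (`retrieve_top_k`, `augment_features`, the toy knowledge base) are assumptions chosen for illustration. The gradient-boosted classifier that would consume the augmented vector is omitted.

```python
import math
from typing import List, Sequence, Tuple

# A knowledge-base record: (embedding of a past run, 1 if it failed else 0).
# This record layout is hypothetical; the paper's schema is not given in the abstract.
Record = Tuple[Sequence[float], int]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Sequence[float], kb: List[Record], k: int) -> List[Tuple[float, int]]:
    """Return (similarity, failure_label) for the k records most similar to the query."""
    scored = [(cosine(query, emb), label) for emb, label in kb]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def augment_features(telemetry: Sequence[float], query: Sequence[float],
                     kb: List[Record], k: int = 3) -> List[float]:
    """Append retrieval-derived features to the real-time telemetry vector."""
    neighbors = retrieve_top_k(query, kb, k)
    if not neighbors:
        # Empty knowledge base: fall back to neutral auxiliary features.
        return list(telemetry) + [0.0, 0.0]
    mean_similarity = sum(sim for sim, _ in neighbors) / len(neighbors)
    failure_rate = sum(label for _, label in neighbors) / len(neighbors)
    return list(telemetry) + [mean_similarity, failure_rate]


# Toy knowledge base: two failed runs near [1, 0], two healthy runs near [0, 1].
toy_kb: List[Record] = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], 0),
    ([0.1, 0.9], 0),
]

features = augment_features(telemetry=[0.7, 0.2], query=[1.0, 0.05], kb=toy_kb, k=2)
```

In this toy setting the two nearest records are both past failures, so the appended failure-rate feature is 1.0. A downstream classifier could then weigh that signal alongside the raw telemetry, which is the conditioning on historical failure patterns that the abstract describes.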
