
ML & AI for DevOps
- English
- Professional
- Online: 1250 CHF
This training covers essential AI and Machine Learning techniques applied to modern DevOps and Site Reliability Engineering. It offers a detailed and practical exploration of how ML can enhance observability, incident prediction, anomaly detection, forecasting, and intelligent log analysis. Through hands-on labs, participants learn to build and deploy ML pipelines using real-world operational data and to integrate them into DevOps workflows with tools such as Prometheus, Grafana, and FastAPI.
Day 1: Foundations of Machine Learning for DevOps
- Understand how and where ML enhances DevOps workflows (incident prediction, proactive scaling, intelligent alerting)
- Learn how to build and evaluate ML models for reliability scoring using production data (metrics/logs)
- Set up an end-to-end ML pipeline with versioning, tracking, and containerization for DevOps environments
- Train and evaluate ML models (e.g., Random Forest, XGBoost) on system metrics/logs for reliability scoring
- Set up ML experiment tracking with MLflow and version control using Git
- Containerize the ML application with Docker for integration into DevOps pipelines
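As an illustration of the kind of evaluation covered on Day 1, the sketch below scores a simple rule-based baseline for incident prediction with precision and recall. It is a minimal example under stated assumptions, not the course's lab material: the sample values, the thresholds, and the helper names (`baseline_predict`, `precision_recall`) are all hypothetical, and a trained model such as Random Forest or XGBoost would take the place of the rule.

```python
from typing import List, Tuple


def precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Return (precision, recall) for binary labels where 1 means 'incident'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def baseline_predict(cpu_percent: float, error_rate: float) -> int:
    """Hypothetical rule: flag a window when CPU or error rate is high."""
    return 1 if cpu_percent > 85.0 or error_rate > 0.05 else 0


# Made-up samples: (cpu_percent, error_rate) per time window, with incident labels.
samples = [(35.0, 0.01), (92.0, 0.08), (88.0, 0.02),
           (40.0, 0.00), (95.0, 0.12), (60.0, 0.07)]
labels = [0, 1, 0, 0, 1, 1]

predictions = [baseline_predict(cpu, err) for cpu, err in samples]
precision, recall = precision_recall(labels, predictions)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A baseline like this gives a reference point: a learned model is only worth deploying if it improves on the rule for the metric that matters to the team (for example, fewer false alerts at the same recall).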
Day 2: Time Series Modeling and Forecasting
- Model system metrics (CPU, memory, latency) as time series for forecasting and anomaly detection
- Compare traditional and ML-based models (ARIMA, Prophet, LSTM) for operational forecasting
- Explore anomaly detection approaches and alerting strategies based on log and metric patterns
- Build a forecasting pipeline for system metrics using models like Prophet or Facebook Kats
- Implement anomaly detection using Isolation Forest or DBSCAN on metric/log data
- Deploy a FastAPI service for real-time predictions and visualize alerts in Grafana
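To give a feel for metric-based anomaly detection, here is a minimal sketch that flags spikes in a CPU series with a rolling z-score. This is a deliberately simpler technique than Isolation Forest or DBSCAN, chosen because it needs only the Python standard library; the function name, window size, threshold, and data are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import List


def rolling_zscore_anomalies(values: List[float], window: int = 5,
                             threshold: float = 3.0) -> List[int]:
    """Return indices whose value deviates from the preceding `window`
    points by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up CPU utilisation samples (percent) with one obvious spike.
cpu = [41.0, 43.0, 42.0, 44.0, 42.0, 43.0, 41.0, 95.0, 42.0, 43.0]
spikes = rolling_zscore_anomalies(cpu)
print(spikes)
```

Note that the spike inflates the standard deviation of the windows that follow it, so nearby points are less likely to be flagged. Robust statistics or a model-based detector reduce that effect.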
Day 3: NLP for Logs and Intelligent Incident Analysis
- Apply NLP techniques (TF-IDF, Word2Vec, BERT) to logs for root cause analysis and classification
- Use ML/NLP to group similar incidents and extract summaries for escalation or resolution
- Build a unified intelligent monitoring system combining forecasting, anomaly detection, and NLP
- Build a log classification model for error categorization and incident grouping
- Develop an NLP component to extract relevant information from logs and generate incident summaries
- Integrate the full monitoring pipeline (forecasting, anomaly detection, NLP) and evaluate on real or simulated data
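The idea of grouping similar incidents from log text can be sketched with TF-IDF vectors and cosine similarity. The example below uses only the Python standard library and one common smoothed IDF formula; the log lines, the tokenizer, and the function names are hypothetical, and a real pipeline would more likely rely on a library vectorizer or learned embeddings such as Word2Vec or BERT.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(line: str) -> List[str]:
    """Lowercase a log line and keep alphabetic tokens only,
    which drops numbers such as durations and host indices."""
    return re.findall(r"[a-z]+", line.lower())


def tfidf_vectors(lines: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) with a smoothed IDF."""
    docs = [Counter(tokenize(line)) for line in lines]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in doc.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up log lines: two database errors and one unrelated disk warning.
logs = [
    "ERROR db connection timeout after 30s host=db-1",
    "ERROR db connection refused host=db-2",
    "WARN disk usage at 91 percent on /var",
]
vectors = tfidf_vectors(logs)
sim_db = cosine(vectors[0], vectors[1])      # related incidents
sim_other = cosine(vectors[0], vectors[2])   # unrelated incidents
print(sim_db > sim_other)
```

Pairs whose similarity exceeds a chosen cutoff can then be placed in the same incident group. The cutoff is a tuning decision that depends on how noisy the logs are.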
This course is available online and onsite, and is fully customizable to your needs.
*The course is also available in French.

