ML & AI for DevOps

This training covers essential AI and Machine Learning techniques applied to modern DevOps and Site Reliability Engineering. It offers a detailed and practical exploration of how ML can enhance observability, incident prediction, anomaly detection, forecasting, and intelligent log analysis. Through hands-on labs, participants learn to build and deploy ML pipelines using real-world operational data and to integrate them into DevOps workflows with tools such as Prometheus, Grafana, and FastAPI.

Day 1: Foundations of Machine Learning for DevOps

  • Understand how and where ML enhances DevOps workflows (incident prediction, proactive scaling, intelligent alerting)
  • Learn how to build and evaluate ML models for reliability scoring using production data (metrics/logs)
  • Set up an end-to-end ML pipeline with versioning, tracking, and containerization for DevOps environments
  • Train and evaluate ML models (e.g., Random Forest, XGBoost) on system metrics/logs for reliability scoring
  • Set up ML experiment tracking with MLflow and version control using Git
  • Containerize the ML application with Docker for integration into DevOps pipelines
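
As an illustration of the evaluation step, the sketch below computes precision, recall, and F1 for a binary incident predictor. Incidents are usually rare, so these metrics tend to say more than plain accuracy. The labels and predictions are made-up sample data, and the function name `precision_recall_f1` is hypothetical rather than part of the course material.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = incident)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up sample: 1 means "an incident followed this time window".
y_true = [0, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In an actual lab, the predictions would come from a trained model such as a Random Forest, and the same metrics could be logged to an experiment tracker.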

Day 2: Time Series Modeling and Forecasting

  • Model system metrics (CPU, memory, latency) as time series for forecasting and anomaly detection
  • Compare traditional and ML-based models (ARIMA, Prophet, LSTM) for operational forecasting
  • Explore anomaly detection approaches and alerting strategies based on log and metric patterns
  • Build a forecasting pipeline for system metrics using models like Prophet or Facebook Kats
  • Implement anomaly detection using Isolation Forest or DBSCAN on metric/log data
  • Deploy a FastAPI service for real-time predictions and visualize alerts in Grafana
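
To give a feel for metric-based anomaly detection, here is a minimal sketch that flags points deviating strongly from a trailing window. It uses a rolling z-score, which is a simpler stand-in for the Isolation Forest and DBSCAN approaches named above, not a replacement for them. The CPU values, window size, and threshold are illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A flat window has sigma == 0; skip it to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        # Every point joins the history, including flagged ones.
        history.append(value)
    return anomalies


# Made-up CPU utilisation samples (percent) with one obvious spike.
cpu_percent = [41, 43, 40, 42, 44, 41, 43, 95, 42, 40]
print(rolling_zscore_anomalies(cpu_percent))  # -> [7]
```

Because flagged points stay in the window, a spike temporarily widens the baseline. A production detector would typically exclude or down-weight them.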

Day 3: NLP for Logs and Intelligent Incident Analysis

  • Apply NLP techniques (TF-IDF, Word2Vec, BERT) to logs for root cause analysis and classification
  • Use ML/NLP to group similar incidents and extract summaries for escalation or resolution
  • Build a unified intelligent monitoring system combining forecasting, anomaly detection, and NLP
  • Build a log classification model for error categorization and incident grouping
  • Develop an NLP component to extract relevant information from logs and generate incident summaries
  • Integrate the full monitoring pipeline (forecasting, anomaly detection, NLP) and evaluate on real or simulated data
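
The idea of grouping similar incidents can be sketched with TF-IDF vectors and cosine similarity, using only the Python standard library. The log lines are invented, the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are hypothetical, and the smoothed IDF formula is one common variant among several.

```python
import math
import re
from collections import Counter


def tokenize(line):
    """Lowercase a log line and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", line.lower())


def tfidf_vectors(lines):
    """Return one sparse TF-IDF vector (term -> weight dict) per line."""
    docs = [Counter(tokenize(line)) for line in lines]
    n = len(docs)
    # Document frequency: number of lines in which each term appears.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented log lines: two database errors and one unrelated disk warning.
logs = [
    "ERROR database connection timeout on db-primary",
    "ERROR database connection refused on db-replica",
    "WARN disk usage above threshold on node-7",
]
vectors = tfidf_vectors(logs)
print("db vs db:  ", round(cosine(vectors[0], vectors[1]), 3))
print("db vs disk:", round(cosine(vectors[0], vectors[2]), 3))
```

The two database errors score much closer to each other than to the disk warning, which is the signal a clustering or classification step would build on.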

This course is available online and onsite, and is fully customizable to your needs.
*The course is also available in French.

Theory

Practical Labs

Learning outcomes:

This training enables DevOps engineers, SREs, and platform teams to leverage machine learning and AI to enhance observability, incident prediction, root cause analysis, and self-healing automation in modern infrastructures.

Your profile and prerequisites:

  • DevOps engineers
  • Site reliability engineers (SREs)
  • Platform engineers
  • Observability engineers
  • Technical team leads working on scalable systems and infrastructure.

With knowledge of

  • DevOps practices and tooling
  • Python scripting
  • System metrics, logs, and monitoring tools (basic understanding)
  • Machine learning (some exposure is a plus)

Learning outcomes:

You will learn how to effectively apply SRE hard and soft skills in your work and architecture.

  1. Understand what SRE is, why it is important, and how it can be applied in practice with the Digital Highway for Software Delivery.
  2. Learn how to understand the inner workings of your application in production by applying SLO engineering principles and Observability.
  3. Learn how to continuously deliver software into production and how to embrace the shift-right paradigm through Continuous Verification and Rollbacks.

Your profile and prerequisites:

  • Software engineers 
  • DevOps engineers
  • System engineers
  • ML Architects

With knowledge of

  • Software Engineering skills (OOP, Scripting, ad ac code,…)
  • System Engineering skills (OS, Network, Deployment, Security, Monitoring,…)
  • Advantageous: Performance Analysis, Release Engineering, APM/Infra Monitoring, Distributed/Reliable Architecture Design