ai-infra-curriculum/ai-infra-engineer-learning

AI Infrastructure Engineer - Learning Path

Master AI Infrastructure Engineering through hands-on projects and practical learning


🎯 Overview

This repository contains a complete, production-ready learning path for becoming an AI Infrastructure Engineer. Through comprehensive modules, real-world projects, and production-grade code stubs with educational TODO comments, you'll develop the skills needed to build, deploy, and maintain ML infrastructure at scale.

Repository Status: 100% COMPLETE - All modules and projects ready for learning!

What You'll Master

  • Build ML Infrastructure from scratch (Docker, Kubernetes, cloud platforms)
  • Deploy Production ML Systems with auto-scaling and comprehensive monitoring
  • Implement End-to-End MLOps pipelines (Airflow, MLflow, DVC)
  • Deploy Cutting-Edge LLM Infrastructure (vLLM, RAG, vector databases)
  • Scale Training with distributed systems and GPU clusters
  • Monitor and Troubleshoot complex ML systems in production
  • Optimize Costs across cloud providers (60-80% savings possible)

Why This Learning Path?

  • 🎓 Industry-Aligned: Based on actual job requirements from FAANG and top tech companies
  • 💻 Hands-On: Code stubs with TODO comments guide you through real implementations
  • 🏗️ Production-Ready: Learn patterns used at Netflix, Uber, Airbnb, OpenAI
  • 📊 Career-Focused: Directly maps to $120k-$180k AI Infrastructure Engineer roles
  • 🚀 Progressive: 10 modules building from basics to advanced LLM infrastructure
  • 🔥 Modern Stack: 2024-2025 technologies (vLLM, RAG, GPU optimization)

✨ What's New

Recently Added Content:

  • 📝 Comprehensive Quizzes for modules 102-110 (265+ questions)
    • Module 102: Cloud Computing (mid-module + final, 50 questions)
    • Module 103: Containerization (25 questions)
    • Module 104: Kubernetes (30 questions)
    • Module 105: Data Pipelines (25 questions)
    • Module 106: MLOps (30 questions)
    • Module 107: GPU Computing (25 questions)
    • Module 108: Monitoring (25 questions)
    • Module 109: IaC (25 questions)
    • Module 110: LLM Infrastructure (30 questions)
  • 📋 Technology Versions Guide - Complete specifications for 100+ tools
  • 🗺️ Curriculum Cross-Reference - Mapping to Junior track
  • 📈 Career Progression Guide - Engineer to Principal roadmap

📊 What's Included

10 Complete Learning Modules (130 Files)

| Module | Topic | Hours | Status | Quiz |
|--------|-------|-------|--------|------|
| 01 | Foundations | 50h | ✅ Complete (15 files) | ✅ 30Q |
| 02 | Cloud Computing | 50h | ✅ Complete (11 files) | +50Q |
| 03 | Containerization | 50h | ✅ Complete (14 files) | +25Q |
| 04 | Kubernetes | 50h | ✅ Complete (13 files) | +30Q |
| 05 | Data Pipelines | 50h | ✅ Complete (12 files) | +25Q |
| 06 | MLOps | 50h | ✅ Complete (12 files) | +30Q |
| 07 | GPU Computing | 50h | ✅ Complete (12 files) | +25Q |
| 08 | Monitoring & Observability | 50h | ✅ Complete (11 files) | +25Q |
| 09 | Infrastructure as Code | 50h | ✅ Complete (12 files) | +25Q |
| 10 | LLM Infrastructure | 50h | ✅ Complete (12 files) | +30Q |

3 Production-Grade Projects (77 Files)

| Project | Technologies | Duration | Files | Status |
|---------|--------------|----------|-------|--------|
| 01: Basic Model Serving | FastAPI + K8s + Monitoring | 30h | ~30 | ✅ Complete |
| 02: MLOps Pipeline | Airflow + MLflow + DVC | 40h | 30 | ✅ Complete |
| 03: LLM Deployment | vLLM + RAG + Vector DB | 50h | 47 | ✅ Complete |

Total Repository: 207 files | ~95,000+ lines of code | 500+ hours of learning content


🎓 Prerequisites

Option 1: Complete Junior Curriculum (RECOMMENDED)

If you've completed the Junior AI Infrastructure Engineer curriculum, you have ALL required prerequisites! ✅

The Junior curriculum covers:

  • ✅ Python fundamentals & advanced concepts
  • ✅ Linux/Unix command line mastery
  • ✅ Git & version control workflows
  • ✅ ML basics (PyTorch, TensorFlow)
  • ✅ Docker & containerization
  • ✅ Kubernetes introduction
  • ✅ API development & databases
  • ✅ Monitoring & cloud platforms

Duration: 440 hours (22 weeks part-time, 11 weeks full-time)

Option 2: Self-Assessment

Haven't completed the Junior curriculum? Use our comprehensive Prerequisites Guide to:

  • Check your readiness with detailed skill checklists
  • Identify knowledge gaps
  • Get personalized learning recommendations
  • Run an automated skill assessment

Minimum Requirements

If self-studying, you must have:

  • Python 3.9+ (intermediate level: OOP, async, testing, type hints)
  • Linux/Unix CLI (bash scripting, processes, debugging)
  • Git fundamentals (branching, merging, collaboration)
  • ML basics (PyTorch/TensorFlow, training, inference, evaluation)
  • Docker basics (images, containers, Compose)
  • Kubernetes intro (pods, deployments, services)
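
Part of the checklist above can be verified from a script. The following is a minimal sketch and is not part of the repository: the function name `check_environment` is hypothetical, the executable names `git`, `docker`, and `kubectl` are assumptions, and the check only confirms that each tool is on `PATH`, not that it is configured or recent enough.

```python
import shutil
import sys

# Assumed executable names; adjust if your tools are installed under other names.
REQUIRED_TOOLS = ("git", "docker", "kubectl")
MIN_PYTHON = (3, 9)  # from the "Python 3.9+" requirement above


def check_environment(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    results = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:10s} {'ok' if ok else 'MISSING'}")
```

A passing result here says nothing about skill level; it only rules out missing tooling before you start Module 01.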

👉 Not sure if you're ready? Read the Prerequisites Guide for a detailed assessment.


🚀 Getting Started

Quick Start

# 1. Clone repository
git clone https://github.com/ai-infra-curriculum/ai-infra-engineer-learning.git
cd ai-infra-engineer-learning

# 2. Create virtual environment
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start with Module 01
cd lessons/mod-101-foundations
cat README.md

Learning Path

  1. Modules 01-02 (Foundations) - Start here if new to ML infrastructure
  2. Modules 03-04 (Core Infrastructure) - Docker and Kubernetes mastery
  3. Modules 05-06 (MLOps) - Data pipelines and ML operations
  4. Modules 07-08 (Advanced) - GPU computing and monitoring
  5. Modules 09-10 (Modern Stack) - IaC and LLM infrastructure

Detailed guide: GETTING_STARTED.md
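
To turn the learning path into a rough calendar, the published hours can be divided by a weekly pace. This is a small sketch under stated assumptions: the hour figures come from the module and project tables in this README, while the 20 and 40 hours-per-week paces and the function name `weeks_needed` are illustrative choices, not something the repository prescribes.

```python
import math

MODULE_HOURS = [50] * 10      # ten modules at 50 hours each (module table)
PROJECT_HOURS = [30, 40, 50]  # Projects 01-03 (project table)


def weeks_needed(hours_per_week, include_projects=True):
    """Weeks to finish at a steady weekly pace, rounded up to whole weeks."""
    if hours_per_week <= 0:
        raise ValueError("hours_per_week must be positive")
    total = sum(MODULE_HOURS) + (sum(PROJECT_HOURS) if include_projects else 0)
    return math.ceil(total / hours_per_week)


print(weeks_needed(20))                          # 31 (assumed part-time pace)
print(weeks_needed(40))                          # 16 (assumed full-time pace)
print(weeks_needed(20, include_projects=False))  # 25 (modules only)
```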


📖 Curriculum Overview

Module 01: Foundations ✅

50 hours | 15 files

Build your foundation in ML infrastructure:

  • ML infrastructure landscape and career paths
  • Python environment setup and best practices
  • ML frameworks (PyTorch, TensorFlow)
  • Docker fundamentals and containerization
  • REST API development with FastAPI

View Module 01 →


Module 02: Cloud Computing ✅

50 hours | 11 files

Master cloud platforms for ML:

  • Cloud architecture for ML workloads
  • AWS (EC2, S3, EKS, SageMaker)
  • GCP (Compute Engine, GCS, GKE, Vertex AI)
  • Azure (VMs, Blob Storage, AKS, Azure ML)
  • Multi-cloud strategies and cost optimization (60-80% savings)

View Module 02 →


Module 03: Containerization ✅

50 hours | 14 files

Deep dive into containers:

  • Docker architecture and best practices
  • Multi-stage builds and optimization
  • Docker Compose for multi-service applications
  • Container registries and image management
  • Security and vulnerability scanning

View Module 03 →


Module 04: Kubernetes ✅

50 hours | 13 files

Master Kubernetes for ML:

  • Kubernetes architecture and components
  • Deployments, Services, ConfigMaps, Secrets
  • GPU resource management and scheduling
  • Autoscaling (HPA, VPA, Cluster Autoscaler)
  • Helm charts and GitOps with ArgoCD
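
As a taste of the autoscaling topic, the core scaling rule documented for the Horizontal Pod Autoscaler is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, with no change when the ratio is within a tolerance (0.1 by default) of 1.0. The sketch below applies that rule to a single metric. It is a simplification, not the controller's implementation: it ignores missing metrics, pods that are not ready, stabilization windows, and multiple metrics, and the default `min_replicas` and `max_replicas` values are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the HPA scaling rule for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60))   # 6: load is 1.5x the target
print(desired_replicas(4, 30, 60))   # 2: load is half the target
print(desired_replicas(4, 63, 60))   # 4: within tolerance, no change
print(desired_replicas(8, 120, 60))  # 10: capped at max_replicas
```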

View Module 04 →


Module 05: Data Pipelines ✅

50 hours | 12 files

Build robust data pipelines:

  • Apache Airflow for workflow orchestration
  • Data processing with Apache Spark
  • Streaming data with Apache Kafka
  • Data version control with DVC
  • Data quality validation and monitoring

View Module 05 →


Module 06: MLOps ✅

50 hours | 12 files

Implement MLOps best practices:

  • Experiment tracking with MLflow
  • Model registry and versioning
  • Feature stores and engineering
  • CI/CD for ML models
  • A/B testing and experimentation
  • ML governance and best practices

View Module 06 →


Module 07: GPU Computing & Distributed Training ✅

50 hours | 12 files

Harness GPU power:

  • CUDA programming fundamentals
  • PyTorch GPU acceleration
  • Distributed training (DDP, FSDP)
  • Multi-GPU and multi-node training
  • Model and pipeline parallelism
  • GPU memory optimization

View Module 07 →


Module 08: Monitoring & Observability ✅

50 hours | 11 files

Build comprehensive observability:

  • Prometheus and Grafana
  • Metrics, logs, and traces (OpenTelemetry)
  • Distributed tracing with Jaeger
  • Alerting and incident response
  • Model performance monitoring
  • SLIs, SLOs, and SLAs
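
An availability SLO translates directly into an error budget: the amount of downtime a service may accumulate in a window before the objective is missed. The sketch below shows that arithmetic. The function name and the 30-day window are assumptions made for illustration; teams choose their own windows and SLI definitions.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.1f} min")
# 99.0% over 30 days -> 432.0 min
# 99.9% over 30 days -> 43.2 min
# 99.99% over 30 days -> 4.3 min
```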

View Module 08 →


Module 09: Infrastructure as Code ✅

50 hours | 12 files

Automate infrastructure:

  • Terraform fundamentals and best practices
  • Pulumi for multi-language IaC
  • CloudFormation for AWS
  • State management and modules
  • Multi-environment deployments
  • GitOps workflows

View Module 09 →


Module 10: LLM Infrastructure ✅

50 hours | 12 files

Master cutting-edge LLM infrastructure (2024-2025):

  • LLM serving with vLLM and TensorRT-LLM
  • RAG (Retrieval-Augmented Generation)
  • Vector databases (Pinecone, Weaviate, Milvus)
  • Model quantization (FP16, INT8)
  • GPU optimization for inference
  • Cost tracking and optimization
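
Quantization matters for serving mainly because weight memory scales with bytes per parameter. The following back-of-the-envelope sketch illustrates this. The 7-billion-parameter model is a hypothetical example, and the figures cover weights only: KV cache, activations, and runtime overhead are not included.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
# fp32: 28.0 GB
# fp16: 14.0 GB
# int8: 7.0 GB
```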

View Module 10 →


🛠️ Projects

Project 01: Basic Model Serving System ✅

⭐ Beginner | 30 hours | ~30 files

Build a complete model serving system:

  • FastAPI REST API for image classification
  • Docker containerization with optimization
  • Kubernetes deployment with monitoring
  • Prometheus and Grafana dashboards
  • CI/CD pipeline with GitHub Actions

Technologies: FastAPI, Docker, Kubernetes, PyTorch, Prometheus, Grafana

View Project 01 →


Project 02: End-to-End MLOps Pipeline ✅

⭐⭐ Intermediate | 40 hours | 30 files

Create a production MLOps pipeline:

  • Apache Airflow DAGs (data, training, deployment)
  • MLflow experiment tracking and model registry
  • DVC for data versioning
  • Automated model deployment to Kubernetes
  • Comprehensive monitoring and alerting
  • CI/CD with automated testing

Technologies: Airflow, MLflow, DVC, PostgreSQL, Redis, MinIO, Kubernetes

View Project 02 →


Project 03: LLM Deployment Platform ✅

⭐⭐⭐ Advanced | 50 hours | 47 files

Deploy cutting-edge LLM infrastructure:

  • vLLM/TensorRT-LLM for optimized serving
  • RAG system with vector database (Pinecone/ChromaDB/Milvus)
  • Document ingestion pipeline (PDF, TXT, web)
  • FastAPI with Server-Sent Events streaming
  • Kubernetes with GPU support
  • Cost tracking and optimization
  • Comprehensive monitoring
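
For the streaming item above, it helps to see what Server-Sent Events look like on the wire: each event is a group of `field: value` lines ending with a blank line, served with the `text/event-stream` media type. The sketch below formats token output that way using only the standard library. The helper names and the `[DONE]` end-of-stream marker are assumptions for illustration, not the project's actual code; in a FastAPI app a generator like this would typically be wrapped in a streaming response.

```python
def format_sse(data, event=None):
    """Encode one Server-Sent Events message as text."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines is sent as one "data:" line per line.
    for chunk in data.split("\n"):
        lines.append(f"data: {chunk}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens):
    """Yield one SSE message per generated token, then a final 'done' event."""
    for token in tokens:
        yield format_sse(token)
    # "[DONE]" is an assumed end-of-stream marker for this sketch.
    yield format_sse("[DONE]", event="done")


for message in stream_tokens(["Hel", "lo"]):
    print(repr(message))
```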

Technologies: vLLM, LangChain, Vector DBs, FastAPI, Kubernetes + GPU, Transformers

View Project 03 →


💰 Cost Considerations

Cloud Costs

The core learning materials can be completed within free-tier limits and trial credits; only the optional GPU work listed further down costs extra:

  • AWS: 750 hours/month t2.micro + $300 credits (varies)
  • GCP: $300 credit (90 days)
  • Azure: $200 credit (30 days)

GPU costs (optional, for advanced projects):

  • On-demand: $1-3/hour
  • Spot instances: $0.30-1/hour (70% savings)
  • Estimated total: $50-150 for complete curriculum
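
The hourly ranges above can be combined into a quick budget check. This is an illustrative sketch: the rates are the ones quoted in this section, while the 50 GPU-hours figure and the function names are hypothetical, and actual prices vary by provider, region, and instance type.

```python
def gpu_cost(hours, hourly_rate):
    """Total cost in dollars for a number of GPU-hours at a flat hourly rate."""
    return hours * hourly_rate


def spot_savings(on_demand_rate, spot_rate):
    """Fraction saved by using spot instead of on-demand pricing."""
    return 1 - spot_rate / on_demand_rate


HOURS = 50  # hypothetical GPU-hours for the optional advanced work

print(f"on-demand: ${gpu_cost(HOURS, 1.00):.0f}-${gpu_cost(HOURS, 3.00):.0f}")
print(f"spot:      ${gpu_cost(HOURS, 0.30):.0f}-${gpu_cost(HOURS, 1.00):.0f}")
print(f"savings:   {spot_savings(1.00, 0.30):.0%} (low end), "
      f"{spot_savings(3.00, 1.00):.0%} (high end)")
# on-demand: $50-$150
# spot:      $15-$50
# savings:   70% (low end), 67% (high end)
```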

Optimization Tips

  • Use spot instances for training (60-90% savings)
  • Leverage free tiers across multiple cloud providers
  • Delete resources when not in use
  • Use local development where possible

📚 Resources

Included Documentation

  • Comprehensive lesson materials with examples
  • Code stubs with TODO comments for guided implementation
  • Complete project specifications with architecture diagrams
  • Quizzes and assessments for each module
  • Best practices and design patterns

External Resources

Curriculum Documentation


🎯 Learning Outcomes & Career Impact

After Completion, You'll Be Qualified For:

AI Infrastructure Engineer

  • 💰 Salary: $120,000 - $180,000
  • 🏢 Companies: Tech companies, AI startups, ML-focused organizations
  • 📈 Demand: Very high (growing 35% year-over-year)

ML Platform Engineer

  • 💰 Salary: $130,000 - $190,000
  • 🏢 Companies: Large tech firms, enterprises with ML teams
  • 📈 Demand: High (specialized role)

MLOps Engineer

  • 💰 Salary: $110,000 - $170,000
  • 🏢 Companies: All organizations doing ML at scale
  • 📈 Demand: Very high (fastest growing ML role)

Skills You'll Demonstrate

  • ✅ Kubernetes expertise with GPU scheduling
  • ✅ End-to-end MLOps pipeline implementation
  • ✅ LLM infrastructure and RAG systems
  • ✅ Distributed training and GPU optimization
  • ✅ Production monitoring and observability
  • ✅ Cloud platform mastery (AWS, GCP, Azure)
  • ✅ Infrastructure as Code with Terraform
  • ✅ Cost optimization strategies


📊 Repository Statistics

  • Total Files: 207
  • Estimated Lines: ~95,000+
  • Modules: 10 (all complete)
  • Projects: 3 (all complete)
  • Learning Hours: 500+
  • Technologies: 50+

Technology Stack Covered

Core Infrastructure: Docker, Kubernetes, Terraform, Helm, ArgoCD

ML & Data: PyTorch, TensorFlow, Apache Airflow, Apache Spark, Kafka, DVC

MLOps: MLflow, Feature Stores, Model Registry, CI/CD

LLM Infrastructure: vLLM, TensorRT-LLM, LangChain, Vector Databases (Pinecone, Milvus, ChromaDB)

Cloud Platforms: AWS (EC2, S3, EKS, SageMaker), GCP (GCE, GCS, GKE, Vertex AI), Azure (VMs, AKS, Azure ML)

Monitoring: Prometheus, Grafana, OpenTelemetry, Jaeger, ELK Stack

GPU Computing: CUDA, NCCL, Multi-GPU training, Distributed training


🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Bug reports and fixes
  • Documentation improvements
  • New exercises and examples
  • Updated best practices

🆘 Getting Help


📜 License

This project is licensed under the MIT License - see LICENSE for details.


🌟 Success Metrics

Upon completion, you should be able to:

  • Deploy ML models to production with confidence
  • Build complete MLOps pipelines from scratch
  • Implement LLM infrastructure with RAG
  • Optimize cloud costs by 60-80%
  • Debug complex distributed systems
  • Pass technical interviews for AI Infrastructure roles
  • Confidently discuss trade-offs in system design
  • Lead infrastructure projects at your organization

🚀 Next Steps After Completion

This curriculum prepares you for AI Infrastructure Engineer roles. For career progression:

  1. Gain Experience (1-2 years)
    • Work on production ML systems
    • Handle incidents and on-call rotations
    • Contribute to open-source ML infrastructure projects
  2. Advance to Senior Engineer (2-3 years total)
    • Follow our Senior AI Infrastructure Engineer curriculum (coming soon)
    • Lead larger projects and mentor junior engineers
    • Design complex systems
  3. Become an Architect (4-6 years total)
    • Follow our AI Infrastructure Architect curriculum (coming soon)
    • Design enterprise ML platforms
    • Provide strategic technical leadership

Ready to Master AI Infrastructure Engineering?

Start your journey today!

📘 Get Started | 📚 View Full Curriculum | 🚀 Start Module 01


Star this repository if you find it valuable!

Share with others learning AI Infrastructure Engineering!


Maintained by the AI Infrastructure Curriculum Project

Contact: ai-infra-curriculum@joshua-ferguson.com

Happy Learning! 🎓🚀