Summary
When using `parallel_backend="ray"` (the default), Ray auto-packages the working directory and creates a fresh virtual environment per worker in a temporary directory. For projects with heavy dependencies (e.g., PyTorch ~12GB), this causes:
- Disk exhaustion: Each Ray cluster creates a full venv copy (~12GB+). With 4-10 concurrent experiments, this can consume 50-120GB+ in `/tmp`, filling the root partition.
- Worker startup hangs: Workers hang during `uv sync` / `pip install` in the temp venv, producing repeated `worker_pool.cc: Some workers of the worker process have not registered within the timeout` errors.
- GCS crashes: When too many clusters compete for resources, Ray's GCS (Global Control Store) becomes unresponsive, causing `Failed to connect to GCS within 60 seconds` and terminating experiments.
- AF_UNIX socket path limit: If the temp directory path is long (e.g., on a scratch filesystem), the Unix socket path exceeds the 107-byte limit, causing `OSError: AF_UNIX path length cannot exceed 107 bytes`.
Reproduction
- Have a project with a `pyproject.toml` that depends on PyTorch (or similarly large packages)
- Launch 3+ concurrent `Study.run()` calls with `parallel_backend="ray"` (the default)
- Observe `/tmp` filling up with `ray_*` directories, each containing a full venv
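To confirm the last symptom, a small stdlib-only helper can list the leftover `ray_*` directories and total their on-disk size (the `ray_*` prefix and `/tmp` location are taken from the observation above):

```python
from pathlib import Path


def ray_tmp_usage(tmp="/tmp"):
    """Return (path, total_bytes) for each leftover ray_* directory under tmp."""
    usage = []
    for d in Path(tmp).glob("ray_*"):
        if d.is_dir():
            size = sum(f.stat().st_size for f in d.rglob("*") if f.is_file())
            usage.append((str(d), size))
    return sorted(usage)
```

Running this after a few experiments makes the growth easy to quantify before the partition actually fills.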
Root Cause
In `agentlab/experiments/launch_exp.py:85`:

```python
ray.init(num_cpus=n_jobs)
```

This bare `ray.init()` causes Ray to auto-detect the working directory (which contains `pyproject.toml`) and package it for workers. Each worker then runs `uv sync` to create a fresh venv in the Ray temp directory, re-installing all dependencies from scratch.
Key behaviors:
- Ray creates a new temp directory per `ray.init()` call (each experiment gets its own cluster)
- Each cluster's workers build an independent venv copy
- Failed/completed experiments leave their temp directories behind (no cleanup)
- `ray.shutdown()` in `launch_exp.py:89` does not clean up the temp directory
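The accumulation can be modeled in a few lines of stdlib Python: each launch maps to a fresh temp directory, and shutdown (a no-op here, like the real `ray.shutdown()` with respect to the temp dir) never removes it.

```python
import tempfile
from pathlib import Path


def simulate_launches(n_experiments, base_dir):
    """Model the leak: one new ray_* temp dir per ray.init(), never cleaned up."""
    for _ in range(n_experiments):
        tempfile.mkdtemp(prefix="ray_", dir=base_dir)  # stands in for ray.init()
        # ray.shutdown() would run here -- it stops the cluster but keeps the dir
    return sorted(Path(base_dir).glob("ray_*"))
```

With a ~12GB venv inside each directory, five launches already account for ~60GB of unreclaimed space.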
Impact
- Experiments silently fail with `ENOSPC` errors (Playwright can't create browser profiles when the disk is full)
- Hundreds of tasks get recorded as errors that are actually disk-full failures, requiring full reruns
- The problem compounds: each relaunch creates additional temp directories
Workarounds
Using `parallel_backend="joblib"` avoids Ray entirely and doesn't have this issue. However, joblib doesn't support Ray's task graph execution (dependency tracking between tasks).
Another workaround is to set `RAY_TMPDIR` to a large filesystem and create isolated Ray temp dirs there, but this hits the AF_UNIX 107-byte socket path limit if the path is too long.
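Before pointing `RAY_TMPDIR` at a scratch filesystem, the 107-byte limit can be checked up front. The session/socket suffix below is an illustrative assumption about how Ray lays out its socket files, not an exact path:

```python
import os

AF_UNIX_MAX = 107  # sun_path limit on Linux

# Representative worst-case suffix Ray appends under the temp dir
# (assumed here for illustration; the real layout may differ slightly).
SOCKET_SUFFIX = "session_2024-01-01_00-00-00_000000_00000/sockets/raylet"


def tmpdir_fits_af_unix(tmpdir):
    """Return True if sockets under tmpdir would stay within the AF_UNIX limit."""
    candidate = os.path.join(tmpdir, SOCKET_SUFFIX)
    return len(candidate.encode()) <= AF_UNIX_MAX
```

A short path such as `/tmp` passes; a deep scratch path fails, which matches the `OSError` reported above.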
Suggested Fix
- Disable Ray's auto-packaging by setting `runtime_env={"worker_process_setup_hook": ...}` or `RAY_RUNTIME_ENV_HOOK` to prevent venv creation in workers
- Or pass `runtime_env={"py_modules": [...]}` with only the necessary modules instead of the full working directory
- Or set `ray.init(runtime_env={"working_dir": None})` to prevent auto-packaging
- Add cleanup of the Ray temp directory in the `finally` block after `ray.shutdown()`
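A minimal stdlib-only sketch combining the last two suggestions. The `ray.init`/`ray.shutdown` calls are shown as comments so the sketch stays runnable without Ray; `_temp_dir` is Ray's underscore-prefixed (semi-private) init argument, and `launch_with_ray`/`run_fn` are hypothetical names, not the actual `launch_exp.py` API:

```python
import shutil
import tempfile


def launch_with_ray(n_jobs, run_fn):
    # Create the cluster's temp dir ourselves so we know exactly what to delete.
    tmp_dir = tempfile.mkdtemp(prefix="ray_session_")
    init_kwargs = {
        "num_cpus": n_jobs,
        "_temp_dir": tmp_dir,                  # pin the cluster to our temp dir
        "runtime_env": {"working_dir": None},  # stop auto-packaging the cwd
    }
    try:
        # ray.init(**init_kwargs)  # the real call, elided to keep this stdlib-only
        return run_fn(init_kwargs)
    finally:
        # ray.shutdown()  # stops the cluster but does NOT remove tmp_dir
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

The `finally` block guarantees the directory is reclaimed even when an experiment crashes, which also stops the compounding effect described under Impact.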
Environment
- AgentLab v0.4.0
- Ray 2.51.1
- Python 3.12
- Ubuntu 22.04
- Dependencies include PyTorch 2.8.0 (~12GB installed)