fix: ModelBuilder with source_code + DJL LMI: /opt/ml/model becomes read-only, breaki (5698) by aviruthen · Pull Request #5733 · aws/sagemaker-python-sdk

aviruthen · 2026-04-07T19:58:29Z

Description

The issue has two root causes in _build_for_djl() in model_builder_servers.py:

1. Missing HF cache redirection: Unlike _build_for_tgi() and _build_for_tei() which set HF_HOME=/tmp and HUGGINGFACE_HUB_CACHE=/tmp, the DJL builder never sets these environment variables. When source_code is provided, the model artifacts (requirements.txt etc.) get packaged as model.tar.gz and mounted read-only at /opt/ml/model/. The DJL container then tries to download HF models to /opt/ml/model/ (the default cache location) and fails with EROFS.
2. HF_MODEL_ID override: _build_for_djl() unconditionally calls self.env_vars.update({'HF_MODEL_ID': self.model}), which overwrites any user-provided HF_MODEL_ID value. This prevents users from setting HF_MODEL_ID to a local path (e.g., /opt/ml/model) when they want to use pre-downloaded model artifacts.

The fix adds HF cache environment variables (HF_HOME, HUGGINGFACE_HUB_CACHE) pointing to /tmp for the DJL builder (matching TGI/TEI behavior), and changes HF_MODEL_ID to use setdefault() so user-provided values are preserved.
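
The difference between the old and new behavior can be illustrated with plain dictionaries. This is a minimal sketch, not the code in model_builder_servers.py: the function names build_env_before_fix and build_env_after_fix and the model ID "some-org/some-model" are hypothetical, and only the update() and setdefault() calls mirror what the description states.

```python
def build_env_before_fix(user_env, model_id):
    """Hypothetical stand-in for the old behavior described above."""
    env = dict(user_env)
    # update() always overwrites, so a user-provided HF_MODEL_ID is lost.
    env.update({"HF_MODEL_ID": model_id})
    return env


def build_env_after_fix(user_env, model_id):
    """Hypothetical stand-in for the behavior after the fix."""
    env = dict(user_env)
    # setdefault() only writes the key when the user has not set it.
    env.setdefault("HF_MODEL_ID", model_id)
    # Redirect the Hugging Face cache to a writable location.
    env.update({"HF_HOME": "/tmp", "HUGGINGFACE_HUB_CACHE": "/tmp"})
    return env


user_env = {"HF_MODEL_ID": "/opt/ml/model"}  # user wants pre-downloaded artifacts
before = build_env_before_fix(user_env, "some-org/some-model")
after = build_env_after_fix(user_env, "some-org/some-model")
```

With these inputs, before["HF_MODEL_ID"] is the hub ID while after["HF_MODEL_ID"] keeps the user's local path.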

Related Issue

Related issue: 5698

Changes Made

sagemaker-serve/src/sagemaker/serve/model_builder_servers.py
sagemaker-serve/tests/unit/servers/__init__.py
sagemaker-serve/tests/unit/servers/test_djl_hf_cache_env.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

Confidence score: 85%
Classification: bug
SDK version target: V3

Merge Checklist

Changes are backward compatible
Commit message follows prefix: description format
Unit tests added/updated
Integration tests added (if applicable)
Documentation updated (if applicable)


sagemaker-bot

🤖 AI Code Review

This PR fixes a real bug where DJL builder lacks HF cache redirection env vars (unlike TGI/TEI), causing read-only filesystem errors when source_code is provided. The fix is sound, but there are issues with duplicate env var setting, test style (unittest instead of pytest), and significant test code duplication that should be addressed.

sagemaker-bot · 2026-04-07T21:49:35Z

sagemaker-serve/src/sagemaker/serve/model_builder_servers.py

                "OPTION_MODEL_LOADING_TIMEOUT": "240",
                "OPTION_PREDICT_TIMEOUT": "60",
-                "TENSOR_PARALLEL_DEGREE": "1"  # Default, will be overridden below
+                "TENSOR_PARALLEL_DEGREE": "1",  # Default, will be overridden below


Potential duplicate/conflicting env var setting: HF_HOME and HUGGINGFACE_HUB_CACHE are set here (inside the not self._is_jumpstart_model_id() branch, lines 348-349), and then again in the else branch at lines 375-376 (non-local mode). For non-local, non-JumpStart models, these values are therefore set twice, which is harmless but redundant. However, for JumpStart models or local modes, the behavior differs:

Local mode: The if self.mode in LOCAL_MODES branch sets HF_HUB_OFFLINE but does NOT set HF_HOME/HUGGINGFACE_HUB_CACHE. If the model is not a JumpStart model, these were already set at line 348. But if it IS a JumpStart model, they won't be set at all in local mode. Is that intentional?

Non-local JumpStart models: They'll get the env vars from lines 375-376 but not from 348-349.

Consider consolidating the HF cache env var setting to a single location (e.g., always set them regardless of JumpStart status and mode) to make the logic clearer and avoid subtle gaps.
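
One possible shape for that consolidation is sketched below. It is an illustration under assumptions, not the SDK's code: apply_hf_env is a hypothetical helper, and LOCAL_MODES here is a set of placeholder strings standing in for whatever values the SDK actually uses.

```python
# Placeholder stand-in for the SDK's LOCAL_MODES; the real values are assumed to differ.
LOCAL_MODES = {"LOCAL_CONTAINER", "IN_PROCESS"}


def apply_hf_env(env_vars, mode):
    """Hypothetical helper: set the HF cache vars once, for every mode and model type."""
    # Single location for cache redirection, independent of JumpStart status.
    env_vars.update({"HF_HOME": "/tmp", "HUGGINGFACE_HUB_CACHE": "/tmp"})
    # Mode-specific behavior stays separate from cache redirection.
    if mode in LOCAL_MODES:
        env_vars.update({"HF_HUB_OFFLINE": "1"})
    return env_vars


local_env = apply_hf_env({}, "LOCAL_CONTAINER")
remote_env = apply_hf_env({}, "SAGEMAKER_ENDPOINT")
```

Because the cache variables no longer depend on which branch runs, the local-mode JumpStart gap described above cannot occur in this arrangement.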

sagemaker-bot · 2026-04-07T21:49:35Z

sagemaker-serve/src/sagemaker/serve/model_builder_servers.py

        # Cache management based on mode
        if self.mode in LOCAL_MODES:
            self.env_vars.update({"HF_HUB_OFFLINE": "1"})
+        else:


Consider using setdefault here too: For consistency with the HF_MODEL_ID change, consider using self.env_vars.setdefault("HF_HOME", "/tmp") and self.env_vars.setdefault("HUGGINGFACE_HUB_CACHE", "/tmp") so that if a user explicitly provides these env vars (e.g., pointing to a different writable directory), their values are preserved. The same applies to lines 348-349.
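
A sketch of that suggestion, assuming a hypothetical helper apply_hf_cache_defaults and a made-up user directory /mnt/cache:

```python
HF_CACHE_DEFAULTS = {"HF_HOME": "/tmp", "HUGGINGFACE_HUB_CACHE": "/tmp"}


def apply_hf_cache_defaults(env_vars):
    """Hypothetical helper: fill in cache vars only where the user left them unset."""
    for key, value in HF_CACHE_DEFAULTS.items():
        env_vars.setdefault(key, value)
    return env_vars


# The user points HF_HOME at another writable directory (illustrative path).
custom_env = apply_hf_cache_defaults({"HF_HOME": "/mnt/cache"})
```

Here custom_env keeps HF_HOME as /mnt/cache and only HUGGINGFACE_HUB_CACHE falls back to /tmp.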

sagemaker-bot · 2026-04-07T21:49:35Z