
fix: ModelBuilder with source_code + DJL LMI: /opt/ml/model becomes read-only, breaki (5698)#5733

Draft
aviruthen wants to merge 2 commits into aws:master from aviruthen:fix/modelbuilder-with-source-code-djl-lmi-opt-ml-5698

Conversation

@aviruthen
Collaborator

Description

The issue has two root causes in _build_for_djl() in model_builder_servers.py:

  1. Missing HF cache redirection: Unlike _build_for_tgi() and _build_for_tei(), which set HF_HOME=/tmp and HUGGINGFACE_HUB_CACHE=/tmp, the DJL builder never sets these environment variables. When source_code is provided, the model artifacts (requirements.txt, etc.) are packaged as model.tar.gz and mounted read-only at /opt/ml/model/. The DJL container then tries to download HF models to /opt/ml/model/ (the default cache location) and fails with EROFS.

  2. HF_MODEL_ID override: _build_for_djl() unconditionally calls self.env_vars.update({'HF_MODEL_ID': self.model}), which overwrites any user-provided HF_MODEL_ID value. This prevents users from setting HF_MODEL_ID to a local path (e.g., /opt/ml/model) when they want to use pre-downloaded model artifacts.

The fix adds HF cache environment variables (HF_HOME, HUGGINGFACE_HUB_CACHE) pointing to /tmp for the DJL builder (matching TGI/TEI behavior), and changes HF_MODEL_ID to use setdefault() so user-provided values are preserved.
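
The difference between the two calls can be seen on a plain dictionary. This is a minimal sketch, not the SDK code: `USER_ENV` and the Hub model ID are made-up stand-ins, and the snippet only shows that `dict.update()` overwrites a user-supplied value while `dict.setdefault()` leaves it alone.

```python
# Stand-in values for illustration; not taken from the SDK.
USER_ENV = {"HF_MODEL_ID": "/opt/ml/model"}  # user points at local artifacts
HUB_MODEL = "my-org/my-model"                # hypothetical Hub model ID

# Old behavior: update() always overwrites the existing key.
old_env = dict(USER_ENV)
old_env.update({"HF_MODEL_ID": HUB_MODEL})

# New behavior: setdefault() only fills in the key when it is missing.
new_env = dict(USER_ENV)
new_env.setdefault("HF_MODEL_ID", HUB_MODEL)

print(old_env["HF_MODEL_ID"])  # the user's path has been replaced
print(new_env["HF_MODEL_ID"])  # the user's path is preserved
```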

Related Issue

Related issue: 5698

Changes Made

  • sagemaker-serve/src/sagemaker/serve/model_builder_servers.py
  • sagemaker-serve/tests/unit/servers/__init__.py
  • sagemaker-serve/tests/unit/servers/test_djl_hf_cache_env.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 85%
  • Classification: bug
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Collaborator

@sagemaker-bot left a comment

🤖 AI Code Review

This PR fixes a real bug where DJL builder lacks HF cache redirection env vars (unlike TGI/TEI), causing read-only filesystem errors when source_code is provided. The fix is sound, but there are issues with duplicate env var setting, test style (unittest instead of pytest), and significant test code duplication that should be addressed.

  "OPTION_MODEL_LOADING_TIMEOUT": "240",
  "OPTION_PREDICT_TIMEOUT": "60",
- "TENSOR_PARALLEL_DEGREE": "1" # Default, will be overridden below
+ "TENSOR_PARALLEL_DEGREE": "1", # Default, will be overridden below
Collaborator

Potential duplicate/conflicting env var setting: HF_HOME and HUGGINGFACE_HUB_CACHE are set here (inside the not self._is_jumpstart_model_id() branch, lines 348-349) and then again in the else branch at lines 375-376 (non-local mode). For non-local, non-JumpStart models, these values are therefore set twice, which is harmless but redundant. For JumpStart models or local modes, however, the behavior differs:

  • Local mode: The if self.mode in LOCAL_MODES branch sets HF_HUB_OFFLINE but does NOT set HF_HOME/HUGGINGFACE_HUB_CACHE. If the model is not a JumpStart model, these were already set at line 348. But if it IS a JumpStart model, they won't be set at all in local mode. Is that intentional?
  • Non-local JumpStart models: They'll get the env vars from lines 375-376 but not from 348-349.

Consider consolidating the HF cache env var setting to a single location (e.g., always set them regardless of JumpStart status and mode) to make the logic clearer and avoid subtle gaps.
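
One possible shape for that consolidation is a single helper that runs for every combination of mode and JumpStart status. The sketch below is an assumption, not the actual `_build_for_djl()` code: `apply_hf_cache_env`, its parameters, and the `LOCAL_MODES` values are hypothetical names chosen for illustration.

```python
LOCAL_MODES = {"LOCAL_CONTAINER", "IN_PROCESS"}  # hypothetical stand-in values


def apply_hf_cache_env(env_vars: dict, mode: str) -> dict:
    """Set the HF cache variables in one place, independent of JumpStart status."""
    # setdefault keeps any writable directory the caller already chose.
    env_vars.setdefault("HF_HOME", "/tmp")
    env_vars.setdefault("HUGGINGFACE_HUB_CACHE", "/tmp")
    # Local modes additionally run offline, mirroring the existing branch.
    if mode in LOCAL_MODES:
        env_vars.update({"HF_HUB_OFFLINE": "1"})
    return env_vars


remote = apply_hf_cache_env({}, mode="SAGEMAKER_ENDPOINT")
local = apply_hf_cache_env({"HF_HOME": "/scratch/hf"}, mode="LOCAL_CONTAINER")
```

With one call site, every mode and model type gets the cache variables, so the local-mode JumpStart gap described above cannot occur.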

# Cache management based on mode
if self.mode in LOCAL_MODES:
    self.env_vars.update({"HF_HUB_OFFLINE": "1"})
else:
Collaborator

Consider using setdefault here too: For consistency with the HF_MODEL_ID change, consider using self.env_vars.setdefault("HF_HOME", "/tmp") and self.env_vars.setdefault("HUGGINGFACE_HUB_CACHE", "/tmp") so that if a user explicitly provides these env vars (e.g., pointing to a different writable directory), their values are preserved. The same applies to lines 348-349.

@@ -0,0 +1,318 @@
"""Tests for DJL builder HF cache environment variables and HF_MODEL_ID handling.
Collaborator

Test style: The SDK uses pytest as the test framework (per unit test standards). This file uses unittest.TestCase with self.assertEqual. Please refactor to use pytest-style tests with plain assert statements, pytest.fixture for shared setup, and pytest.mark.parametrize where applicable. For example, test_build_for_djl_sets_hf_home_to_tmp and test_build_for_djl_sets_huggingface_hub_cache_to_tmp test the same scenario and could be a single test or parametrized.
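
As a rough illustration of the parametrized form, the sketch below folds the two cache-variable checks into one test. `build_env` is a hypothetical stand-in for whatever the real tests obtain from the DJL builder; only `pytest.mark.parametrize` itself is real pytest API.

```python
import pytest


def build_env(user_env=None):
    """Hypothetical stand-in for the env vars produced by the DJL builder."""
    env = dict(user_env or {})
    env.setdefault("HF_HOME", "/tmp")
    env.setdefault("HUGGINGFACE_HUB_CACHE", "/tmp")
    return env


@pytest.mark.parametrize("var_name", ["HF_HOME", "HUGGINGFACE_HUB_CACHE"])
def test_cache_env_var_defaults_to_tmp(var_name):
    # One test body covers both variables; pytest reports each case separately.
    assert build_env()[var_name] == "/tmp"
```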

@patch('sagemaker.serve.model_builder_servers._get_nb_instance')
def test_build_for_djl_sets_hf_home_to_tmp(
    self, mock_nb, mock_djl_config, mock_hf_config, mock_is_js,
    mock_validate, mock_auto_detect, mock_prepare, mock_create,
Collaborator

Massive test duplication: Nearly every test method has the same ~30 lines of mock setup and builder construction copied verbatim. Extract the common mock setup and builder creation into a pytest.fixture (or at minimum a helper method). This would reduce the file by ~60% and make it much easier to maintain. For example:

@pytest.fixture
def djl_builder(mock_session, temp_dir):
    # common mock patches and builder setup
    ...
    return builder

Collaborator Author

Agreed! Only a few tests would suffice. We should strive for test quality over quantity!

@patch('sagemaker.serve.model_builder_servers._get_nb_instance')
def test_build_for_djl_sets_hf_home_to_tmp(
    self, mock_nb, mock_djl_config, mock_hf_config, mock_is_js,
    mock_validate, mock_auto_detect, mock_prepare, mock_create,
Collaborator

Tests test_build_for_djl_sets_hf_home_to_tmp and test_build_for_djl_sets_huggingface_hub_cache_to_tmp are redundant with test_build_for_djl_with_source_code_and_hf_model_id: The last test already asserts both HF_HOME and HUGGINGFACE_HUB_CACHE. Consider consolidating these three tests into one that checks both env vars, following the "one logical assertion per test" guideline (checking two related env vars from the same operation is one logical assertion).


import unittest
from unittest.mock import Mock, patch, MagicMock
import tempfile
Collaborator

Unused imports: MagicMock is imported but never used. os and shutil are used only for temp dir cleanup, which pytest's tmp_path fixture handles automatically. Remove all three imports once the tests use tmp_path.
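
For reference, a test that relies on `tmp_path` needs no manual cleanup code. This is a minimal sketch with a made-up test name and file contents; `tmp_path` is pytest's built-in fixture and is injected by argument name, so no import of `os`, `shutil`, or `tempfile` is required.

```python
from pathlib import Path


def test_source_dir_contains_requirements(tmp_path: Path):
    # pytest supplies a fresh temporary directory as a pathlib.Path for each
    # test and manages its lifetime, so no teardown code is needed here.
    requirements = tmp_path / "requirements.txt"
    requirements.write_text("transformers\n")
    assert requirements.read_text() == "transformers\n"
```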

MOCK_ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerRole"
MOCK_IMAGE_URI = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.36.0-lmi22.0.0-cu129"
MOCK_HF_MODEL_CONFIG = {"model_type": "gpt2", "architectures": ["GPT2LMHeadModel"]}

Collaborator

Hardcoded AWS account ID in mock: MOCK_ROLE_ARN contains 123456789012 and MOCK_IMAGE_URI contains a real ECR registry ID (763104351884). While these are mocks, using a clearly fake ECR URI (e.g., 000000000000.dkr.ecr.us-east-1.amazonaws.com/djl-inference:latest) would be more consistent with test standards that avoid real account/region references.

  if isinstance(self.model, str) and not self._is_jumpstart_model_id():
      # Configure HuggingFace model for DJL
-     self.env_vars.update({"HF_MODEL_ID": self.model})
+     self.env_vars.setdefault("HF_MODEL_ID", self.model)
Collaborator

Trailing whitespace: There appears to be trailing whitespace at the end of this line, after the closing parenthesis of the setdefault() call. CI formatting tools may catch this, but it is worth noting.

@aviruthen
Collaborator Author

🤖 Iteration #1 — Review Comments Addressed

Description

Fix ModelBuilder with source_code + DJL LMI: /opt/ml/model becomes read-only, breaking HF Hub model downloads.

The issue has two root causes in _build_for_djl() in model_builder_servers.py:

  1. Missing HF cache redirection: Unlike _build_for_tgi() and _build_for_tei(), which set HF_HOME=/tmp and HUGGINGFACE_HUB_CACHE=/tmp, the DJL builder never set these environment variables. When source_code is provided, the model artifacts (requirements.txt, etc.) are packaged as model.tar.gz and mounted read-only at /opt/ml/model/. The DJL container then tries to download HF models to /opt/ml/model/ (the default cache location) and fails with EROFS.

  2. HF_MODEL_ID override: _build_for_djl() unconditionally called self.env_vars.update({'HF_MODEL_ID': self.model}), which overwrote any user-provided HF_MODEL_ID value. This prevented users from setting HF_MODEL_ID to a local path (e.g., /opt/ml/model) when they want to use pre-downloaded model artifacts.

Changes Made

  • sagemaker-serve/src/sagemaker/serve/model_builder_servers.py:

    • Added HF_HOME and HUGGINGFACE_HUB_CACHE env vars pointing to /tmp for the DJL builder, consolidated to a single location using setdefault() so user-provided values are preserved
    • Changed HF_MODEL_ID to use setdefault() so user-provided values are not overridden
  • sagemaker-serve/tests/unit/servers/test_djl_hf_cache_env.py:

    • Added pytest-style tests verifying HF cache env vars, HF_MODEL_ID preservation, and local mode offline behavior
    • Uses fixtures and helper functions to minimize duplication

Comments reviewed: 9
Files modified: sagemaker-serve/src/sagemaker/serve/model_builder_servers.py, sagemaker-serve/tests/unit/servers/test_djl_hf_cache_env.py

  • sagemaker-serve/src/sagemaker/serve/model_builder_servers.py: Fix DJL builder to use setdefault for HF_MODEL_ID (preserving user values), consolidate HF cache env vars to a single location using setdefault, and remove trailing whitespace
  • sagemaker-serve/tests/unit/servers/test_djl_hf_cache_env.py: Rewrite tests using pytest style with fixtures, consolidating redundant tests and cleaning up imports

2 participants