fix: ModelBuilder with source_code + DJL LMI: /opt/ml/model becomes read-only, breaki (5698)#5733
Conversation
…ead-only, breaki (5698)
sagemaker-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR fixes a real bug where DJL builder lacks HF cache redirection env vars (unlike TGI/TEI), causing read-only filesystem errors when source_code is provided. The fix is sound, but there are issues with duplicate env var setting, test style (unittest instead of pytest), and significant test code duplication that should be addressed.
| "OPTION_MODEL_LOADING_TIMEOUT": "240", | ||
| "OPTION_PREDICT_TIMEOUT": "60", | ||
| "TENSOR_PARALLEL_DEGREE": "1" # Default, will be overridden below | ||
| "TENSOR_PARALLEL_DEGREE": "1", # Default, will be overridden below |
There was a problem hiding this comment.
Potential duplicate/conflicting env var setting: HF_HOME and HUGGINGFACE_HUB_CACHE are set here (inside the not self._is_jumpstart_model_id() branch, line 348-349), and then again in the else branch at lines 375-376 (non-local mode). This means for non-local, non-JumpStart models, these values get set twice (which is harmless but redundant). However, for JumpStart models or local modes, the behavior differs:
- Local mode: The
if self.mode in LOCAL_MODESbranch setsHF_HUB_OFFLINEbut does NOT setHF_HOME/HUGGINGFACE_HUB_CACHE. If the model is not a JumpStart model, these were already set at line 348. But if it IS a JumpStart model, they won't be set at all in local mode. Is that intentional? - Non-local JumpStart models: They'll get the env vars from lines 375-376 but not from 348-349.
Consider consolidating the HF cache env var setting to a single location (e.g., always set them regardless of JumpStart status and mode) to make the logic clearer and avoid subtle gaps.
| # Cache management based on mode | ||
| if self.mode in LOCAL_MODES: | ||
| self.env_vars.update({"HF_HUB_OFFLINE": "1"}) | ||
| else: |
There was a problem hiding this comment.
Consider using setdefault here too: For consistency with the HF_MODEL_ID change, consider using self.env_vars.setdefault("HF_HOME", "/tmp") and self.env_vars.setdefault("HUGGINGFACE_HUB_CACHE", "/tmp") so that if a user explicitly provides these env vars (e.g., pointing to a different writable directory), their values are preserved. The same applies to lines 348-349.
| @@ -0,0 +1,318 @@ | |||
| """Tests for DJL builder HF cache environment variables and HF_MODEL_ID handling. | |||
There was a problem hiding this comment.
Test style: The SDK uses pytest as the test framework (per unit test standards). This file uses unittest.TestCase with self.assertEqual. Please refactor to use pytest-style tests with plain assert statements, pytest.fixture for shared setup, and pytest.mark.parametrize where applicable. For example, test_build_for_djl_sets_hf_home_to_tmp and test_build_for_djl_sets_huggingface_hub_cache_to_tmp test the same scenario and could be a single test or parametrized.
| @patch('sagemaker.serve.model_builder_servers._get_nb_instance') | ||
| def test_build_for_djl_sets_hf_home_to_tmp( | ||
| self, mock_nb, mock_djl_config, mock_hf_config, mock_is_js, | ||
| mock_validate, mock_auto_detect, mock_prepare, mock_create, |
There was a problem hiding this comment.
Massive test duplication: Nearly every test method has the same ~30 lines of mock setup and builder construction copied verbatim. Extract the common mock setup and builder creation into a pytest.fixture (or at minimum a helper method). This would reduce the file by ~60% and make it much easier to maintain. For example:
@pytest.fixture
def djl_builder(mock_session, temp_dir):
# common mock patches and builder setup
...
return builderThere was a problem hiding this comment.
Agreed! Only a few tests would suffice. We should strive for test quality over quantity!
| @patch('sagemaker.serve.model_builder_servers._get_nb_instance') | ||
| def test_build_for_djl_sets_hf_home_to_tmp( | ||
| self, mock_nb, mock_djl_config, mock_hf_config, mock_is_js, | ||
| mock_validate, mock_auto_detect, mock_prepare, mock_create, |
There was a problem hiding this comment.
Tests test_build_for_djl_sets_hf_home_to_tmp and test_build_for_djl_sets_huggingface_hub_cache_to_tmp are redundant with test_build_for_djl_with_source_code_and_hf_model_id: The last test already asserts both HF_HOME and HUGGINGFACE_HUB_CACHE. Consider consolidating these three tests into one that checks both env vars, following the "one logical assertion per test" guideline (checking two related env vars from the same operation is one logical assertion).
|
|
||
| import unittest | ||
| from unittest.mock import Mock, patch, MagicMock | ||
| import tempfile |
There was a problem hiding this comment.
Unused imports: MagicMock, os, and shutil are imported but MagicMock is never used. os and shutil are only used for temp dir cleanup which pytest's tmp_path fixture handles automatically. Clean up unused imports.
| MOCK_ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerRole" | ||
| MOCK_IMAGE_URI = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.36.0-lmi22.0.0-cu129" | ||
| MOCK_HF_MODEL_CONFIG = {"model_type": "gpt2", "architectures": ["GPT2LMHeadModel"]} | ||
|
|
There was a problem hiding this comment.
Hardcoded AWS account ID in mock: MOCK_ROLE_ARN contains 123456789012 and MOCK_IMAGE_URI contains a real ECR registry ID (763104351884). While these are mocks, using a clearly fake ECR URI (e.g., 000000000000.dkr.ecr.us-east-1.amazonaws.com/djl-inference:latest) would be more consistent with test standards that avoid real account/region references.
| if isinstance(self.model, str) and not self._is_jumpstart_model_id(): | ||
| # Configure HuggingFace model for DJL | ||
| self.env_vars.update({"HF_MODEL_ID": self.model}) | ||
| self.env_vars.setdefault("HF_MODEL_ID", self.model) |
There was a problem hiding this comment.
Trailing whitespace: There appears to be trailing whitespace on this line (after self.model)). While CI formatting tools may catch this, it's worth noting.
🤖 Iteration #1 — Review Comments AddressedDescriptionFix The issue has two root causes in
Changes Made
Comments reviewed: 9
|
Description
The issue has two root causes in
_build_for_djl()in model_builder_servers.py:Missing HF cache redirection: Unlike
_build_for_tgi()and_build_for_tei()which setHF_HOME=/tmpandHUGGINGFACE_HUB_CACHE=/tmp, the DJL builder never sets these environment variables. When source_code is provided, the model artifacts (requirements.txt etc.) get packaged as model.tar.gz and mounted read-only at /opt/ml/model/. The DJL container then tries to download HF models to /opt/ml/model/ (the default cache location) and fails with EROFS.HF_MODEL_ID override:
_build_for_djl()unconditionally callsself.env_vars.update({'HF_MODEL_ID': self.model}), which overwrites any user-provided HF_MODEL_ID value. This prevents users from setting HF_MODEL_ID to a local path (e.g., /opt/ml/model) when they want to use pre-downloaded model artifacts.The fix adds HF cache environment variables (HF_HOME, HUGGINGFACE_HUB_CACHE) pointing to /tmp for the DJL builder (matching TGI/TEI behavior), and changes HF_MODEL_ID to use setdefault() so user-provided values are preserved.
Related Issue
Related issue: 5698
Changes Made
sagemaker-serve/src/sagemaker/serve/model_builder_servers.pysagemaker-serve/tests/unit/servers/__init__.pysagemaker-serve/tests/unit/servers/test_djl_hf_cache_env.pyAI-Generated PR
This PR was automatically generated by the PySDK Issue Agent.
Merge Checklist
prefix: descriptionformat