Skip to content

fix: ProcessingS3Output's s3_uri to be an optional field (5559)#5730

Draft
aviruthen wants to merge 2 commits intoaws:masterfrom
aviruthen:fix/processings3output-s-s3-uri-to-be-an-optional-5559
Draft

fix: ProcessingS3Output's s3_uri to be an optional field (5559)#5730
aviruthen wants to merge 2 commits intoaws:masterfrom
aviruthen:fix/processings3output-s-s3-uri-to-be-an-optional-5559

Conversation

@aviruthen
Copy link
Copy Markdown
Collaborator

Description

The issue has two parts: (1) The ProcessingS3Output shape class in sagemaker-core/src/sagemaker/core/shapes/shapes.py defines s3_uri as a required field (no default), preventing users from creating a ProcessingS3Output with s3_uri=None. (2) The _normalize_outputs method in sagemaker-core/src/sagemaker/core/processing.py calls urlparse(output.s3_output.s3_uri) unconditionally, which would fail with None. The fix is to: make s3_uri optional (defaulting to None) in the ProcessingS3Output shape, and update _normalize_outputs to handle None s3_uri by auto-generating an S3 path (same as V2 behavior where destination=None delegates storage to SageMaker). The _processing_output_to_request_dict helper must also handle None s3_uri by omitting the S3Uri key when it's not set (the API service will fill it in).

Related Issue

Related issue: 5559

Changes Made

  • sagemaker-core/src/sagemaker/core/processing.py
  • sagemaker-core/tests/unit/test_processing.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 85%
  • Classification: bug
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Copy link
Copy Markdown
Collaborator

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes an issue where ProcessingS3Output's s3_uri was required, preventing users from omitting it. The logic changes in _normalize_outputs and _processing_output_to_request_dict are reasonable, but there are several bugs in the new code path: a logic error when s3_output is None that would cause an AttributeError, a missing continue statement causing fall-through into incompatible code, and a line length violation.

if output.s3_output and output.s3_output.s3_uri is not None and is_pipeline_variable(output.s3_output.s3_uri):
normalized_outputs.append(output)
continue
# If s3_output is None or s3_uri is None, auto-generate an S3 URI
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions making s3_uri optional in the ProcessingS3Output shape class (sagemaker-core/src/sagemaker/core/shapes/shapes.py), but that file is not included in the diff. Without that change, ProcessingS3Output(s3_uri=None, ...) or ProcessingS3Output(local_path=..., s3_upload_mode=...) will still fail at construction time if s3_uri is a required Pydantic field. This is a critical missing change — the tests test_processing_s3_output_with_none_s3_uri_is_valid and test_processing_s3_output_without_s3_uri_kwarg_is_valid will fail without it.

@aws aws deleted a comment from sagemaker-bot Apr 7, 2026
@aviruthen
Copy link
Copy Markdown
Collaborator Author

🤖 Iteration #1 — Review Comments Addressed

Description

Make ProcessingS3Output's s3_uri field optional to support V2 parity where destination=None delegates storage path generation to SageMaker.

Changes

sagemaker-core/src/sagemaker/core/processing.py

  • Updated _normalize_outputs to handle None s3_output and None s3_uri by auto-generating an S3 path (same as V2 behavior)
  • Fixed AttributeError bug where output.s3_output.local_path was accessed when output.s3_output is None — now creates a new ProcessingS3Output with sensible defaults
  • Fixed line length violations for flake8 compliance
  • Added explanatory comments for the inline import (needed to avoid circular imports)
  • Ensured continue statements preserve the original logic flow for pipeline variable paths

sagemaker-core/tests/unit/test_processing.py

  • Strengthened pipeline config test assertion to verify Join object structure instead of just checking is not None
  • Added test for when s3_output is entirely None (not just s3_uri=None)
  • All test fixtures (mock_session) are defined at module level and accessible to all test classes

Note

The ProcessingS3Output shape class in sagemaker-core/src/sagemaker/core/shapes/shapes.py also needs s3_uri changed from a required field to Optional[str] = None. This is a critical companion change — without it, ProcessingS3Output(s3_uri=None, ...) will fail at Pydantic validation time. That file is auto-generated from the service model and may need to be updated separately.

How it works

When a user creates a ProcessingOutput with s3_uri=None (or without specifying s3_uri):

  1. _normalize_outputs detects the missing URI
  2. If running in a pipeline context, generates a Join expression with pipeline execution ID for traceability
  3. If running standalone, generates an S3 path using {bucket}/{prefix}/{job_name}/output/{output_name}
  4. If s3_output itself is None, creates a new ProcessingS3Output with default local_path and s3_upload_mode
  5. _processing_output_to_request_dict omits the S3Uri key when it's None, letting the API service fill it in

Comments reviewed: 28
Files modified: sagemaker-core/src/sagemaker/core/processing.py, sagemaker-core/tests/unit/test_processing.py

  • sagemaker-core/src/sagemaker/core/processing.py: Fix _normalize_outputs to handle None s3_output and None s3_uri, fixing bugs from previous iteration
  • sagemaker-core/tests/unit/test_processing.py: Improve test assertions for pipeline config case and add Join import

Copy link
Copy Markdown
Collaborator

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes issue #5559 by making s3_uri optional in ProcessingS3Output and updating _normalize_outputs to auto-generate S3 paths when s3_uri is None. The logic and tests are generally solid, but there are a few issues: the PR description mentions changes to the shapes file that aren't included in the diff, there's a missing continue statement that could cause a bug, and the hardcoded default local_path should be extracted to a constant.

s3_upload_mode="EndOfJob",
)
normalized_outputs.append(output)
continue
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing continue statement after the else branch. When output.s3_output already exists but s3_uri is None (line 530: output.s3_output.s3_uri = s3_uri), execution falls through to the continue on line 541. However, when output.s3_output is None and a new _ProcessingS3Output is created (this branch), the continue on line 541 only covers this else block. But looking more carefully, the continue on line 541 is inside the if not output.s3_output or output.s3_output.s3_uri is None: block, so both sub-branches reach it.

Actually wait — the continue on line 541 is at the indentation level of the inner else (line 533). The if output.s3_output: branch on line 529 does NOT have a continue, so after setting output.s3_output.s3_uri = s3_uri on line 530, execution will fall through past line 541 to the urlparse call on line 543, which would then parse the just-assigned s3_uri. This is likely unintended — the auto-generated URI should not go through the parse_result.scheme != 's3' check again (especially if it's a Join object in pipeline mode, which would fail).

Fix: Move the continue to be at the same indentation level as the outer if not output.s3_output or output.s3_output.s3_uri is None: block, or add a continue after line 530:

if output.s3_output:
    output.s3_output.s3_uri = s3_uri
    normalized_outputs.append(output)
    continue
else:
    ...

else:
# s3_output is None — create a new one with
# sensible defaults.
# Import here to avoid circular import with
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing normalized_outputs.append(output) and continue for this branch. When output.s3_output exists but s3_uri is None, after assigning the auto-generated s3_uri, the code doesn't append to normalized_outputs or continue. It falls through to urlparse(output.s3_output.s3_uri) on line 543. In pipeline mode, s3_uri would be a Join object, and calling urlparse() on it would fail.

Suggested fix:

if output.s3_output:
    # s3_output exists but s3_uri is None
    output.s3_output.s3_uri = s3_uri
    normalized_outputs.append(output)
    continue

output.s3_output = _ProcessingS3Output(
s3_uri=s3_uri,
local_path="/opt/ml/processing/output",
s3_upload_mode="EndOfJob",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract magic strings to constants. The hardcoded "/opt/ml/processing/output" and "EndOfJob" should be module-level constants (e.g., DEFAULT_PROCESSING_LOCAL_OUTPUT_PATH and DEFAULT_S3_UPLOAD_MODE) per SDK coding standards. These values appear to be SageMaker conventions and extracting them improves maintainability.

DEFAULT_PROCESSING_LOCAL_OUTPUT_PATH = "/opt/ml/processing/output"
DEFAULT_S3_UPLOAD_MODE = "EndOfJob"

# Import here to avoid circular import with
# shapes module.
from sagemaker.core.shapes import (
ProcessingS3Output as _ProcessingS3Output,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circular import guard concern. The comment says "Import here to avoid circular import with shapes module," but ProcessingS3Output is likely already imported at the top of this file (it's used elsewhere in the module, and the tests import it directly). If it's already imported at module level, this local import is unnecessary. If there truly is a circular import risk, please document which cycle it breaks. Otherwise, move the import to the top of the file.

class TestProcessingS3OutputOptionalS3Uri:
"""Tests for ProcessingS3Output with optional s3_uri (issue #5559)."""

def test_processing_s3_output_with_none_s3_uri_is_valid(self):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing change to shapes file. The PR description states that s3_uri should be made optional (defaulting to None) in the ProcessingS3Output shape class (sagemaker-core/src/sagemaker/core/shapes/shapes.py), but this file is not included in the diff. Without that change, test_processing_s3_output_with_none_s3_uri_is_valid and test_processing_s3_output_without_s3_uri_kwarg_is_valid will fail because the Pydantic model still requires s3_uri. Please include the shapes file change or confirm it was already made separately.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$context sagemaker-core/src/sagemaker/core/shapes/shapes.py

role="arn:aws:iam::123456789012:role/SageMakerRole",
image_uri="test-image:latest",
instance_count=1,
instance_type="ml.m5.xlarge",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test may not catch the fall-through bug. This test verifies that the generated URI starts with s3:// and contains expected substrings, but it doesn't verify that urlparse is NOT called on the auto-generated path (which would be the symptom of the missing continue/append bug noted above). Consider adding a test for the pipeline config case where s3_uri=None — if the fall-through bug exists, urlparse would be called on a Join object and raise an error, which would make test_normalize_outputs_with_none_s3_uri_and_pipeline_config fail. Make sure that test actually exercises the code path correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants