fix: load_feature_definitions_from_dataframe() doesn't recognize pandas nullable dtyp (5675)#5732

Draft
aviruthen wants to merge 2 commits into aws:master from aviruthen:fix/load-feature-definitions-from-dataframe-doesn-t-5675

Conversation

@aviruthen
Collaborator

Description

The issue is in sagemaker-mlops/src/sagemaker/mlops/feature_store/feature_utils.py. The _INTEGER_TYPES and _FLOAT_TYPES sets contain only lowercase numpy dtype names (e.g., 'int64', 'float64'). Pandas nullable dtypes use capitalized names for numeric types (e.g., 'Int64', 'Float64') and the name 'string' for text, so none of them match, and every nullable-typed column falls through to StringFeatureDefinition. The fix adds the pandas nullable dtype names to _INTEGER_TYPES and _FLOAT_TYPES and adds 'string' to the string-type handling in _generate_feature_definition. PR #3740 fixed this in V2, but the fix was not carried over to the V3 (sagemaker-mlops) codebase. The _DTYPE_TO_FEATURE_TYPE_MAP dict already maps 'string' but is not used by _generate_feature_definition; the sets are the active code path, so the fix targets the sets.
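
To illustrate why the lookup misses, here is a minimal sketch of a case-sensitive, set-based dtype-name check. It is not the SDK code: the set contents, the constant names, and the `infer_feature_type` helper are assumptions made for this example, and the real definitions in feature_utils.py may differ.

```python
# Hypothetical stand-ins for the sets described above; the real sets live in
# feature_utils.py and may contain different names.
NUMPY_INTEGER_TYPES = {"int8", "int16", "int32", "int64"}
NUMPY_FLOAT_TYPES = {"float32", "float64"}
NULLABLE_INTEGER_TYPES = {
    "Int8", "Int16", "Int32", "Int64",
    "UInt8", "UInt16", "UInt32", "UInt64",
}
NULLABLE_FLOAT_TYPES = {"Float32", "Float64"}


def infer_feature_type(dtype_name, integer_types, float_types):
    """Map a dtype name to a feature type using case-sensitive set lookups."""
    if dtype_name in integer_types:
        return "Integral"
    if dtype_name in float_types:
        return "Fractional"
    # Anything unrecognized lands here, which is how "Int64" ends up as String.
    return "String"


# With numpy-only sets, the capitalized nullable name is not recognized.
before = infer_feature_type("Int64", NUMPY_INTEGER_TYPES, NUMPY_FLOAT_TYPES)

# After adding the nullable names, the same lookup succeeds.
after = infer_feature_type(
    "Int64",
    NUMPY_INTEGER_TYPES | NULLABLE_INTEGER_TYPES,
    NUMPY_FLOAT_TYPES | NULLABLE_FLOAT_TYPES,
)

print(before, after)  # prints: String Integral
```

Set membership on strings is exact, so "Int64" and "int64" are different keys; extending the sets is the smallest change that keeps the existing lookup structure.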

Related Issue

Related issue: 5675

Changes Made

  • sagemaker-mlops/src/sagemaker/mlops/feature_store/feature_utils.py
  • sagemaker-mlops/tests/unit/sagemaker/mlops/feature_store/test_feature_utils.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 85%
  • Classification: bug
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Collaborator

@sagemaker-bot sagemaker-bot left a comment

🤖 AI Code Review

This PR fixes a bug where pandas nullable dtypes (Int64, Float64, string) were not recognized by load_feature_definitions_from_dataframe(), causing them to incorrectly fall through to StringFeatureDefinition. The fix adds the capitalized pandas nullable dtype names to the existing type sets. The approach is correct and tests are comprehensive, but there are a few issues: the _STRING_TYPES constant is defined but never used in _generate_feature_definition, and some test lines exceed the 100-character limit.

# pandas nullable float dtypes
"Float32", "Float64",
}
_STRING_TYPES = {"object", "string"}

Bug: _STRING_TYPES is defined but never used. You've added the _STRING_TYPES = {"object", "string"} constant, but the diff doesn't show any changes to _generate_feature_definition to actually use it. Without updating the function to check dtype_name in _STRING_TYPES, the "string" dtype will still fall through to the default case. Please update _generate_feature_definition to use _STRING_TYPES for the string type check, e.g.:

elif dtype_name in _STRING_TYPES:
    return FeatureDefinition(feature_name=column, feature_type="String")

Can you confirm that the _generate_feature_definition function is also updated to use _STRING_TYPES? If not, the test_infers_string_type_with_pandas_string_dtype test would fail.
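
As a rough sketch of what an explicit string branch could look like, the snippet below uses a stand-in `FeatureDefinition` dataclass and a hypothetical `generate_feature_definition` helper. The `default_type` parameter is an illustrative assumption that does not come from the SDK; it is there only to show that, with an explicit `_STRING_TYPES` check, the result for "string" no longer depends on what the fallback branch returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    """Stand-in for the SDK class; only the two fields used here."""
    feature_name: str
    feature_type: str


# Abbreviated, hypothetical set contents for illustration.
_INTEGER_TYPES = {"int64", "Int64"}
_FLOAT_TYPES = {"float64", "Float64"}
_STRING_TYPES = {"object", "string"}


def generate_feature_definition(column, dtype_name, default_type="String"):
    """Pick a feature type by dtype name, with an explicit string branch."""
    if dtype_name in _INTEGER_TYPES:
        feature_type = "Integral"
    elif dtype_name in _FLOAT_TYPES:
        feature_type = "Fractional"
    elif dtype_name in _STRING_TYPES:
        # Explicit match: "string" and "object" never reach the fallback.
        feature_type = "String"
    else:
        # Fallback for unrecognized dtype names (illustrative parameter).
        feature_type = default_type
    return FeatureDefinition(feature_name=column, feature_type=feature_type)


explicit = generate_feature_definition("name", "string", default_type="Unknown")
fallback = generate_feature_definition("ts", "datetime64[ns]", default_type="Unknown")
print(explicit.feature_type, fallback.feature_type)  # prints: String Unknown
```

Whether the real function needs this branch depends on what its default case returns, which the diff excerpt here does not show.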

def test_infers_correct_types_after_convert_dtypes(self):
    df = pd.DataFrame({
        "id": [1, 2, 3],
        "price": [1.1, 2.2, 3.3],

These assertion lines exceed the 100-character line length limit. Consider breaking them across multiple lines for readability:

result = next(d for d in defs if d.feature_name == "nullable_float")
assert result.feature_type == "Fractional"

@@ -49,6 +49,78 @@ def test_returns_correct_count(self, sample_dataframe):
defs = load_feature_definitions_from_dataframe(sample_dataframe)
assert len(defs) == 3

Consider using pytest.mark.parametrize to consolidate the individual nullable integer dtype tests (Int8, Int16, Int32, Int64, UInt32, UInt64) into a single parameterized test. This reduces duplication and makes it easier to add new dtypes in the future:

@pytest.mark.parametrize(
    "dtype",
    ["Int8", "Int16", "Int32", "Int64", "UInt8", "UInt16", "UInt32", "UInt64"],
)
def test_infers_integral_type_with_pandas_nullable_int(self, dtype):
    df = pd.DataFrame({"id": pd.Series([1, 2, 3], dtype=dtype)})
    defs = load_feature_definitions_from_dataframe(df)
    assert defs[0].feature_type == "Integral"

Same applies to the Float32/Float64 tests.

@@ -46,8 +46,19 @@
"float64": "Fractional",
}

Minor: The _DTYPE_TO_FEATURE_TYPE_MAP dict above should also be updated to include the pandas nullable dtype mappings (e.g., "Int64": "Integral", "Float64": "Fractional", "string": "String") for consistency, even if it's not the active code path. This prevents future confusion if someone tries to use the map instead of the sets.

@aviruthen
Collaborator Author

🤖 Iteration #1 — Review Comments Addressed

Description

Fix load_feature_definitions_from_dataframe() to correctly recognize pandas nullable dtypes (Int64, Float64, string, etc.).

Problem

When a DataFrame uses pandas nullable dtypes (common after calling pd.DataFrame.convert_dtypes()), all numeric columns were incorrectly mapped to StringFeatureDefinition. This is because _INTEGER_TYPES and _FLOAT_TYPES only contained lowercase numpy dtype names (e.g., int64, float64), while pandas nullable dtypes use capitalized names (e.g., Int64, Float64).

Changes

sagemaker-mlops/src/sagemaker/mlops/feature_store/feature_utils.py:

  • Added pandas nullable dtype mappings to _DTYPE_TO_FEATURE_TYPE_MAP for consistency
  • Updated _generate_feature_definition to explicitly check _STRING_TYPES so the "string" dtype is properly handled rather than falling through to the default case
  • _INTEGER_TYPES, _FLOAT_TYPES, and _STRING_TYPES sets (already added in prior iteration) correctly include pandas nullable dtype names

sagemaker-mlops/tests/unit/sagemaker/mlops/feature_store/test_feature_utils.py:

  • Consolidated individual nullable integer dtype tests into a single pytest.mark.parametrize test covering Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64
  • Consolidated nullable float dtype tests into a single parametrized test covering Float32, Float64
  • Fixed assertion lines exceeding 100-character line length limit by breaking them across multiple lines

Note

This fix was previously applied in V2 via PR #3740 but was not carried over to the V3 (sagemaker-mlops) codebase.

Comments reviewed: 4
Files modified: sagemaker-mlops/src/sagemaker/mlops/feature_store/feature_utils.py, sagemaker-mlops/tests/unit/sagemaker/mlops/feature_store/test_feature_utils.py

  • sagemaker-mlops/src/sagemaker/mlops/feature_store/feature_utils.py: Add pandas nullable dtype support to _DTYPE_TO_FEATURE_TYPE_MAP and update _generate_feature_definition to use _STRING_TYPES
  • sagemaker-mlops/tests/unit/sagemaker/mlops/feature_store/test_feature_utils.py: Use pytest.mark.parametrize for nullable dtype tests, fix line length issues
