fix: load_feature_definitions_from_dataframe() doesn't recognize pandas nullable dtyp (5675)#5732
Conversation
…as nullable dtyp (5675)
sagemaker-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR fixes a bug where pandas nullable dtypes (Int64, Float64, string) were not recognized by load_feature_definitions_from_dataframe(), causing them to incorrectly fall through to StringFeatureDefinition. The fix adds the capitalized pandas nullable dtype names to the existing type sets. The approach is correct and tests are comprehensive, but there are a few issues: the _STRING_TYPES constant is defined but never used in _generate_feature_definition, and some test lines exceed the 100-character limit.
| # pandas nullable float dtypes | ||
| "Float32", "Float64", | ||
| } | ||
| _STRING_TYPES = {"object", "string"} |
There was a problem hiding this comment.
Bug: _STRING_TYPES is defined but never used. You've added the _STRING_TYPES = {"object", "string"} constant, but the diff doesn't show any changes to _generate_feature_definition to actually use it. Without updating the function to check dtype_name in _STRING_TYPES, the "string" dtype will still fall through to the default case. Please update _generate_feature_definition to use _STRING_TYPES for the string type check, e.g.:
elif dtype_name in _STRING_TYPES:
return FeatureDefinition(feature_name=column, feature_type="String")Can you confirm that the _generate_feature_definition function is also updated to use _STRING_TYPES? If not, the test_infers_string_type_with_pandas_string_dtype test would fail.
| def test_infers_correct_types_after_convert_dtypes(self): | ||
| df = pd.DataFrame({ | ||
| "id": [1, 2, 3], | ||
| "price": [1.1, 2.2, 3.3], |
There was a problem hiding this comment.
These assertion lines exceed the 100-character line length limit. Consider breaking them across multiple lines for readability:
result = next(d for d in defs if d.feature_name == "nullable_float")
assert result.feature_type == "Fractional"| @@ -49,6 +49,78 @@ def test_returns_correct_count(self, sample_dataframe): | |||
| defs = load_feature_definitions_from_dataframe(sample_dataframe) | |||
| assert len(defs) == 3 | |||
|
|
|||
There was a problem hiding this comment.
Consider using pytest.mark.parametrize to consolidate the individual nullable integer dtype tests (Int8, Int16, Int32, Int64, UInt32, UInt64) into a single parameterized test. This reduces duplication and makes it easier to add new dtypes in the future:
@pytest.mark.parametrize("dtype", ["Int8", "Int16", "Int32", "Int64", "UInt8", "UInt16", "UInt32", "UInt64"])
def test_infers_integral_type_with_pandas_nullable_int(self, dtype):
df = pd.DataFrame({"id": pd.Series([1, 2, 3], dtype=dtype)})
defs = load_feature_definitions_from_dataframe(df)
assert defs[0].feature_type == "Integral"Same applies to the Float32/Float64 tests.
| @@ -46,8 +46,19 @@ | |||
| "float64": "Fractional", | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
Minor: The _DTYPE_TO_FEATURE_TYPE_MAP dict above should also be updated to include the pandas nullable dtype mappings (e.g., "Int64": "Integral", "Float64": "Fractional", "string": "String") for consistency, even if it's not the active code path. This prevents future confusion if someone tries to use the map instead of the sets.
🤖 Iteration #1 — Review Comments AddressedDescriptionFix ProblemWhen a DataFrame uses pandas nullable dtypes (common after calling Changes
NoteThis fix was previously applied in V2 via PR #3740 but was not carried over to the V3 (sagemaker-mlops) codebase. Comments reviewed: 4
|
Description
The issue is in
sagemaker-mlops/src/sagemaker/mlops/feature_store/feature_utils.py. The_INTEGER_TYPESand_FLOAT_TYPESsets only contain lowercase numpy dtype names (e.g., 'int64', 'float64'). Pandas nullable dtypes use capitalized names (e.g., 'Int64', 'Float64', 'string') and are not matched, causing all nullable-typed columns to fall through toStringFeatureDefinition. The fix is to add pandas nullable dtype names to_INTEGER_TYPESand_FLOAT_TYPES, and also add 'string' to the string-type handling in_generate_feature_definition. The referenced PR #3740 fixed this in V2 but the fix was not carried over to the V3 (sagemaker-mlops) codebase. Additionally, the_DTYPE_TO_FEATURE_TYPE_MAPdict already has 'string' mapped but is not used by_generate_feature_definition; however the sets approach is the active code path, so we fix the sets.Related Issue
Related issue: 5675
Changes Made
sagemaker-mlops/src/sagemaker/mlops/feature_store/feature_utils.pysagemaker-mlops/tests/unit/sagemaker/mlops/feature_store/test_feature_utils.pyAI-Generated PR
This PR was automatically generated by the PySDK Issue Agent.
Merge Checklist
prefix: descriptionformat