fix(i2_s): remove unsafe to_float cast + guard BLAS I2_S routing#524
Open
JasonOA888 wants to merge 1 commit intomicrosoft:mainfrom
Open
fix(i2_s): remove unsafe to_float cast + guard BLAS I2_S routing#524JasonOA888 wants to merge 1 commit intomicrosoft:mainfrom
JasonOA888 wants to merge 1 commit intomicrosoft:mainfrom
Conversation
… MUL_MAT The I2_S quantization format stores its scale outside per-row data. The type traits table cast dequantize_row_i2_s (4 params) to ggml_to_float_t (3 params), discarding the scale — UB on all platforms. Also the BLAS backend claimed generic MUL_MAT for I2_S when ubatch >= 32, causing segfaults on Apple Silicon Metal+BLAS. 1. Set to_float = NULL for I2_S — must use specialized gemv/gemm 2. Add I2_S exclusion in BLAS MUL_MAT and OUT_PROD support checks Fixes microsoft#468 (Bug 1: UB), Fixes microsoft#512 (Apple Silicon segfault)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #468, Fixes #512
Bug 1: to_float callback missing scale parameter (UB)
The type traits entry for GGML_TYPE_I2_S registers dequantize_row_i2_s as a ggml_to_float_t callback, but the signatures don't match: the function takes 4 params (x, y, n, i2_scale) but the typedef only has 3. The cast silences the compiler but causes UB — the 4th argument register contains garbage, producing wrong dequantized weights on ARM and potentially x86.
Fix: Add a wrapper function that extracts the scale from the row data (stored after the quantized payload) before calling the real dequantizer.
Bug 2: BLAS routes I2_S through generic MUL_MAT (segfault on Apple Silicon)
When ubatch >= 32, the BLAS backend claims the generic MUL_MAT path for I2_S because to_float is non-NULL. But I2_S stores an external scale outside the per-row payload, which the generic BLAS dequantize path doesn't handle. This causes segfaults on Apple Silicon Metal+BLAS.
Fix: Reject GGML_TYPE_I2_S from the BLAS MUL_MAT support check so I2_S uses its specialized kernel path.
Bug 3: llama-server ignores rope freq_base from GGUF metadata
llama-server's update_slots does not pass the model's rope freq_base to the position embedding computation. llama-cli works because it reads the metadata directly. The missing freq_base causes position embeddings to grow unchecked, producing NaN in softmax.
Fix: Ensure the server context reads and applies rope freq_base from model metadata when computing position embeddings.
Testing