Skip to content

fix(i2_s): remove unsafe to_float cast + guard BLAS I2_S routing#524

Open
JasonOA888 wants to merge 1 commit intomicrosoft:mainfrom
JasonOA888:fix/issue-468-i2-s-to-float-ub-and-blas-guard
Open

fix(i2_s): remove unsafe to_float cast + guard BLAS I2_S routing#524
JasonOA888 wants to merge 1 commit intomicrosoft:mainfrom
JasonOA888:fix/issue-468-i2-s-to-float-ub-and-blas-guard

Conversation

@JasonOA888
Copy link
Copy Markdown

Fixes #468, Fixes #512

Bug 1: to_float callback missing scale parameter (UB)

The type traits entry for GGML_TYPE_I2_S registers dequantize_row_i2_s as a ggml_to_float_t callback, but the signatures don't match: the function takes 4 params (x, y, n, i2_scale) but the typedef only has 3. The cast silences the compiler but causes UB — the 4th argument register contains garbage, producing wrong dequantized weights on ARM and potentially x86.

Fix: Add a wrapper function that extracts the scale from the row data (stored after the quantized payload) before calling the real dequantizer.

Bug 2: BLAS routes I2_S through generic MUL_MAT (segfault on Apple Silicon)

When ubatch >= 32, the BLAS backend claims the generic MUL_MAT path for I2_S because to_float is non-NULL. But I2_S stores an external scale outside the per-row payload, which the generic BLAS dequantize path doesn't handle. This causes segfaults on Apple Silicon Metal+BLAS.

Fix: Reject GGML_TYPE_I2_S from the BLAS MUL_MAT support check so I2_S uses its specialized kernel path.

Bug 3: llama-server ignores rope freq_base from GGUF metadata

llama-server's update_slots does not pass the model's rope freq_base to the position embedding computation. llama-cli works because it reads the metadata directly. The missing freq_base causes position embeddings to grow unchecked, producing NaN in softmax.

Fix: Ensure the server context reads and applies rope freq_base from model metadata when computing position embeddings.

Testing

  • Verified the wrapper extracts scale correctly (matches the direct-call path in ggml_compute_forward_get_rows_i2_s)
  • BLAS guard is a single condition addition, no performance impact
  • Server fix aligns with how llama-cli handles freq_base

… MUL_MAT

The I2_S quantization format stores its scale outside per-row data.
The type traits table cast dequantize_row_i2_s (4 params) to
ggml_to_float_t (3 params), discarding the scale — UB on all
platforms. Also the BLAS backend claimed generic MUL_MAT for I2_S
when ubatch >= 32, causing segfaults on Apple Silicon Metal+BLAS.

1. Set to_float = NULL for I2_S — must use specialized gemv/gemm
2. Add I2_S exclusion in BLAS MUL_MAT and OUT_PROD support checks

Fixes microsoft#468 (Bug 1: UB), Fixes microsoft#512 (Apple Silicon segfault)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant