Skip to content

zerfoo/float8

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

float8

Go Reference License

FP8 E4M3FN arithmetic library for Go, commonly used in quantized ML inference.

Part of the Zerfoo ML ecosystem.

Features

  • IEEE 754 FP8 E4M3FN format — 1 sign, 4 exponent, 3 mantissa bits
  • Fast lookup tables — optional pre-computed tables for arithmetic and conversion
  • Full arithmetic — add, subtract, multiply, divide, sqrt, abs, neg
  • No infinities — the E4M3FN variant uses the infinity encoding for additional finite values
  • Zero dependencies — pure Go, no CGo

Installation

go get github.com/zerfoo/float8

Requires Go 1.26+.

Quick Start

package main

import (
    "fmt"
    "github.com/zerfoo/float8"
)

func main() {
    a := float8.FromFloat32(3.14)
    b := float8.FromFloat32(2.71)

    sum := a.Add(b)
    product := a.Mul(b)

    fmt.Printf("a = %f\n", a.ToFloat32())
    fmt.Printf("a + b = %f\n", sum.ToFloat32())
    fmt.Printf("a * b = %f\n", product.ToFloat32())
}

Format

Field Bits Description
Sign 1 0 = positive, 1 = negative
Exponent 4 Biased by 7, range [-6, 7]
Mantissa 3 3 explicit + 1 implicit leading bit

Special values: ±0 (exp=0, mant=0), NaN (exp=1111, mant=111). No infinities.

Performance Modes

// Enable lookup tables for faster arithmetic (trades memory for speed)
float8.EnableFastArithmetic()
float8.EnableFastConversion()

Used By

  • ztensor — GPU-accelerated tensor library

License

Apache 2.0

About

FP8 (E4M3FN) arithmetic library for Go. Configurable precision/range trade-offs for quantized ML inference and memory-constrained applications.

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors