
Floating point quantization custom op and datatypes #182


Open · wants to merge 28 commits into main

Conversation

@maltanar (Collaborator) commented May 8, 2025

This PR merges #180, #178 and #159, which together introduce the following features:

  • Human-readable specification for the FloatQuant custom op for arbitrary-precision floating-point quantization
  • Implementation of the FloatQuant custom op, with unit tests for its QONNX execution
  • Arbitrary-precision float QONNX datatype annotations, with unit tests
  • Renaming of the old Quant op to IntQuant (with backwards compatibility for now)

The spec for FloatQuant was put together by myself, Ian Colbert, Nicholas Fraser, Giuseppe Franco and Jakoba Petri-Koenig from AMD. Most of the QONNX-side contributions here are by @nghielme and @ebby-s, with additional review by Shane Fleming (AMD), so big kudos to them.
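
For readers unfamiliar with minifloat quantization, the sketch below illustrates the core idea in NumPy: saturate each value into the representable range (so `±inf` clips to `±max_val`), clamp its exponent, and round its mantissa to the requested number of bits. This is an illustrative sketch only, not the QONNX implementation; the function name and the IEEE-like normal/subnormal exponent handling are assumptions.

```python
import numpy as np

def float_quantize(x, exp_bits, man_bits, exp_bias):
    """Illustrative sketch: round x to the nearest value representable
    with the given exponent/mantissa bitwidths and exponent bias.
    NaN propagates, +/-inf saturates to +/-max_val, zero stays zero."""
    # Largest representable magnitude (all-ones exponent, full mantissa).
    max_exp = (2 ** exp_bits) - 1 - exp_bias
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    sign = np.sign(x)
    mag = np.minimum(np.abs(x), max_val)  # saturate; also clips +/-inf
    # Per-element exponent, clamped to the normal range; values below the
    # smallest normal land on the subnormal grid instead.
    min_exp = 1 - exp_bias  # assumption: IEEE-like smallest normal exponent
    e = np.floor(np.log2(np.maximum(mag, np.finfo(np.float64).tiny)))
    e = np.clip(e, min_exp, max_exp)
    # Round the mantissa to man_bits fractional bits at that exponent.
    scale = 2.0 ** (e - man_bits)
    return sign * np.round(mag / scale) * scale

# e.g. quantizing to a 4-bit-exponent, 4-bit-mantissa format with bias 7:
print(float_quantize(np.array([0.3125, np.inf, 0.0]), 4, 4, exp_bias=7))
```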

maltanar and others added 28 commits July 31, 2024 10:15
…on can be found in the `Examples`. `±inf` are clipped to `±max_val`. `±NaN` are mapped to `±NaN`. Zero is always representable. I tested with subnormals (meaning subnormals of the output representation) and the quantizer represented them with no loss, though I didn't test this part extensively. I tested the function against the Brevitas `FloatQuant` implementation: they do not always match. For example, I think `0.3125` should be representable (`x == xq`) by a float quantizer with 4 bits of mantissa, 4 bits of exponent, 0 bias and 1 sign bit, but the Brevitas `FloatQuant` implementation quantizes it to `0.25`. Not sure which I should consider correct for this case.
Co-authored-by: Nicolo Ghielmetti <nicolo.ghielmetti@gmail.com>
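
On the `0.3125` question above: whether it is representable hinges on how a zero exponent bias is interpreted. Since `0.3125 = 0b1.01 * 2**-2`, it needs only 2 mantissa bits but an exponent of -2. A hedged sketch of the two readings:

```python
# 0.3125 = 0b1.01 * 2**-2: two mantissa bits, exponent -2.
x = 0.3125

# Reading 1: exponents down to -2 are reachable, so x is exactly representable.
assert 1.25 * 2.0 ** -2 == x

# Reading 2 (IEEE-like encoding): with bias 0 the smallest *normal* exponent
# is 1, so x falls on the subnormal grid with spacing 2**(1 - 4) = 0.125
# and rounds to 0.25, which is exactly what Brevitas returns.
spacing = 2.0 ** (1 - 4)
print(round(x / spacing) * spacing)  # 0.25
```

Under the second reading Brevitas's `0.25` is self-consistent, so the mismatch looks like a convention choice rather than a bug, which may be why the later commits bring the two implementations into agreement.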
… quantization logic. Now QONNX and Brevitas float quantizers match.
…ion. Default exponent bias is now computed if not provided, and tests have been added to compare QONNX and Brevitas float quantization outputs.
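
The commit above does not spell out how the default exponent bias is computed, but the conventional IEEE-style choice (assumed here) is:

```python
def default_exponent_bias(exponent_bitwidth: int) -> int:
    # Assumed IEEE-style default: half the exponent range, minus one.
    return 2 ** (exponent_bitwidth - 1) - 1

assert default_exponent_bias(5) == 15    # matches FP16's bias
assert default_exponent_bias(8) == 127   # matches FP32's bias
```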
A first sample version of `FloatQuant`
Merged with float_quant, added exp bias, implemented FloatQuant.infer_node_datatype()