How does my CodeQuantBenchmark tool work?
I'm happy to share with you my latest project: CodeQuantBenchmark, a toolkit I developed to automate and standardize benchmarking of quantized models on code tasks.
This tool is designed to take you from raw data to meaningful metrics without having to glue a bunch of scripts together every time. It combines data preprocessing, optional fine-tuning support, multi-precision quantization workflows, and detailed benchmarking into a single pipeline. Here’s how it works:
1. Preparing the data
First, the toolkit loads your dataset of code examples.
Depending on the task, it can extract structural information from the code, such as Abstract Syntax Trees (ASTs), so that evaluations aren't limited to plain-text matching.
CodeQuantBenchmark uses AST extraction to generate structural representations that can be fed into evaluation or prediction workflows, making it easier to analyze how models handle syntax and semantics in code.
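To make that concrete, here is a minimal sketch of AST-based feature extraction using Python's standard `ast` module. The function name and output format are my own illustration, not the toolkit's actual API:

```python
import ast

def extract_ast_features(code: str) -> dict:
    """Parse a code snippet and return a simple structural summary.

    Illustrative sketch only; CodeQuantBenchmark's real extraction
    may produce a different representation.
    """
    tree = ast.parse(code)
    node_types = [type(node).__name__ for node in ast.walk(tree)]
    return {
        "num_nodes": len(node_types),
        # How often each syntactic construct appears in the snippet.
        "node_counts": {t: node_types.count(t) for t in set(node_types)},
        # Names of top-level and nested function definitions.
        "functions": [n.name for n in ast.walk(tree)
                      if isinstance(n, ast.FunctionDef)],
    }

features = extract_ast_features("def add(a, b):\n    return a + b")
```

Features like these let an evaluation check whether a generated snippet is structurally similar to a reference, rather than just comparing raw strings.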
2. Optional fine-tuning
Once data is ready, you can choose to fine-tune the model on your specific dataset.
This is useful if you want the model better adapted to your data before you quantize or benchmark it.
The project includes utility scripts and examples to run a QLoRA-style fine-tuning, integrating low-rank adaptation on top of base models so you can improve accuracy prior to quantization.
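As a rough illustration, a QLoRA-style setup with the Hugging Face `peft` and `bitsandbytes` libraries combines a 4-bit base model with small trainable low-rank adapters. The repository's own scripts may be organized differently, so treat the hyperparameters below as placeholders:

```python
# Configuration sketch for QLoRA-style fine-tuning.
# All hyperparameter values here are hypothetical placeholders.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Load the frozen base model in 4-bit NF4 precision to cut training memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Train small low-rank adapters on top of the quantized weights.
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

The design idea is that only the adapters are updated, so fine-tuning fits on modest hardware, and the adapted model can then be merged and quantized for benchmarking.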
3. Quantization
Quantization reduces the numerical precision of a model's weights to save memory and speed up inference.
CodeQuantBenchmark supports converting models into multiple quantized formats, for example, different GGUF precisions like Q2 up to Q8.
This is done by integrating existing converters and supporting quantization strategies such as dynamic scaling and quantization-error minimization, so you can compare quantization levels systematically.
By producing quantized checkpoints in standardized formats, you can then run the same evaluation scripts across them to see how precision affects model behavior.
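The core idea behind dynamic scaling can be sketched in a few lines: derive a per-tensor scale from the largest weight magnitude, round to integers, and measure the reconstruction error. This is a toy illustration, not the actual converter; real GGUF formats use block-wise schemes with per-block scales:

```python
def quantize_symmetric(weights, bits=8):
    # Dynamic scaling: derive the scale from the tensor's max magnitude.
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax
    quantized = [round(w / scale) for w in weights]
    dequantized = [q * scale for q in quantized]
    # Mean squared error between original and reconstructed weights.
    mse = sum((w - d) ** 2 for w, d in zip(weights, dequantized)) / len(weights)
    return quantized, mse

weights = [0.12, -0.53, 0.08, 0.91, -0.27]
_, err_q8 = quantize_symmetric(weights, bits=8)
_, err_q2 = quantize_symmetric(weights, bits=2)
# Lower precision trades a larger reconstruction error for a smaller footprint.
```

Running the same comparison across Q2 through Q8 checkpoints is exactly the kind of systematic sweep the pipeline is built for.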
4. Running benchmarks
With a quantized model ready, the tool runs evaluations using a variety of metrics:
• Quality metrics such as syntax validity, BLEU, or Jaccard similarity for comparing generated code to references
• Latency measures like tokens per millisecond to assess inference speed
• Memory usage tracking during inference to observe footprint and peak-RSS
These metrics help you understand the trade-offs between efficiency, model size and output quality, which is especially important when comparing different quant levels or formats.
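Simplified versions of these metrics are easy to sketch. The helpers below are illustrative stand-ins, not the toolkit's exact implementations:

```python
import ast

def syntax_valid(code: str) -> bool:
    # Quality: does the generated Python code even parse?
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def jaccard_similarity(generated: str, reference: str) -> float:
    # Quality: token-set overlap between generation and reference.
    a, b = set(generated.split()), set(reference.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def tokens_per_ms(num_tokens: int, elapsed_seconds: float) -> float:
    # Latency: throughput expressed in tokens per millisecond.
    return num_tokens / (elapsed_seconds * 1000)
```

Memory tracking would sit alongside these, sampling the process's peak RSS during generation.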
CodeQuantBenchmark also provides unit and integration tests that serve as optional reference points to validate your pipelines.
5. Reporting results
As benchmarks complete, results are gathered and visualized.
The toolkit produces plots and structured logs so you can easily compare experiments or reproduce them later.
Everything is designed to be modular, so you can add your own metrics, evaluation tasks, or export formats if needed. This makes it easy to create reproducible workflows for research or comparative experimentation.
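As a sketch of what structured, reproducible logging can look like (the schema and field names here are hypothetical, not the toolkit's actual output format):

```python
import json

# Hypothetical per-run results; field names and values are illustrative only.
results = [
    {"quant": "Q2_K", "jaccard": 0.61, "tokens_per_ms": 2.10, "peak_mb": 1800},
    {"quant": "Q8_0", "jaccard": 0.83, "tokens_per_ms": 1.40, "peak_mb": 4300},
]

def summarize(runs):
    # Plain-text comparison table, one row per quantization level.
    lines = [f"{'quant':<8}{'jaccard':>10}{'tok/ms':>10}{'peak MB':>10}"]
    for r in runs:
        lines.append(
            f"{r['quant']:<8}{r['jaccard']:>10.2f}"
            f"{r['tokens_per_ms']:>10.2f}{r['peak_mb']:>10}"
        )
    return "\n".join(lines)

# A JSON log round-trips cleanly, so later runs can be re-analyzed or re-plotted.
log = json.dumps(results, indent=2)
table = summarize(results)
```

Because each run is a plain record, adding a new metric is just adding a field, which is what keeps the pipeline modular.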
Quick summary & technologies used
In short: you set up the pipeline, provide data and model checkpoints, run the benchmarks, and get a comprehensive view of model performance under quantization.
Technologies used:
• Python for scripting and orchestration
• AST extraction tools for structured code features
• Quantization conversion utilities supporting multi-precision formats
• Metrics modules for quality and performance
• Plotting tools for reports and summary visualizations
• An Unsloth notebook for the generated .ipynb
Link to the project: https://github.com/AntoineChatry/CodeQuantBenchmark
I hope this article gives you a clear overview of how CodeQuantBenchmark works and how it can help with reproducible model benchmarking!