# BitNet: LLM Quantization at its Extreme

In the race, ongoing since 2022, to build the best-performing model in terms of output quality, benchmark scores, and context length, model sizes have grown tremendously. For large language models (LLMs), one trend is clear: size matters! This straightforward approach of improving outputs by simply scaling up has led to a steep increase in both the number of model parameters and the volume of training data.

While GPT-3 (2020) was trained on about 0.5 trillion tokens with 175 billion parameters, GPT-4 is estimated to have used around 13 trillion tokens for an MoE model with (rumored) 16x111 billion parameters. Unfortunately, OpenAI does not release official figures for this. Even among models whose figures are published (Llama 2 with 2 trillion tokens and up to 70 billion parameters; PaLM 2 with 3.6 trillion tokens and 340 billion parameters; Llama 3 with 15 trillion tokens and up to 400B+ parameters), the dimensions have increased significantly, and with them the demands on storage, memory, and computational power.

To reduce this resource demand, the weights (parameters) of a model are often quantized, meaning their resolution is reduced. This is commonly done after training, in what is called post-training quantization. While this results in a slight loss of accuracy, the method is simple and allows the model to be scaled down proportionally. Microsoft took a different approach, aiming to make models inherently more efficient and less computationally intensive from the ground up. Instead of quantizing after training, their study reduces the resolution of the parameters to the absolute minimum from the very beginning: they attempted to realize the feed-forward networks using only binary (1-bit!) parameters. In this post we explain how this works and how the resulting model performs compared to a reproduced Llama model with higher weight resolution but the same parameter count.

## The drawbacks of current quantization methods

To quantize a large language model (LLM), there are two common methods: "Quantization-Aware Training" (QAT) and "Post-Training Quantization" (PTQ).

### QAT

*"Given a trained model, quantization may introduce a perturbation to the trained model parameters, and this can push the model away from the point to which it had converged when it was trained with floating point precision. It is possible to address this by re-training the NN model with quantized parameters so that the model can converge to a point with better loss. One popular approach is to use Quantization-Aware Training (QAT), in which the usual forward and backward pass are performed on the quantized model in floating point, but the model parameters are quantized after each gradient update (similar to projected gradient descent). In particular, it is important to do this projection after the weight update is performed in floating point precision. Performing the backward pass with floating point is important, as accumulating the gradients in quantized precision can result in zero-gradient or gradients that have high error, especially in low-precision"*

*"QAT has been shown to work despite the coarse approximation of STE. However, the main disadvantage of QAT is the computational cost of re-training the NN model. This re-training may need to be performed for several hundred epochs to recover accuracy, especially for low-bit precision quantization. If a quantized model is going to be deployed for an extended period, and if efficiency and accuracy are especially important, then this investment in re-training is likely to be worth it. However, this is not always the case, as some models have a relatively short lifetime."*

### PTQ

*"An alternative to the expensive QAT method is Post-Training Quantization (PTQ) which performs the quantization and the adjustments of the weights, without any fine-tuning. As such, the overhead of PTQ is very low and often negligible. Unlike QAT, which requires a sufficient amount of training data for retraining, PTQ has an additional advantage that it can be applied in situations where data is limited or unlabeled. However, this often comes at the cost of lower accuracy as compared to QAT, especially for low-precision quantization."*

*"In PTQ, all the weights and activations quantization parameters are determined without any re-training of the NN model. As such, PTQ is a very fast method for quantizing NN models. However, this often comes at the cost of lower accuracy as compared to QAT."*

Both methods therefore have significant drawbacks. QAT is costly because it requires additional fine-tuning after the pre-training phase, while PTQ is very simple but reduces the accuracy of the model by an amount that is hard to predict in advance.
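
To make the PTQ trade-off concrete, here is a minimal sketch of symmetric absmax quantization applied to a random weight matrix at different bit widths. The function names, the matrix shape, and the weight distribution are assumptions chosen for illustration; real PTQ methods such as GPTQ are considerably more sophisticated.

```python
import numpy as np

def absmax_quantize(w: np.ndarray, bits: int):
    """Symmetric absmax quantization to signed integers with `bits` bits."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax      # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the integers back to approximate floating-point weights."""
    return q.astype(np.float32) * scale

# Hypothetical "trained" weights: small Gaussian values.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

errors = {}
for bits in (8, 4, 2):
    q, scale = absmax_quantize(weights, bits)
    errors[bits] = float(np.mean((weights - dequantize(q, scale)) ** 2))
    print(f"{bits}-bit: mean squared error = {errors[bits]:.3e}")
```

In this toy setting the reconstruction error grows as the bit width shrinks, which is the accuracy cost that the quoted passages attribute to low-precision PTQ.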

## How to overcome these drawbacks?

To circumvent these drawbacks, Microsoft adopted a different approach to significantly enhance the efficiency of a transformer-based LLM, especially during the inference phase. The core idea is to eliminate the subsequent step of quantization by initially training the model with highly quantized parameters. They chose the most extreme method by using parameters with just 1-bit resolution. This model, named "BitNet", was introduced in the scientific paper "BitNet: Scaling 1-bit Transformers for Large Language Models" in October 2023.

## BitNet

BitNet follows the same structure as other transformers, using multiple stacked blocks of self-attention and feed-forward networks. However, unlike a standard transformer, BitNet replaces the conventional matrix multiplications with newly introduced BitLinear blocks, which employ binarized (1-bit) model weights.

By reducing the floating-point parameters to integer values of 1 and -1, the matrix multiplications (combinations of multiplications and additions) are simplified to mere integer additions, significantly lowering computational and energy demands.
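
The following minimal sketch illustrates why weights restricted to 1 and -1 remove the multiplications from a matrix-vector product. The helper `binary_matvec` and the example shapes are hypothetical and written for readability, not speed; it is not the kernel used in the paper.

```python
import numpy as np

def binary_matvec(w_sign: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute w_sign @ x for w_sign in {-1, +1} using only additions."""
    out = np.zeros(w_sign.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_sign):
        # Add the inputs whose weight is +1, subtract those whose weight is -1.
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
w_sign = rng.choice(np.array([-1, 1], dtype=np.int8), size=(4, 8))
x = rng.integers(-128, 128, size=8).astype(np.int32)  # values in 8-bit range

reference = w_sign.astype(np.int32) @ x   # ordinary multiply-accumulate
result = binary_matvec(w_sign, x)         # additions and subtractions only
print(result, reference)
```

Both results agree, because multiplying by 1 or -1 only decides whether an input is added or subtracted.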

However, not all parts of the transformer in BitNet are reduced to 1-bit, and there are several reasons for this. The following components remain in high precision (8-bit):

- The residual connections and layer normalizations were not changed, as they contribute negligibly to the computational costs in LLMs.
- The attention heads were also not reduced in resolution, since the computational cost of the QKV transformation is much smaller than that of the parametric projection as the model grows larger.
- The precision of the input/output embeddings was not reduced, as language models require high-precision probabilities for effective sampling.

### Model training

**STE**
The forward pass of BitNet contains non-differentiable functions (Sign and Clip). To train the model anyway, the Straight-Through Estimator (STE) is used: during backpropagation, the gradient is passed through these functions as if they were the identity.
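
As a rough illustration of the STE idea, the sketch below computes the gradient for a single binarized linear layer by hand. The function names, the squared-error loss, and the example values are assumptions for this example; the sketch omits the scaling and activation quantization of the actual BitLinear layer.

```python
import numpy as np

def forward(w_latent: np.ndarray, x: np.ndarray):
    """Binarize the latent weights with sign() and apply them to x."""
    w_bin = np.where(w_latent >= 0, 1.0, -1.0)   # sign(), mapping 0 to +1
    return w_bin @ x, w_bin

def backward_ste(grad_y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Gradient w.r.t. the latent weights using the straight-through estimator."""
    grad_w_bin = np.outer(grad_y, x)   # exact gradient w.r.t. the binarized weights
    # STE: treat d(w_bin)/d(w_latent) as 1 instead of the true derivative,
    # which is zero almost everywhere and would stop all learning.
    return grad_w_bin

w_latent = np.array([[0.3, -0.2], [-0.1, 0.4]])
x = np.array([1.0, 2.0])
target = np.array([0.0, 0.0])

y, w_bin = forward(w_latent, x)
grad_y = y - target                    # gradient of 0.5 * ||y - target||^2
grad_w = backward_ste(grad_y, x)
print(grad_w)
```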

**Mixed precision training**
Due to the highly reduced resolution of the parameters, several additional measures had to be considered for training the model.

To ensure stability and accuracy during training, the gradients and optimizer states are kept in high precision. The weights are also stored in high precision as so-called latent weights, so that parameter updates can accumulate at high resolution. During the forward pass, these latent weights are binarized on the fly, so the model is always evaluated with its 1-bit parameters; the high-precision copies are only needed for training and are not used for inference.
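
A toy sketch of why the high-precision copy matters: the update below is too small to flip a weight in one step, but because it accumulates in the latent weights, the binarized weight eventually changes. The gradient, learning rate, and starting values are hypothetical numbers picked for illustration.

```python
import numpy as np

def binarize(w_latent: np.ndarray) -> np.ndarray:
    """1-bit view of the latent weights used in the forward pass."""
    return np.where(w_latent >= 0, 1.0, -1.0)

w_latent = np.array([0.05, -0.30])   # high-precision latent weights
grad = np.array([1.0, 1.0])          # hypothetical constant gradient
lr = 0.02                            # hypothetical learning rate

history = []
for step in range(4):
    w_bin = binarize(w_latent)       # what the forward pass sees at this step
    history.append(w_bin.copy())
    # The update accumulates in high precision. If only w_bin were stored,
    # a step of 0.02 could never move a weight from +1 to -1.
    w_latent = w_latent - lr * grad

for step, w_bin in enumerate(history):
    print(step, w_bin)
```

The first weight is seen as +1 for three steps and flips to -1 in the fourth, once the accumulated updates push its latent value below zero.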

**Large learning rate**
Because the binarized weights have such low resolution, a small update to the latent weights often does not change the 1-bit weights at all, which presents a challenge for optimization. The simplest and most effective countermeasure was found to be a significantly increased learning rate. BitNet needs this to accelerate convergence and tolerates it well, whereas the FP16 Transformer diverges during training at such high learning rates.
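
The effect can be illustrated with a toy experiment that counts how many binarized weights change sign after a single update. The weight and gradient distributions and both learning rates are assumptions for this sketch and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
w_latent = rng.normal(0.0, 0.02, size=10_000)   # hypothetical latent weights
grad = rng.normal(0.0, 1.0, size=10_000)        # hypothetical gradient

def flipped_fraction(lr: float) -> float:
    """Fraction of weights whose sign changes after one SGD step."""
    updated = w_latent - lr * grad
    return float(np.mean(np.sign(updated) != np.sign(w_latent)))

small = flipped_fraction(1e-4)
large = flipped_fraction(1e-2)
print(f"lr=1e-4: {small:.2%} of 1-bit weights changed")
print(f"lr=1e-2: {large:.2%} of 1-bit weights changed")
```

With the small learning rate almost none of the 1-bit weights change, so the forward pass barely notices the update; the larger learning rate changes a much greater share of them.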

### Comparison to FP16 Transformer

To evaluate the performance and efficiency gains of this concept, BitNet was compared to a Llama-based FP16 Transformer. To ensure a fair comparison, both models used the same training data, parameter counts, tokenizer, and vocabulary size.

The results show that the loss of the BitNet model can also be predicted accurately by a power law. They further show that scaling up the BitNet model brings its loss close to that of the FP16 Transformer. Because BitNet was designed to strongly increase efficiency during inference, the loss was also examined as a function of the energy required for inference. BitNet achieves the same loss and accuracy levels at a much lower energy consumption than the FP16 Transformer.

BitNet accuracy vs. energy consumption of zero-shot and few-shot inference

### Comparison to PTQ methods

The BitNet model was also compared to its direct competitors, the state-of-the-art post-training quantization methods. These include weight-only quantization (GPTQ and QuIP) as well as weight-and-activation quantization (Absmax and SmoothQuant), each at different quantization levels.

*"The results demonstrate the effectiveness of BitNet in achieving competitive performance levels compared to the baseline approaches, particularly for lower bit levels. The zero-shot scores of BitNet are comparable with the 8-bit models, while the inference cost is much lower. For the 4-bit models, the weight-only quantization methods outperform the weight-and-activation quantizers, mainly because the activation is more difficult to quantify. BitNet, as a 1-bit model, significantly achieves better results than both the weight-and-activation quantization methods and the weight-only methods. As for the lower-bit models, BitNet has consistently superior scores over all baselines. This proves the advantages of the quantization-aware training approaches over the post-training quantization methods."*

BitNet Benchmarks compared to PTQ methods at various quantization levels

*"Figure 6 summarizes both the zero-shot accuracy and few-shot accuracy of our method and the baselines while scaling up the model size from 1.3B to 6.7B. It proves that the advantage is consistent across different scales."*

BitNet zero-shot and few-shot accuracy compared to common PTQ methods

## BitNet b1.58 - the next step toward efficient LLMs

Building on the BitNet model, Microsoft further developed the concept in the next iteration by using ternary weights with the possible values {-1, 0, 1}. A weight with three possible states carries log2(3) ≈ 1.58 bits of information, which is why this model was named 'BitNet b1.58'.
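
The 1.58 figure can be checked with a short calculation. The packing helpers below are a hypothetical illustration of how close a byte-based encoding can get to that bound; they are not the storage format used by the authors.

```python
import math

# Information content of one weight with three equally likely states.
bits_per_ternary_weight = math.log2(3)
print(f"{bits_per_ternary_weight:.3f} bits per weight")

def pack5(trits):
    """Pack five ternary weights (-1, 0, 1) into one byte, since 3**5 = 243 <= 256."""
    value = 0
    for t in trits:
        value = value * 3 + (t + 1)   # map -1, 0, 1 to the base-3 digits 0, 1, 2
    return value

def unpack5(value):
    """Inverse of pack5: recover the five ternary weights from one byte."""
    trits = []
    for _ in range(5):
        value, digit = divmod(value, 3)
        trits.append(digit - 1)
    return trits[::-1]

packed = pack5([1, -1, 0, 0, 1])
print(packed, unpack5(packed))
```

Packing five weights per byte costs 8 / 5 = 1.6 bits per weight, only slightly above the theoretical 1.58 bits.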

The inclusion of 0 in the model weights significantly enhances its modeling capability by explicitly supporting feature filtering, which can greatly improve the performance of 1-bit LLMs. BitNet b1.58 maintains the same architecture as BitNet but requires some adjustments to the quantization function due to the ternary weights.
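
One way to picture the adjusted quantization function is the sketch below. It assumes an absmean scheme (scale by the mean absolute weight, round, then clip to [-1, 1]), which is our reading of the BitNet b1.58 paper and is not spelled out in this post; the function name, the `eps` value, and the example matrix are hypothetical.

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-6):
    """Quantize weights to {-1, 0, 1} using the mean absolute value as scale."""
    gamma = np.mean(np.abs(w))                       # per-tensor scale (assumed scheme)
    w_q = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_q.astype(np.int8), gamma

w = np.array([[0.9, -0.05, 0.4],
              [-1.2, 0.02, -0.3]])
w_q, gamma = absmean_ternary(w)
print(w_q)
print(f"scale gamma = {gamma:.4f}")
```

Weights that are small relative to the average magnitude are mapped to 0, which is the feature filtering described above: the corresponding inputs are simply ignored.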

With this model, the study authors were able to achieve extraordinary results. By reducing the usual 16-bit floating-point values in the neural network to just these three states, the matrix calculations no longer require costly multiplications, only integer additions of the input values.

'BitNet b1.58' Pareto Improvement

The study compared BitNet b1.58 with a reproduced FP16 LLaMA LLM. Despite this significant reduction, BitNet b1.58 was able to achieve almost identical results in various benchmarks as the LLaMA model.

This quantization is only performed in the FFNs and not in other parts of the Transformer, which means that the attention layers, etc., are computed at higher resolutions. Thus, the size of the model does not shrink in direct proportion to the reduction in resolution, but the effect becomes more significant for models with more parameters. For a model with 3B parameters, the required memory is already reduced by a factor of 3.55. For a 70B model, the memory requirement is reduced by a factor of 7.16 and inference is 4.1 times faster. Moreover, the 70B model achieves a throughput of 2977 tokens/s, which corresponds to an 8.9-fold increase.

This significantly reduced computational effort means not only a much smaller memory footprint and higher throughput but also a significant reduction in the energy required. According to the study, the BitNet b1.58 model with 70B parameters reduces the end-to-end energy consumption by a factor of 41.2 compared to the LLaMA 70B model.

'BitNet b1.58' energy consumption

'BitNet b1.58' resource reduction and speed improvements

The improvements achieved through this approach are so significant that, from a resource perspective, BitNet models can compete with LLaMA models with significantly fewer parameters. For example, it is stated that:

- *13B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 3B FP16 LLM.*
- *30B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 7B FP16 LLM.*
- *70B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 13B FP16 LLM.*

## Conclusion

The ongoing competition to achieve better benchmark scores and higher output quality has led to increasing model sizes, resulting in a significant rise in resource requirements for training and inference of these models. This escalation leads to higher costs, lower throughput, and increased energy consumption. To mitigate this, models are quantized, which curbs their size and consequently their resource demand. Typically, these models are adjusted either with a Quantization-Aware Training (QAT) method, which further complicates the training process, or through simpler Post-Training Quantization (PTQ) methods, which can slightly degrade their quality.

With the BitNet models, Microsoft has demonstrated that these resource savings can be achieved without additional post-training effort and without loss of performance, provided the quantization is built into the model's structure from the start.

They have even shown that their models can operate at the same quality level as a comparable FP16 LLM while having reduced energy and memory requirements, increased throughput, and reduced latency. Furthermore, it has been demonstrated that this solution scales well, with the highlighted benefits increasing as the model size grows.

These advantages also open up further application possibilities. For example, they enable easier deployment of LLMs on edge or mobile devices. Additionally, combined with other techniques such as Mixture-of-Experts (MoE) or hardware tailored to 1-bit LLMs, such models could further push the performance boundaries of LLMs.