Your Model Is a Compressor
Machine Learning models are systems that learn patterns from data. This perspective was dominant in the era of traditional ML models such as Logistic Regression and Random Forests. Deep Learning, however, has significantly expanded what ML models can do: they are not only optimized for a target metric but also, unintentionally, serve as information compression systems for their training data.
For example, training a model like LLaMA 7B effectively compresses ~4.7 TB of training data into ~13 GB of model weights.

Figure: inputs and outputs of model training (example: LLaMA 7B).
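A back-of-the-envelope calculation (using the figures above; the arithmetic, not the numbers, is mine) makes the scale concrete:

```latex
\frac{13\ \text{GB (weights)}}{4.7\ \text{TB (training data)}} \approx 0.0028 \approx 0.28\%
```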
ML and Information Theory
From an information theory standpoint, ML models aim to minimize the description length of the training data, following the Minimum Description Length (MDL) principle. They distill the input information and the intermediate representations (features) derived from it into a space constrained by the model’s architecture and the size of its trainable weights.
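In MDL terms, training searches for the model that minimizes a two-part code length, written here in standard (assumed) notation:

```latex
M^{*} \;=\; \arg\min_{M} \; \bigl[\, L(M) \;+\; L(D \mid M) \,\bigr]
```

where $L(M)$ is the number of bits needed to describe the model itself and $L(D \mid M)$ the number of bits needed to describe the training data $D$ given the model.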
From a deep learning perspective, each layer is conceptualized as transforming the data into progressively more abstract and compact representations. This process filters out redundant or irrelevant information, preserving only what is essential for the task at hand.
Essentially, ML models leverage information compression as an intrinsic mechanism to enhance their ability to generalize to unseen data. By reducing their uncertainty about the output given the input, they effectively compress the information required to describe the mapping from inputs to outputs.
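Concretely, a model with predictive distribution $q(y \mid x)$ assigns each output a code of length $-\log_2 q(y \mid x)$ bits, and its expected code length is lower-bounded by the true conditional entropy, a standard information-theoretic identity (notation mine):

```latex
\mathbb{E}_{(x, y) \sim p}\!\left[ -\log_2 q(y \mid x) \right] \;\ge\; H(Y \mid X)
```

with equality exactly when the model matches the true distribution, so shorter codes and better generalization go hand in hand.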
Quantifying Compression Efficiency
The connection between model performance and compression is formalized through cross-entropy, the fundamental metric that measures how well a model’s learned distribution matches the true data distribution. Minimizing cross-entropy during training is mathematically equivalent to learning the most efficient encoding scheme for the data distribution. Lower cross-entropy means better prediction, which directly translates to better compression.
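The equivalence follows from a textbook decomposition of cross-entropy (a standard identity, not specific to this article): the bits a model spends split into the data’s irreducible entropy plus a penalty for model mismatch:

```latex
H(p, q) \;=\; H(p) \;+\; D_{\mathrm{KL}}(p \,\|\, q)
```

Since $H(p)$ is fixed by the data, minimizing cross-entropy can only shrink the KL term, i.e. the excess bits over the best possible code.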
This theoretical relationship shows up in practice as bits per byte (or bits per character). When a language model achieves a cross-entropy of 1.5 bits per character, it can, paired with an entropy coder such as arithmetic coding, compress that text to about 1.5 bits per character. State-of-the-art language models achieving 2-3 bits per byte therefore compress text more efficiently than generic algorithms like Gzip (typically 4-5 bits per byte).
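As a sketch, the bits-per-byte figure can be measured directly from a causal LM’s negative log-likelihood. This assumes the Hugging Face transformers and torch libraries; GPT-2 is purely an illustrative choice, not a model discussed above:

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(text: str, model_name: str = "gpt2") -> float:
    """Estimate how many bits per byte a causal LM needs to encode `text`."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # (in nats) over the predicted tokens.
        loss = model(ids, labels=ids).loss.item()

    n_predicted = ids.shape[1] - 1                 # first token has no prediction
    total_bits = loss * n_predicted / math.log(2)  # nats -> bits
    return total_bits / len(text.encode("utf-8"))

print(bits_per_byte("Your model is a compressor. " * 20))
```

An ideal entropy coder driven by the same model would achieve essentially this size in practice.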
LLMs as General-Purpose Compressors
Recent research findings from DeepMind demonstrate that LLMs can outperform standard and widely-used compression algorithms in terms of final data size. Remarkably, even models initially trained solely on text have been shown to function as general-purpose lossless compressors:
- audio is reduced to 16.4% of its original size, outperforming FLAC;
- images are compressed to 43.4%, surpassing PNG.
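The mechanism behind such results is generic: any probabilistic model plus an entropy coder is a lossless compressor, and the near-achievable size is the model’s total negative log-probability of the data. Below is a minimal sketch of that mechanism with a Laplace-smoothed byte-frequency "model" compared against zlib; all names are illustrative and this is not the DeepMind setup:

```python
import math
import zlib
from collections import Counter

def ideal_model_bits(data: bytes) -> float:
    """Ideal code length (bits) under an adaptive order-0 byte model.

    Each byte costs -log2 q(byte), where q is a Laplace-smoothed frequency
    estimate built from the bytes seen so far -- the size an arithmetic
    coder driven by this model would (nearly) achieve.
    """
    counts = Counter()
    total_bits = 0.0
    for i, b in enumerate(data):
        q = (counts[b] + 1) / (i + 256)  # Laplace smoothing over 256 symbols
        total_bits += -math.log2(q)
        counts[b] += 1
    return total_bits

data = ("Your model is a compressor. " * 200).encode("utf-8")
model_bpb = ideal_model_bits(data) / len(data)
gzip_bpb = 8 * len(zlib.compress(data, 9)) / len(data)
print(f"order-0 model: {model_bpb:.2f} bits/byte, zlib: {gzip_bpb:.2f} bits/byte")
```

On repetitive text like this, zlib’s match-based coding wins easily; it is the far richer context modeling of an LLM that pushes the per-byte cost below generic compressors.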
Additionally, previous studies show that DNNs of various architectures (autoencoders, feedforward networks, transformers) can perform both lossy and lossless compression.
The Reverse: Gzip for Classification
Exploring the intersection of compression and machine learning from the opposite direction has produced some interesting insights.
The “Less is More: Parameter-Free Text Classification with Gzip” paper uses the Gzip compression algorithm together with a simple k-nearest-neighbor classifier to tackle text classification. The idea, which had been circulating in the tech community for a while, demonstrates the connection between compression and generalization.
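The core of the method fits in a few lines: a compressor-derived distance (normalized compression distance) plus k-nearest-neighbor voting. A minimal sketch follows, with zlib standing in for gzip; the training examples and the space-joined concatenation are illustrative choices, not taken from the paper:

```python
import zlib
from collections import Counter

def clen(s: str) -> int:
    """Compressed length of a string, a proxy for its Kolmogorov complexity."""
    return len(zlib.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when a and b share structure."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(text: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Label `text` by majority vote among its k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: ncd(text, ex[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [
    ("the striker scored in the final minute", "sports"),
    ("the team won the championship game", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("stock markets fell on inflation fears", "finance"),
]
print(classify("the goalkeeper saved a late penalty", train, k=3))
```

No parameters are trained anywhere: the compressor supplies the similarity structure, which is exactly the sense in which compression substitutes for learned representations here.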