MambaByte, Multimodal Pathway, and CrossMAE

Here’s what caught our eye last week in AI.

MambaByte

LLMs are trained on tokenized data. Tokenization is the process of breaking text down into smaller units, which could be words, sub-words, or characters. Decomposing into sub-words is the most common approach, as it offers a good tradeoff between generalization to new vocabulary and sequence length (the shorter the tokens, the more of them are needed to represent a given piece of text). Sequence length matters because most LLMs are built on the transformer architecture, whose self-attention mechanism has a computational complexity that scales quadratically with sequence length.
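
As a rough illustration of this tradeoff (a minimal sketch: the whitespace split below is only a crude stand-in for a real sub-word tokenizer, and the example sentence is ours), the same text yields very different sequence lengths at different granularities:

    # Compare sequence lengths at different tokenization granularities.
    text = "Tokenization affects sequence length."

    words = text.split()                     # crude stand-in for word-level tokens
    chars = list(text)                       # character-level tokens
    raw_bytes = list(text.encode("utf-8"))   # byte-level "tokens", as in MambaByte

    print(len(words))      # 4
    print(len(chars))      # 37
    print(len(raw_bytes))  # 37 (ASCII text: one byte per character)

A sub-word tokenizer would land somewhere between the word-level and character-level counts, while byte-level sequences are the longest of all.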

Not all LLM architectures are like this though. For example, structured state space models (SSMs) scale linearly with sequence length. Given these linearly scaling models, is it practical to tokenize text into components smaller than sub-words? What about removing the tokenization step entirely?
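
A back-of-the-envelope comparison (a sketch with made-up numbers, assuming byte-level sequences are roughly 4x longer than sub-word sequences for the same text) shows why linear scaling matters here:

    # Rough cost growth when moving from sub-word tokens to raw bytes.
    subword_len = 1_000           # hypothetical sub-word sequence length
    byte_len = 4 * subword_len    # assume ~4x more tokens for the same text

    # Self-attention cost grows quadratically with sequence length,
    # while an SSM's sequential scan grows linearly.
    attention_cost_ratio = (byte_len / subword_len) ** 2   # 16x more work
    ssm_cost_ratio = byte_len / subword_len                 # 4x more work

    print(attention_cost_ratio, ssm_cost_ratio)  # 16.0 4.0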

This new paper proposes MambaByte, a Mamba state space model trained on raw bytes, i.e. without tokenization. Operating on raw bytes removes the biases introduced by the tokenization step and, as the paper states, other issues “such as a lack of robustness to typos, spelling and capitalization variations, and morphological changes”. MambaByte outperforms existing models trained on raw bytes, and it generates text faster than transformers while remaining competitive in accuracy.

Figure 1 from the MambaByte paper, showing the bits-per-byte loss vs. training step and training FLOPs.

Multimodal Pathway

Multimodal models are trained on different modalities of data, like images, text, and audio. Typically, the different modalities are directly related to each other. For example, CLIP is trained on image-text pairs where in each pair, the text describes the image.

But what if we have unrelated multimodal data, like a dataset of images and a dataset of text, with no relationship between them? Could we somehow leverage the dataset of text to improve a model trained on images, and vice versa? A new paper claims to do just this, and their method is called Multimodal Pathway.

Given the following:

  • a model MX trained on modality X,
  • a model MY trained on modality Y,

their method modifies MX by injecting the weights of model MY, so that the computation at an arbitrary layer uses the sum of their weights, WX + λ·WY, where λ is a hyperparameter.
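
As a minimal sketch of this idea for a single linear layer (the class name, shapes, and PyTorch framing are ours, not the authors' implementation), the modified computation could look like this:

    import torch
    import torch.nn as nn

    class PathwayLinear(nn.Module):
        """Linear layer whose effective weight is W_X + lambda * W_Y."""

        def __init__(self, w_x: torch.Tensor, w_y: torch.Tensor, lam: float):
            super().__init__()
            self.w_x = nn.Parameter(w_x)          # trainable weights from MX
            self.register_buffer("w_y", w_y)      # frozen weights injected from MY
            self.lam = lam                        # the hyperparameter lambda

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Compute with the summed weights: x @ (W_X + lambda * W_Y)^T
            return x @ (self.w_x + self.lam * self.w_y).T

    # Hypothetical shapes and lambda value, purely for illustration.
    layer = PathwayLinear(torch.randn(512, 512), torch.randn(512, 512), lam=0.1)
    out = layer(torch.randn(8, 512))   # shape (8, 512)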

Figure 2 from the Multimodal Pathway paper, illustrating how weights from MY (yellow) are injected into model MX (blue).

With both sets of weights contained in MX, they then finetune MX further on the X modality. After finetuning, they merge the WX and WY weights as W = WX + λ·WY. Thus, at inference time, the computational cost is the same as that of the original MX model.
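
Continuing the sketch above (function name and shapes are ours), the merge folds both weight matrices into a single one, so inference runs a plain linear layer again:

    import torch
    import torch.nn as nn

    def merge_weights(w_x: torch.Tensor, w_y: torch.Tensor, lam: float) -> nn.Linear:
        """Fold W_X and lambda * W_Y into one matrix; inference cost matches the original MX layer."""
        out_features, in_features = w_x.shape
        merged = nn.Linear(in_features, out_features, bias=False)
        with torch.no_grad():
            merged.weight.copy_(w_x + lam * w_y)   # W = W_X + lambda * W_Y
        return merged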

Their results show improvements on a variety of datasets and modalities. For example, their point-cloud model shows relative improvements between 1.5% and 5.7% on standard benchmarks, when augmented with video, image, or audio models.

Figure 3 from the Multimodal Pathway paper, showing how image, video, point-cloud, and audio models are improved by models from other modalities.

CrossMAE

Masked Autoencoders (MAE) are a type of vision transformer model used for self-supervised learning of image features. Self-supervised learning means learning from data without human annotations. Typically this involves either contrastive learning, as with CLIP, or solving a reconstruction problem, as with MAE.

During training, MAE tries to reconstruct randomly masked parts of an image while only seeing the unmasked parts. Its decoder is a vision transformer, so it computes self-attention between all image patches, both masked and unmasked. A new paper argues that this is inefficient and unnecessary because, according to their analysis, the masked tokens often attend only to the unmasked tokens.
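
To make the masking setup concrete, here is a small sketch of random patch masking (the function name is ours and the 75% mask ratio follows common MAE practice; this is not the paper's code):

    import torch

    def random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
        """Split patch tokens into visible and masked subsets."""
        batch, num_patches, dim = patches.shape
        num_visible = int(num_patches * (1 - mask_ratio))

        # Random permutation of patch indices, independently per image.
        perm = torch.rand(batch, num_patches).argsort(dim=1)
        visible_idx, masked_idx = perm[:, :num_visible], perm[:, num_visible:]

        # Gather only the visible patches for the encoder.
        visible = torch.gather(
            patches, 1, visible_idx.unsqueeze(-1).expand(-1, -1, dim)
        )
        return visible, visible_idx, masked_idx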

Thus, they present a new method called CrossMAE. Its most notable differences from MAE are that:

  • It uses cross-attention, so that masked tokens only attend to unmasked tokens.
  • It reconstructs only a subset of all masked tokens.

The result is performance on par with, and sometimes better than, MAE, with less computation.
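
A sketch of these two changes at the decoding step (using PyTorch's generic multi-head attention layer; the dimensions, 49 visible / 147 masked patches, and the one-quarter prediction subset are illustrative choices of ours, not the paper's settings):

    import torch
    import torch.nn as nn

    dim, num_heads = 256, 8
    cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    batch, num_visible, num_masked = 2, 49, 147
    encoded_visible = torch.randn(batch, num_visible, dim)   # encoder output (unmasked patches)
    mask_queries = torch.randn(batch, num_masked, dim)       # mask tokens + positional embeddings

    # Difference 1: queries are masked tokens, keys/values are visible tokens only.
    # Difference 2: decode just a sampled subset of the masked positions.
    subset = torch.randperm(num_masked)[: num_masked // 4]
    queries = mask_queries[:, subset]

    decoded, _ = cross_attn(queries, encoded_visible, encoded_visible)
    print(decoded.shape)  # (2, 36, 256) -- only these positions are reconstructed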

Figure 3 from the CrossMAE paper, illustrating the difference between CrossMAE and MAE.

New models

Other news

Stay up to date

Interested in future weekly updates? Stay up to date by joining our Slack Community!