AI News #12

Here’s what caught our eye the past two weeks.

New Models

Stable Diffusion 3

  • The latest iteration of Stable Diffusion text-to-image models, ranging in size from 800M to 8B parameters.
  • The models use a diffusion transformer architecture and will accept multimodal input.
  • Announcement post.

Mistral Large

  • Mistral’s new flagship model outperforms competing products (except GPT-4) on multiple benchmarks.
  • Has a 32k-token context window and strong multilingual capabilities.
  • Announcement post.

Nemotron-4 15B

  • A 15B parameter model by Nvidia, trained on 8 trillion tokens.
  • Trained on 384 DGX H100 nodes, with 8 H100 80 GB GPUs per node (3,072 GPUs in total).
  • Paper.

Gemma

  • 2B and 7B open models by Google. They outperform similarly sized models on 11 out of 18 tasks.
  • Trained on 2 trillion and 6 trillion tokens respectively, with a 256k vocabulary.
  • Announcement post.

Gemini 1.5

  • Multimodal model by Google. The publicly available version has a 1 million token context length.
  • Has been criticized for the way it responds to certain requests.
  • Twitter summary.

Large World Model

  • Video and language model with a 1 million token context length. Uses an optimized version of RingAttention.
  • Project page.

SDXL-Lightning

  • State-of-the-art one-step open-source diffusion model for text-to-image generation.
  • Paper.

LoRA Land

  • 25 Mistral models finetuned on different tasks.
  • Project page.

PALO

  • Multimodal model covering 10 languages. Model and code to be released soon.
  • Paper.

MobileLLM

  • 125M and 350M parameter LLMs, with state-of-the-art performance among similarly sized models.
  • Paper.

New Datasets

A Touch, Vision, and Language Dataset for Multimodal Alignment

  • Multimodal dataset of 44k image-touch pairs.
  • Project page.

Aria Everyday Activities Dataset

  • Egocentric multimodal dataset. Data includes 3D point clouds, trajectories, and speech transcriptions.
  • Project page.

New Research

Universal Manipulation Interface

  • Training robots via hand-held manipulators operated by humans.
  • Project page.

LoRA+

  • Uses different learning rates for LoRA’s \(A\) and \(B\) matrices, for better performance and faster convergence.
  • Paper and GitHub repo.
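
The core idea can be sketched with a toy low-rank layer trained by plain gradient descent. Everything here is illustrative: the dimensions, the loss, and especially the factor of 16 between the two learning rates are assumptions, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy LoRA setup: frozen base weight W plus a low-rank update B @ A,
# so the effective weight is W + B @ A.
d, r = 8, 2
W = rng.normal(size=(d, d))           # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01    # LoRA A: small random init
B = np.zeros((d, r))                  # LoRA B: zero init (standard LoRA)

def mse(x, target):
    y = x @ (W + B @ A).T
    return ((y - target) ** 2).mean()

# LoRA+ idea: update B with a learning rate several times larger than A's.
# The ratio 16 here is an illustrative assumption, not a tuned value.
lr_A = 1e-3
lr_B = 16 * lr_A

x = rng.normal(size=(4, d))
target = rng.normal(size=(4, d))
initial_loss = mse(x, target)

for _ in range(200):
    y = x @ (W + B @ A).T
    grad_y = 2 * (y - target) / y.size    # d(mean squared error)/dy
    grad_eff = grad_y.T @ x               # gradient w.r.t. W + B @ A
    grad_A = B.T @ grad_eff               # chain rule through B @ A
    grad_B = grad_eff @ A.T
    A -= lr_A * grad_A
    B -= lr_B * grad_B
```

Note that with the standard zero init, \(B\) receives no update signal through \(A\) at first, which is one intuition for why it benefits from the larger step size.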

Genie

  • A model that can generate interactive visual environments (specifically platformer games), trained entirely on videos.
  • Paper.

Neural Network Diffusion

  • Diffusion models that generate the parameters of other neural networks.
  • Paper.

LongRoPE

  • Extends the context window of a pretrained 256k-context-length LLM to 2 million tokens.
  • Paper.

Chain-of-Thought Reasoning Without Prompting

  • Introduces “chain-of-thought decoding”, which means decoding the top-k paths, and selecting the final answer based on the most confident decoded path.
  • Paper.
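
The selection rule can be sketched with a tiny hand-written next-token table standing in for a real LM; all tokens and probabilities below are invented for illustration.

```python
# Chain-of-thought decoding, sketched: branch on the top-k first tokens,
# continue each branch greedily, then pick the branch whose tokens were
# decoded most confidently (largest top-1 vs. top-2 probability margin).
# The hand-written "model" below is a stand-in for a real LM.

TABLE = {
    (): {"5": 0.4, "Let's": 0.35, "I": 0.25},
    ("5",): {"<eos>": 1.0},
    ("Let's",): {"think:": 1.0},
    ("Let's", "think:"): {"2+2=4,": 1.0},
    ("Let's", "think:", "2+2=4,"): {"so": 1.0},
    ("Let's", "think:", "2+2=4,", "so"): {"4": 0.95, "5": 0.05},
    ("Let's", "think:", "2+2=4,", "so", "4"): {"<eos>": 1.0},
    ("I",): {"guess": 1.0},
    ("I", "guess"): {"5": 0.55, "4": 0.45},
    ("I", "guess", "5"): {"<eos>": 1.0},
}

def next_token_probs(prefix):
    return TABLE[tuple(prefix)]

def greedy_continue(prefix, max_len=10):
    """Greedy decoding, also recording the top-1/top-2 margin at each step."""
    path, margins = list(prefix), []
    for _ in range(max_len):
        probs = next_token_probs(path)
        ranked = sorted(probs.values(), reverse=True)
        margins.append(ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0))
        token = max(probs, key=probs.get)
        if token == "<eos>":
            break
        path.append(token)
    return path, margins

def cot_decode(k=3):
    """Branch on the top-k first tokens; return the most confident path."""
    first = next_token_probs([])
    best = None
    for tok in sorted(first, key=first.get, reverse=True)[:k]:
        # Margin of the (possibly non-greedy) branching token itself.
        alt = max(p for t, p in first.items() if t != tok)
        path, margins = greedy_continue([tok])
        confidence = (first[tok] - alt + sum(margins)) / (1 + len(margins))
        if best is None or confidence > best[0]:
            best = (confidence, path)
    return best[1]
```

In this toy table, plain greedy decoding answers "5" immediately, while the branch that happens to reason step by step is decoded with a higher average margin and yields "4".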

Repetition Improves Language Model Embeddings

  • Obtains higher-quality embeddings by passing the input into the model twice and using the embedding from the second occurrence of the input.
  • Paper.
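
A minimal sketch of the pooling trick, with a toy causal "model" (prefix-mean of random token embeddings) standing in for a real LM; the vocabulary and dimensions are illustrative assumptions.

```python
import numpy as np

# Echo embeddings, sketched: feed the input twice ("<text> <text>") through a
# causal model and pool only the hidden states of the second occurrence, whose
# positions can attend to the entire input. The toy "model" below (each
# position's state is the mean of its prefix embeddings) stands in for an LLM.

rng = np.random.default_rng(0)
EMBED = {w: rng.normal(size=16) for w in ["the", "cat", "sat"]}

def causal_states(tokens):
    # Toy causal LM: the hidden state at position i summarizes tokens[0..i].
    embs = np.stack([EMBED[t] for t in tokens])
    return np.cumsum(embs, axis=0) / np.arange(1, len(tokens) + 1)[:, None]

def echo_embedding(tokens):
    doubled = tokens + tokens                   # repeat the input
    states = causal_states(doubled)
    return states[len(tokens):].mean(axis=0)    # pool the 2nd occurrence only

def naive_embedding(tokens):
    # Baseline: pool a single pass, where early positions saw little context.
    return causal_states(tokens).mean(axis=0)
```

The point of the repetition is that in a single causal pass, early tokens cannot see later ones; in the second copy, every position has the whole input in its context.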

How to Train Data-Efficient LLMs

  • Asks an LLM to rate the quality of each sample in a dataset, then trains on just the top-rated samples.
  • Paper.
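
The pipeline reduces to "score, sort, keep the top fraction." In this sketch, `rate_sample` is a hypothetical stand-in for the LLM judge call; the crude lexical-diversity heuristic and the example corpus are invented for illustration.

```python
# Quality-based data selection, sketched: score every candidate sample with a
# judge, then train only on the top-rated fraction. `rate_sample` is a
# hypothetical stand-in for an LLM rating call (e.g. "rate this text 1-5").

def rate_sample(text: str) -> float:
    # Crude placeholder heuristic: lexical diversity as a quality proxy.
    words = text.split()
    return len(set(words)) / max(len(words), 1)

def select_top(dataset, keep_fraction=0.5):
    scored = sorted(dataset, key=rate_sample, reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return scored[:keep]

corpus = [
    "the the the the the",
    "a quick brown fox jumps over the lazy dog",
    "buy buy buy now now now",
    "language models improve with better data curation",
]
top = select_top(corpus, keep_fraction=0.5)
```

With a real LLM judge the scoring call is the expensive step, so scores are typically cached before the sort.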

Stay up to date

Interested in future updates? Join our Slack Community!