AI News #21

Here’s what caught our eye last week.

Instruction Hierarchy

  • This paper proposes giving different priority levels to instructions from different sources. For example, the system-level prompt has higher priority than the user prompt, which has higher priority than tool outputs. Training a model to follow this type of hierarchy results in better resilience to jailbreaking attacks.
  • Paper

Multi-head MoE

  • Splits token embeddings into sub-token embeddings, which get routed to different experts, and merged afterwards.
  • Paper

Retrieval Heads

  • This paper identifies attention heads that activate when retrieving data from long contexts. Deactivating these “retrieval heads” increases hallucinations much more than deactivating other heads. They observe this behavior across many LLMs.
  • Paper

Better Synthetic Data

  • Obtain better synthetic data by transforming existing datasets into a format tailored for a specific task.
  • Paper


  • Faster pre-training of CLIP-level vision models, by training just a vision model with classification loss, instead of a vision model + text model using contrastive loss. A classification loss is enabled by extracting nouns from each image caption and mapping them to WordNet synsets.
  • Paper


  • Use specific transformer blocks only for specific input tokens (like spaces), to improve the speed of byte-level-token models.
  • Paper


  • Faster LLMs with a combination of layer dropout and self-speculation decoding (early-exit inference + verification using all layers).
  • Paper


  • New fully open-source LLM (dataset, code, and weights publicly available) that outperforms other fully open-source LLMs.
  • Paper


  • New open-source model. Phi-3 mini (3.8 billion parameters) rivals Mixtral 8x7B
  • Paper

Snowflake Arctic

Stay up to date

Interested in future weekly updates? Stay up to date by joining our Slack Community!