AI News #20

Here’s what caught our eye last week.

Research

Llama 3

  • Meta released Llama 3 in 8B and 70B parameter sizes.
  • 8k context length.
  • The 8B model outperforms Gemma 7B and Mistral 7B Instruct on MMLU, HumanEval, and other benchmarks. A quick-start sketch follows this list.
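
If you want to try the model locally, a minimal sketch with Hugging Face transformers looks roughly like the following. The model id is the official 8B Instruct checkpoint on the Hub; the prompt and generation settings are illustrative, not Meta's recommended defaults, and you need to accept the model's license first.

```python
# Minimal sketch: load Llama 3 8B Instruct and generate a reply.
# Assumes an accepted license on the Hub and enough GPU memory;
# settings below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize last week's AI news in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```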

Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video

  • An approach that automatically converts videos of real-world scenes into realistic, interactive game environments.
  • Paper

Dynamic Typography: Bringing Words to Life

  • Given a letter and a text description of an animation, this method transforms the letter into the described animation.
  • Paper

BLINK: Multimodal Large Language Models Can See but Not Perceive

  • A new benchmark for multimodal LLMs that focuses on core visual perception abilities other benchmarks don’t cover.
  • Most of its tasks can be solved by humans “within a blink”, yet current multimodal LLMs struggle: humans average 95.70% accuracy, while GPT-4V and Gemini reach only 51.26% and 45.72%, respectively.
  • Paper

Learn Your Reference Model for Real Good Alignment

  • Introduces a new Direct Preference Optimization (DPO) algorithm called Trust Region DPO (TR-DPO), which updates the reference policy (commonly the supervised fine-tuned model) during training rather than keeping it frozen; a minimal sketch follows this list.
  • Outperforms vanilla DPO by up to 19% in GPT-4-based automatic evaluation.
  • Paper
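
For intuition, here is a minimal sketch of the idea: standard DPO keeps the reference model fixed, while TR-DPO periodically pulls the reference toward the current policy. The loss below is the standard DPO objective; the two update functions follow the paper's soft (α-weighted) and hard (every τ steps) schemes, but the hyperparameter values and helper names are illustrative, not taken from an official implementation.

```python
import copy
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO objective: widen the gap between chosen and rejected
    # completions, measured relative to the reference model.
    margins = beta * ((pi_chosen_logps - ref_chosen_logps)
                      - (pi_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(margins).mean()

# TR-DPO "soft update": blend current policy weights into the reference
# with weight alpha after each optimization step.
def soft_update(policy, reference, alpha=0.01):
    with torch.no_grad():
        for p_ref, p in zip(reference.parameters(), policy.parameters()):
            p_ref.mul_(1.0 - alpha).add_(p, alpha=alpha)

# TR-DPO "hard update": replace the reference with a copy of the policy
# every tau optimization steps.
def hard_update(policy, reference, step, tau=512):
    if step > 0 and step % tau == 0:
        reference.load_state_dict(copy.deepcopy(policy.state_dict()))
```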

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

  • Introduces Megalodon, a neural architecture for efficient sequence modeling with unlimited context length.
  • Builds on the Mega architecture (moving average equipped gated attention); the EMA recurrence it extends is sketched below.
  • Comparable in performance to Llama 2 7B and 13B.
  • Code
  • Paper
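
Megalodon itself combines a complex-valued EMA ("CEMA"), chunk-wise attention, and several normalization changes; the snippet below only illustrates the damped EMA recurrence from Mega that Megalodon builds on, with made-up shapes and parameter values, not the actual architecture.

```python
# Illustrative sketch of Mega-style damped EMA: y_t = a * x_t + (1 - a * d) * y_{t-1}.
# Not the full Megalodon model, which uses a complex-valued, multi-dimensional
# variant plus chunk-wise attention.
import torch

def damped_ema(x, alpha, delta):
    """x: (seq_len, dim); alpha, delta: (dim,) decay parameters in (0, 1)."""
    h = torch.zeros(x.shape[-1])
    out = []
    for x_t in x:
        h = alpha * x_t + (1.0 - alpha * delta) * h
        out.append(h)
    return torch.stack(out)

# Because the recurrence carries a fixed-size state, it can be rolled forward
# over arbitrarily long inputs, which is the intuition behind the
# "unlimited context length" framing.
seq = torch.randn(16, 8)
y = damped_ema(seq, alpha=torch.full((8,), 0.5), delta=torch.full((8,), 0.9))
print(y.shape)  # torch.Size([16, 8])
```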

Leaderboards

HF medical leaderboard

  • LLMs have shown promise in healthcare settings such as medical Q&A, but the stakes are much higher when an LLM-based tool is used in a clinical setting: mistakes can be harmful or even fatal.
  • The benchmark covers general medical knowledge, clinical knowledge, anatomy, genetics, and more, to help bridge the gap between LLMs’ potential for medical use cases and proper evaluation.
  • The backend uses EleutherAI’s Language Model Evaluation Harness (see the sketch below).
  • Read more
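
As a rough sketch of what an equivalent local run with the harness might look like: the task names, model id, and settings below are illustrative and may not match the leaderboard’s exact configuration or your installed harness version.

```python
# Rough sketch: evaluate a model on medical QA tasks with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Task names and the model id
# are placeholders; check the harness docs for the tasks your version provides.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",  # any HF causal LM
    tasks=["pubmedqa", "medmcqa"],                       # illustrative medical tasks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```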

HF coding leaderboard

  • A new benchmark for assessing LLM code generation capabilities.
  • Contains a standard code generation task as well as harder tasks intended to test next-generation AI coding assistants more robustly:

    1. Code Generation: Standard task of generating a correct solution from a natural language description.
    2. Self-Repair: Generate a code fix given error feedback.
    3. Code Execution: Predict the output of a program on a given input (a small example of this style of problem follows the list).
    4. Test Output Prediction: Same as Code Execution, except the program is only described, not actually implemented.
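
To make the task formats concrete, here is a hypothetical example of the kind of problem the Code Execution task poses; it is not taken from the benchmark. The model is shown a short program and an input and must predict what it prints without running anything.

```python
# Hypothetical Code Execution-style problem (illustrative, not from the benchmark):
# predict the printed output of this program.
def f(nums):
    total = 0
    for i, n in enumerate(nums):
        if i % 2 == 0:
            total += n
    return total

print(f([3, 1, 4, 1, 5]))  # the model should predict: 12
```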

Read more here.

Stay up to date

Interested in future weekly updates? Join our Slack Community!