NeurIPS 2023 - What's the buzz?

By Isha Ghodgaonkar, Liam Li

December 19, 2023

Last week, we went to NeurIPS to spread the word about Determined and catch up on the latest research. Here are some of our takeaways:

General Takeaways:

Value of Training LLMs: We talked to both academic and industry researchers about how finetuning and even pretraining models unlocks significant value.
Importance of Data: We observed increasing recognition of the importance of data during both training and evaluation, especially as model sizes scale up. Interest in synthetic data is growing because of the emerging “token crisis”: as we squeeze all the value out of available internet data, there is evidence that the value of repeating data diminishes drastically beyond about 4 epochs.
Renewed Excitement around MoE: There was lots of discussion about Mixture of Experts given the release of Mistral’s new models. Parameter efficiency, scaling laws, and GPU memory and latency tradeoffs are all current topics of interest.
Alternative Architectures: Mamba models are among the first transformer alternatives to demonstrate comparable performance. The authors recently released model code, weights, and a blog post. Check it out on X. Also check out RWKV, another novel architecture that combines transformers with RNNs.

Model code: https://t.co/iASe4J6Rb6
Model weights: https://t.co/euk2vGpBE6
Blogpost: https://t.co/tVSSrNy5OS https://t.co/46KnMh1KNJ
2/
— Tri Dao (@tri_dao) December 12, 2023

RAG: We saw lots of exhibits feature Retrieval Augmented Generation as a focus for practical industry use cases.
AI for Science: Lots of our conversations at the Determined booth centered around researchers applying machine learning to DNA sequences, protein data, drug discovery, chemistry, etc. The main conference featured multiple posters on these topics. There were also workshops on AI for science, including climate change and drug discovery.
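
To make the MoE point above about parameter efficiency and memory versus latency a bit more concrete, here is a minimal sketch in Python. The layer sizes, the expert count, the top-2 routing, and the two-matrix feed-forward block are all illustrative assumptions, not figures from Mistral's models or from any talk we attended.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized over the selected experts only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def ffn_params(d_model, d_ff):
    """Parameters in a simple two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff


# Hypothetical configuration, chosen only for illustration.
D_MODEL, D_FF, NUM_EXPERTS, TOP_K = 1024, 4096, 8, 2

# Every expert must be held in GPU memory ...
stored_params = NUM_EXPERTS * ffn_params(D_MODEL, D_FF)
# ... but each token only runs through TOP_K of them.
active_params = TOP_K * ffn_params(D_MODEL, D_FF)

routes = top_k_route([0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.9], TOP_K)
```

In this toy setup each token touches a quarter of the stored expert parameters, which is the tradeoff in a nutshell: per-token compute scales with the active experts, while memory scales with all of them.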

Panel on Beyond Scaling

This panel talked about scaling LLMs, specifically:

Challenges of training models at scale, e.g. building system tools to detect silent GPU failures where matrix multiplication is being computed incorrectly.
Challenges of model alignment and sensitivity to preference data.
Model evaluation challenges.
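
The panel did not go into how such detection tools are built. As one possible illustration of checking a matrix product without recomputing it, here is a sketch of a Freivalds-style randomized check in pure Python. The function names, the tolerance, and the use of random real-valued probe vectors are our assumptions; a real tool would run on the accelerator and has to account for floating-point nondeterminism.

```python
import random


def matmul(a, b):
    """Reference O(n^3) matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def freivalds_check(a, b, c, trials=5, tol=1e-6, rng=None):
    """Probabilistically test whether c equals a @ b.

    Each trial costs three matrix-vector products instead of a full
    matrix product: it compares a @ (b @ x) with c @ x for a random x.
    """
    rng = rng or random.Random(0)
    n = len(c[0])
    for _ in range(trials):
        x = [rng.uniform(0.5, 1.5) for _ in range(n)]
        expected = matvec(a, matvec(b, x))
        observed = matvec(c, x)
        for e, o in zip(expected, observed):
            # Relative tolerance absorbs ordinary rounding differences.
            if abs(e - o) > tol * (1.0 + abs(e)):
                return False
    return True


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
good = matmul(a, b)

# Simulate a silent fault by corrupting a single output element.
bad = [row[:] for row in good]
bad[0][1] += 1.0
```

Here `freivalds_check(a, b, good)` passes and `freivalds_check(a, b, bad)` fails, even though only one element of the result was corrupted.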

Christopher Re’s Talk on Foundation Models

Christopher Re, Associate Professor at Stanford, gave a talk titled “Systems for Foundation Models, and Foundation Models for Systems”. Some takeaways:

LLMs completely change the paradigm for “death-by-a-thousand-cuts” problems, like data cleaning.
We don’t need GPT-4 for everything: there is a lot of value to extract from finetuning existing open models, as well as from using smaller models.
There’s lots of exciting research to be done at the intersection of systems and machine learning. FlashAttention is one such example: it takes core concepts from database systems, like chunking and operator fusion, to drastically speed up attention while reducing GPU memory usage.
The talk also briefly covered the work his lab has done on State Space models and their benefits for long sequences as well as connection to CNN and RNNs.
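
The chunking idea mentioned for FlashAttention can be sketched in a few lines of pure Python. This is not the FlashAttention kernel itself (there is no operator fusion and no GPU code here); it only illustrates, under our own simplifying assumption of a single query vector, how a running softmax lets attention be computed one chunk of keys and values at a time without ever holding all the scores in memory.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def naive_attention(q, keys, values):
    """Single-query softmax attention that materializes every score at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def chunked_attention(q, keys, values, chunk_size=2):
    """Same output, computed chunk by chunk with a running max and sum."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values seen so far
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [dot(q, k) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max
        # (math.exp(-inf) is 0.0, so the first chunk starts from zero).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [x * scale for x in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [x / running_sum for x in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.7]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = naive_attention(q, keys, values)
tiled = chunked_attention(q, keys, values, chunk_size=2)
```

Both functions return the same vector up to floating-point rounding; the chunked version only ever needs `chunk_size` scores at a time, which is where the memory saving comes from.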

Workshop on Instruction Tuning

This workshop featured several keynotes - we highly recommend listening to Sara Hooker and Alex Tamkin’s talk (6:00:00 in the recording) - and covered topics like:

The difficulty of collecting high quality human preference data.
Alignment gaps for underrepresented languages, and the work Cohere is doing to expand language coverage (the Aya project).
New interaction paradigms beyond instruction tuning. For example, language models eliciting preferences from humans by asking questions.

The field is moving so quickly that many people at the workshop were already talking about OpenAI’s paper on how weak supervision from a less capable model can be used to improve a stronger model, a setup intended as an analogy for how humans might supervise superhuman AI in the future.

Workshops on Efficient Techniques for Training & Efficient NLP

These workshops (Efficient Techniques for Training, Efficient NLP) covered:

Luke Zettlemoyer’s work on embarrassingly parallel training of language models: Branch Train Merge (BTM) and cluster BTM
Sparse backprop for MoE
MatFormer: Nested MLP layers borrowing ideas from neural architecture search
Tuning transformers to ~70% model FLOPS efficiency by Aleph Alpha
Identifying subnetworks of a pretrained network that are optimal for multi-objective criteria, using a one-shot NAS supernetwork
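
For readers unfamiliar with the “~70% model FLOPS efficiency” figure above, here is a rough sketch of how such a number is commonly estimated. We are assuming it refers to model FLOPs utilization (MFU) and using the common approximation of about 6 FLOPs per parameter per training token, which ignores attention FLOPs. The model size, throughput, GPU count, and per-GPU peak below are hypothetical values picked for illustration, not Aleph Alpha's setup.

```python
def model_flops_utilization(num_params, tokens_per_second, num_gpus,
                            peak_flops_per_gpu):
    """Estimate MFU as achieved training FLOPs/s over peak hardware FLOPs/s.

    Uses the approximation that one training token costs about
    6 * num_params FLOPs (forward plus backward pass).
    """
    achieved = 6 * num_params * tokens_per_second
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Hypothetical example: a 7B-parameter model on 8 GPUs, each assumed
# to peak at 312 TFLOP/s, processing 41,600 tokens per second overall.
mfu = model_flops_utilization(
    num_params=7e9,
    tokens_per_second=41_600,
    num_gpus=8,
    peak_flops_per_gpu=312e12,
)
```

With these made-up numbers the estimate comes out at about 0.70, i.e. roughly 70% of the theoretical peak.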

Stay up to date

Interested in future weekly updates? Stay up to date by joining our Slack Community!
