Last week, we went to NeurIPS to spread the word about Determined and catch up on the latest research. Here are some of our takeaways:
- Value of Training LLMs: We talked to both academic and industry researchers about how finetuning and even pretraining models unlock significant value.
- Importance of Data: We observed increasing recognition of the importance of data during both training and evaluation, especially as model sizes scale up. There is also growing interest in synthetic data due to the emerging “token crisis” as we squeeze all the value out of available internet data (there is evidence that the value of repeated data diminishes drastically beyond about 4 epochs).
- Renewed Excitement around MoE: There was lots of discussion about Mixture of Experts given the release of Mistral’s new models. Parameter efficiency, scaling laws, and GPU memory and latency tradeoffs are all current topics of interest.
- Alternative Architectures: Mamba models are one of the first transformer alternatives to demonstrate comparable performance. The authors recently released model code, weights, and a blog post. Check it out on X. Also check out RWKV, another novel architecture that combines transformers with RNNs.
- RAG: We saw lots of exhibits featuring Retrieval Augmented Generation as a focus for practical industry use cases.
- AI for Science: Lots of our conversations at the Determined booth centered around researchers applying machine learning to DNA sequences, protein data, drug discovery, chemistry, etc. The main conference featured multiple posters on these topics. There were also workshops on AI for science, including climate change and drug discovery.
Panel on Beyond Scaling
This panel talked about scaling LLMs, specifically:
- Challenges of training models at scale, e.g. building system tools to detect silent GPU failures where matrix multiplication is being computed incorrectly.
- Challenges of model alignment and sensitivity to preference data.
- Model evaluation challenges.
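The panel did not go into how such detection tools are built. As one illustration of the idea behind catching an incorrectly computed matrix multiplication, here is a minimal sketch in plain Python of a checksum test, an approach often called algorithm-based fault tolerance. It relies on the identity that the column sums of `A @ B` equal the column sums of `A` multiplied by `B`, which is much cheaper to verify than recomputing the full product. The function names and the tolerance are assumptions made for this example, not something presented at the panel.

```python
def matmul(a, b):
    """Plain matrix product for lists of lists (stand-in for the GPU result)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def column_sums(m):
    """Sum each column of a list-of-lists matrix."""
    return [sum(col) for col in zip(*m)]


def product_passes_checksum(a, b, c, tol=1e-6):
    """Return True if c is consistent with a @ b under a column-sum checksum.

    Uses the identity (1^T A) B == 1^T (A B): the check needs one
    vector-matrix product instead of a second full matrix product.
    """
    a_sums = column_sums(a)
    expected = [sum(a_sums[k] * b[k][j] for k in range(len(b)))
                for j in range(len(b[0]))]
    actual = column_sums(c)
    # Relative tolerance absorbs ordinary floating-point rounding.
    return all(abs(e - s) <= tol * max(1.0, abs(e))
               for e, s in zip(expected, actual))
```

A check like this can miss errors that happen to cancel out within a column, so it is best read as a cheap first-line signal rather than a guarantee of correctness.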
Christopher Ré’s Talk on Foundation Models
Christopher Ré, Associate Professor at Stanford, gave a talk titled “Systems for Foundation Models, and Foundation Models for Systems”. Some takeaways:
- LLMs completely change the paradigm for “death-by-a-thousand cuts” problems, like data cleaning.
- We don’t need GPT-4 for everything: there is a lot of value to extract from finetuning existing open models, as well as from using smaller models.
- There’s lots of exciting research to be done at the intersection of systems and machine learning. FlashAttention is one example: it borrows core concepts from database systems, like chunking and operator fusion, to drastically speed up attention while reducing GPU memory usage.
- The talk also briefly covered the work his lab has done on state space models, their benefits for long sequences, and their connections to CNNs and RNNs.
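To make the chunking idea mentioned for FlashAttention more concrete, here is a minimal sketch in plain Python, assuming a single query vector and a running ("online") softmax. It processes keys and values one chunk at a time and never materializes the full list of attention weights, yet it produces the same output as the standard computation. This is not the FlashAttention kernel itself, which fuses these steps on the GPU; the function names and `chunk_size` parameter are made up for illustration.

```python
import math


def _scaled_scores(q, keys):
    """Dot-product scores between one query and a list of keys, scaled by 1/sqrt(d)."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]


def attention_reference(q, keys, values):
    """Standard softmax attention for one query; holds all weights in memory."""
    scores = _scaled_scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_chunked(q, keys, values, chunk_size=2):
    """Same result as attention_reference, computed one chunk of keys at a time."""
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of values
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = _scaled_scores(q, k_chunk)
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first chunk).
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_chunk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The only state carried between chunks is the running maximum, the running denominator, and the accumulator, which is why memory use no longer grows with sequence length in this formulation.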
Workshop on Instruction Tuning
This workshop featured several keynotes - we highly recommend listening to Sara Hooker and Alex Tamkin’s talk (6:00:00 in the recording) - and covered topics like:
- The difficulty of collecting high quality human preference data.
- Alignment gaps for underrepresented languages, and the work Cohere is doing to expand language coverage (the Aya project).
- New interaction paradigms beyond instruction tuning. For example, language models eliciting preferences from humans by asking questions.
The field is moving so quickly that many people at the workshop were already talking about OpenAI’s paper on how weak supervision from a less capable model can be used to improve a stronger model (an analogy for how humans might supervise superhuman AI in the future).
Workshops on efficient techniques for training & efficient NLP
These workshops (Efficient Techniques for Training, Efficient NLP) covered:
Stay up to date
Interested in future weekly updates? Stay up to date by joining our Slack Community!