Optimizing Workloads: The Preemption Puzzle Solved by Determined AI


Preemption: A Closer Look

When I talk to machine learning engineers at conferences or community events, I find that some have never heard of the term “preemption”. So it’s not surprising that once it’s explained, I’m met with reactions like “Whoa! I had no idea you could do that.” So what is preemption anyway?

💡 Preemption refers to stopping a task and saving its state so that it can be restarted later, at an appropriate time, from where it left off.
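To make the definition concrete, here is a minimal sketch of the stop, save, and restart cycle in plain Python. It is not Determined's implementation: the `train` function, the JSON checkpoint file, and the `should_preempt` callback are all hypothetical names chosen for illustration, and the "training step" is just a number shrinking.

```python
import json
import os


def train(total_steps, ckpt_path, should_preempt):
    """Toy training loop that can be stopped and later resumed from a checkpoint."""
    # Restore saved state if a checkpoint exists; otherwise start from scratch.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "loss": 1.0}

    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] *= 0.9  # stand-in for a real training step

        # Check for a preemption request at a safe boundary (after a full step).
        # `should_preempt` is a hypothetical callback, not a Determined API.
        if should_preempt(state["step"]):
            with open(ckpt_path, "w") as f:
                json.dump(state, f)  # save state so a later call can resume
            return "preempted", state

    return "finished", state
```

Calling `train(10, path, lambda step: step == 4)` stops after step 4 and writes the checkpoint. Calling `train(10, path, lambda step: False)` afterwards picks up at step 5 rather than starting over, which is the whole point of preemption.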

Let’s imagine a scenario

You are a machine learning engineer and you normally work on a cluster that has 8 available GPUs. You’re running a hyperparameter search experiment that you want to write a report on by the end of the day. You want to utilize the whole cluster, so you configure the experiment to run one trial per GPU. Easy enough with Determined – this is also a pretty efficient way to get your experiment to finish quickly.

Suddenly, your coworker at an important conference needs to run a demo RIGHT NOW. On your cluster. Because your team normally shares clusters, and you forgot that his demo was happening today.

What now? Normally, you’d just kill your experiments until the cluster is free again and tell your manager your deliverable will be a day late since a higher priority job needs to run. That is not ideal – what if there was a way to more elegantly solve this resource allocation conundrum?

How would Determined solve this?

Determined would solve this problem by preempting some of your trials. Let’s say your coworker’s demo needs 4 GPUs. Assuming you’ve enabled priority scheduling and both jobs were submitted with appropriate priority values (let’s call your experiment Priority 2 and your coworker’s experiment Priority 1, where in this example the lower number is the more urgent job), Determined would preempt 4 of your trials, let your coworker’s demo run on the freed GPUs, and then automatically resume your preempted trials once the demo is done. That way, there are no emergency terminations of experiments, no strained demos and no late deliverables. Basically, no sweat. Just efficient resource allocation 😎
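If you like to see the logic spelled out, the scenario can be modeled with a toy scheduler. This is a simplified sketch, not Determined's scheduler: the `Job` class and `schedule` function are hypothetical, it assumes a lower priority number means a more urgent job (as in the example above), and it simply preempts whichever lower-priority trials no longer fit.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # assumption: lower number = more urgent, as in the example
    gpus: int


def schedule(jobs, total_gpus):
    """Return (running, preempted) job names for a toy priority scheduler."""
    running, preempted = [], []
    free = total_gpus
    # sorted() is stable: most urgent first, ties keep submission order.
    for job in sorted(jobs, key=lambda j: j.priority):
        if job.gpus <= free:
            running.append(job.name)
            free -= job.gpus
        else:
            # Not enough GPUs left: this job waits (with its state saved).
            preempted.append(job.name)
    return running, preempted


# Eight single-GPU trials at Priority 2, then a 4-GPU demo at Priority 1.
trials = [Job(f"trial-{i}", priority=2, gpus=1) for i in range(8)]
demo = Job("demo", priority=1, gpus=4)

running_during_demo, preempted_during_demo = schedule(trials + [demo], total_gpus=8)
running_after_demo, preempted_after_demo = schedule(trials, total_gpus=8)
```

While the demo is on the cluster, it runs alongside four of the trials and the other four are preempted. Once the demo is gone, rerunning `schedule` with only the trials puts all eight back on the GPUs.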

How else can you use preemption?

The example above illustrates how Determined’s priority scheduler uses preemption, but you can also just pause and resume your experiments yourself at will. Yes, there’s literally a button for that in our WebUI:

[Image: the pause/resume button in the WebUI]

So if you want to manually make room for another workload on your cluster at any time of the day, be our guest.

Why else might you need preemption?

Maybe your team routinely has high-urgency training workloads that require you to free up nearly all the nodes on your cluster, as in the scenario above, or maybe it doesn’t. Either way, as an ML engineer you might run into plenty of other situations where preemption could be useful. Do any of these sound familiar?

  • You’re training a very large model (like an LLM) over the course of days or weeks, which takes up the whole cluster, and it would be nice if you could intermittently run jobs to finetune smaller models so you can alternate efficiently between both goals.
  • You’re experimenting with a new training script (or two or three). You were interested in the training behavior at the beginning of one of your test experiments. Later, you realize you’re interested in seeing how that specific experiment progressed a bit more – but you have to restart the whole thing and wait. It would have been nice to have had the ability to pause it and hit resume when needed.
  • You want to avoid job disruption during cluster upgrades.

Kubernetes support

If you are running Determined on Kubernetes, preemption also applies to you! Check out the docs here.

Conclusion

To recap, here are the two ways you can use preemption:

  1. Priority scheduling mode
  2. Manual experiment pause/resume

Preemption is a complex feature from an engineering perspective – that’s why you shouldn’t have to worry about it. If this article resonated with you, try us out! As always, we are available to help. Please join our Slack Community and don’t hesitate to reach out if you have any questions.