September 26, 2023
When I talk to machine learning engineers at conferences or community events, I find that some have never heard the term “preemption”. It’s not surprising that once it’s explained, I’m met with reactions like “Whoa! I had no idea you could do that.” So what is preemption anyway?
💡 Preemption refers to stopping a task and saving its state so that it can be restarted at an appropriate time.
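To make that definition concrete, here is a minimal Python sketch of the stop, save, and resume cycle. It is not Determined’s API: `train`, `should_preempt`, and the JSON checkpoint file are hypothetical stand-ins chosen for illustration, and the “training” is just a running sum.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, should_preempt):
    """Run up to total_steps; stop early and save state if should_preempt(step) is True.

    Returns (state, finished). All names here are hypothetical, for illustration only.
    """
    # Resume from the saved state if a checkpoint exists, otherwise start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "loss_sum": 0.0}

    while state["step"] < total_steps:
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for real training work

        # Preemption: persist everything needed to pick up where we left off.
        if should_preempt(state["step"]):
            with open(ckpt_path, "w") as f:
                json.dump(state, f)
            return state, False

    return state, True


ckpt = os.path.join(tempfile.mkdtemp(), "state.json")

# First run is preempted after step 4 and saves its state.
state, finished = train(10, ckpt, lambda step: step == 4)

# Second run loads the checkpoint and continues from step 5 instead of step 1.
state, finished = train(10, ckpt, lambda step: False)
```

The important part is that the second call does not repeat steps 1 through 4: the work done before the interruption is kept.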
Picture this: you are a machine learning engineer and you normally work on a cluster that has 8 available GPUs. You’re running a hyperparameter search experiment that you want to write a report on by the end of the day. You want to utilize the whole cluster, so you configure the experiment to run one trial per GPU. Easy enough with Determined – this is also a pretty efficient way to get your experiment to finish quickly.
Suddenly, your coworker at an important conference needs to run a demo RIGHT NOW. On your cluster. Because your team normally shares clusters, and you forgot that his demo was happening today.
What now? Normally, you’d just kill your experiments until the cluster is free again and tell your manager your deliverable will be a day late since a higher priority job needs to run. That is not ideal. What if there were a more elegant way to solve this resource allocation conundrum?
Determined would solve this problem by preempting some of your trials. Let’s say your coworker’s demo needs 4 GPUs. Given that you’ve enabled priority scheduling and submitted jobs with the appropriate priority values (let’s call your experiment Priority 2 and your coworker’s experiment Priority 1, where the lower number is the higher priority), Determined would then preempt 4 of your trials and let your coworker’s demo job run before automatically resuming your experiment. That way, no emergency terminations of experiments, no strained demos and no late deliverables. Basically, no sweat. Just efficient resource allocation 😎
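The scenario above can be played out with a toy model. This is only a sketch of the idea, not Determined’s actual scheduler: `Task` and `schedule` are hypothetical names, and which four trials get preempted is an artifact of the simple ordering used here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int           # lower number = higher priority, as in the example above
    gpus: int = 1
    state: str = "pending"  # "pending", "running", or "preempted"


def schedule(tasks, total_gpus):
    """Hand out GPUs in priority order; running tasks that no longer fit are preempted."""
    free = total_gpus
    # sorted() is stable, so tasks with equal priority keep their submission order.
    for task in sorted(tasks, key=lambda t: t.priority):
        if task.gpus <= free:
            free -= task.gpus
            task.state = "running"    # newly started, still running, or resumed
        elif task.state == "running":
            task.state = "preempted"  # state is saved and the GPU is released
    return tasks


# Your hyperparameter search: 8 trials, one GPU each, Priority 2.
trials = [Task(f"trial-{i}", priority=2) for i in range(8)]
schedule(trials, total_gpus=8)        # all 8 trials are running

# Your coworker's demo arrives: 4 GPUs, Priority 1.
demo = Task("demo", priority=1, gpus=4)
schedule(trials + [demo], total_gpus=8)
preempted = [t.name for t in trials if t.state == "preempted"]  # 4 trials

# The demo finishes, so the scheduler runs again without it.
schedule(trials, total_gpus=8)        # the 4 preempted trials resume
```

Nothing is killed at any point in this model: the four preempted trials simply wait with their state saved and return to “running” once the GPUs free up.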
The example above illustrates how Determined’s priority scheduler uses preemption, but you can also just pause and resume your experiments yourself at will. Yes, there’s literally a button for that in our WebUI:
So if you want to manually make room for another workload on your cluster at any time of the day, be our guest.
Maybe your team’s workflow includes routine, urgent training workloads that require you to free up nearly all the nodes on your cluster, as in the scenario above, or maybe it doesn’t. As an ML engineer, there are a lot of other things you might run into where preemption could be useful. Do any of these sound familiar?
If you are running Determined on Kubernetes, preemption also applies to you! Check out the docs here.
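To recap, there are two ways you can use preemption: automatically, by letting Determined’s priority scheduler preempt lower priority work when a higher priority job arrives, and manually, by pausing and resuming your experiments yourself from the WebUI.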
Preemption is a complex feature from an engineering perspective – that’s why you shouldn’t have to worry about it. If this article resonated with you, try us out! As always, we are available to help. Please join our Slack Community and don’t hesitate to reach out if you have any questions.