Getting the most out of your GPU cluster for deep learning: part I

To maximize the value of your deep learning hardware, you’ll need to invest in software infrastructure. Setting up a cluster manager is an essential first step in this process, but it’s not the end of the story.

Imagine you are the manager of your company’s core ML team. Given the excitement around deep learning, you’ve encouraged one of your engineers to start experimenting with it. After a few weeks of exploration, initial results suggest that a deep network leads to significant improvement over what your traditional ML methods have been able to achieve. Awesome!

Given these promising findings, you move quickly to scale up your investment: you hire more DL engineers and buy your own GPU cluster. (Why go on-prem? Read our post on why we see this as an emergent trend in deep learning.) Between getting your new recruits ramped up and setting up your newly purchased hardware, you’ve got a lot to do!

In the midst of this transition, it can be hard to think about scaling your ML software infrastructure. Furthermore, you might not see the need—Keras and PyTorch worked just fine for your initial experiments. What you don’t realize, however, is that scaling your DL cluster will introduce entirely new kinds of challenges. Here’s one:

How do you share your new GPU hardware among a growing team of DL engineers?

Surprisingly, even sophisticated teams we talk to often adopt quite low-tech solutions to this challenge, such as

  • Fixed schedule (e.g., Anne can use GPU box 1 on Mondays, Michael can use it on Tuesdays);
  • Dedicated GPU assignment (e.g., Anne gets GPU box 1 and Michael gets GPU box 2); or
  • Calendar signup (e.g., a spreadsheet where Anne and Michael reserve time on the cluster based on their needs).

Unfortunately, while these manual systems are easy to set up, they are also

  • Bad for cluster utilization: Restricting individual usage (to certain hours of the day or specific GPUs) limits your team’s ability to keep its hardware busy running experiments.
  • Inflexible: Suppose you want to use all of your GPUs to solve a single complex problem (e.g., distributed training or batch offline scoring of a large data set), or the GPU needs of different engineers change over time. A fixed schedule or static assignment will need to be manually adjusted.
  • Error-prone: It is easy for people to accidentally schedule work on the same GPUs, often with disastrous (and hard-to-diagnose) effects, such as jobs silently dying or getting corrupted.
  • Difficult to scale: As you buy more hardware and hire more engineers, manual allocation becomes infeasible. You’ll find yourself sinking more and more time into administrative overhead.
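To see how easily a calendar signup goes wrong, here is a minimal Python sketch of the check someone would have to run by hand on every edit to the spreadsheet. The `Reservation` fields, the hour-based time slots, and the sample signups are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Reservation:
    """One row of a hypothetical signup spreadsheet."""
    user: str
    gpu_box: int
    start_hour: int  # inclusive
    end_hour: int    # exclusive


def find_conflicts(reservations):
    """Return pairs of reservations that claim the same GPU box at overlapping times."""
    conflicts = []
    for a, b in combinations(reservations, 2):
        same_box = a.gpu_box == b.gpu_box
        overlap = a.start_hour < b.end_hour and b.start_hour < a.end_hour
        if same_box and overlap:
            conflicts.append((a, b))
    return conflicts


# Illustrative signups: Anne and Michael overlap on box 1 between 12:00 and 13:00.
signups = [
    Reservation("Anne", 1, 9, 13),
    Reservation("Michael", 1, 12, 17),
    Reservation("Michael", 2, 9, 12),
]

conflicts = find_conflicts(signups)
for a, b in conflicts:
    print(f"{a.user} and {b.user} both reserved GPU box {a.gpu_box}")
```

Nothing in a spreadsheet enforces this check, and the pairwise comparison grows quadratically with the number of reservations, which is one way the administrative overhead creeps up.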

Traditional Cluster Management Software

Fortunately, the challenge of running heterogeneous workloads on a cluster of shared compute resources is not a new one. Cluster management software such as Kubernetes, DC/OS (built on top of Apache Mesos), and Slurm allows you to treat a collection of machines as a unified pool of hardware resources. This makes running workloads on your cluster dramatically easier: users only need to think about launching new containers, not about managing individual machines. Cluster management software often provides ancillary services that simplify distributed computing, such as fault tolerance, networking, and security. Historically, these frameworks have focused on managing CPU, memory, and disk resources, but recently all three frameworks have been updated to include basic support for GPU resources.
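The "unified pool" idea can be illustrated with a toy model. The sketch below is not how Kubernetes, DC/OS, or Slurm is implemented; it is a simplified first-fit placer, with hypothetical machine and job names, that shows the core shift: users ask for a number of GPUs, and the system decides which machine provides them.

```python
from typing import Dict, Optional


class GpuPool:
    """Toy stand-in for a cluster manager's view of free GPUs per machine."""

    def __init__(self, free_gpus: Dict[str, int]):
        self.free_gpus = dict(free_gpus)
        self.assignments: Dict[str, str] = {}

    def place(self, job: str, gpus_needed: int) -> Optional[str]:
        """First-fit placement: return the chosen machine, or None if no machine fits."""
        for machine, free in self.free_gpus.items():
            if free >= gpus_needed:
                self.free_gpus[machine] = free - gpus_needed
                self.assignments[job] = machine
                return machine
        return None


# Hypothetical cluster: one 4-GPU box and one 8-GPU box.
pool = GpuPool({"gpu-box-1": 4, "gpu-box-2": 8})

placements = {
    "anne-resnet": pool.place("anne-resnet", 4),
    "michael-bert": pool.place("michael-bert", 6),
    "anne-sweep": pool.place("anne-sweep", 4),  # only 2 GPUs remain free, so this does not fit
}
print(placements)
```

A real cluster manager would typically queue the job that does not fit and start it once resources free up, rather than simply reporting failure as this sketch does.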

Spinning up a new cluster manager can be a little involved, especially if no one on your team has a systems background. However, compared to the ongoing headaches of manual GPU management, you expect that adopting one of these systems will pay off very quickly, and so you decide to move forward with Kubernetes.

What’s Missing?

With your functioning Kubernetes cluster, you expect your resource management problems to be fully solved. The early signs are promising. Your engineers no longer have to SSH into a particular machine to start a job; they can instead submit a containerized deep learning training job to the cluster, specifying the number of GPUs they’ll need.
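As a rough sketch of what such a submission might look like, the Python snippet below builds a Kubernetes Pod manifest that requests GPUs and prints it as JSON. It assumes the cluster runs the NVIDIA device plugin, which exposes GPUs under the `nvidia.com/gpu` resource name. The helper `gpu_training_pod`, the pod name, the image, and the command are all hypothetical placeholders.

```python
import json


def gpu_training_pod(name, image, command, num_gpus):
    """Build a Pod manifest (as a plain dict) that requests `num_gpus` GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",  # a training run should not restart in a loop
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "command": command,
                    # GPUs are requested via resource limits; the scheduler
                    # picks a node with enough free GPUs.
                    "resources": {"limits": {"nvidia.com/gpu": num_gpus}},
                }
            ],
        },
    }


# Hypothetical job: the image and script names are placeholders.
pod = gpu_training_pod(
    name="anne-train-job",
    image="example.com/ml-team/train:latest",
    command=["python", "train.py"],
    num_gpus=2,
)
print(json.dumps(pod, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`. Note that the engineer names only a GPU count, never a specific machine.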

You sit back and wait for team productivity to skyrocket. But in subsequent weeks, you see only a modest uptick in the number of models trained. Why?

Unfortunately, while the generic design of cluster management software makes it incredibly powerful, it also makes it blind to the unique properties of deep learning workloads. As a result, it lacks native support for many of your team’s crucial needs. These include distributed training, experiment tracking, metadata management, and integrated hyperparameter tuning.

In our next post, we’ll describe these needs in more detail and make the case that a traditional cluster manager alone is not a sufficient DL infrastructure solution. Then, we’ll outline how Determined provides the missing pieces needed to take your team to the next level.