September 24, 2020
Determined is an open-source deep learning training platform that makes building models fast and easy. We believe deep learning practitioners should have access to a great model development environment that lets them focus on deep learning, regardless of whether they are using on-demand cloud VMs, spot or preemptible instances, or an on-premise GPU cluster.
Today, we’re excited to announce that we’re expanding this vision by adding native support for Kubernetes to the Determined training platform!
Deep learning requires massive computational resources. As your deep learning efforts go beyond a single GPU, a cluster scheduler is essential because it allows you to utilize all of your GPUs on the deep learning tasks that are the most important at any given time. Realizing this, Determined has always included a container scheduler; moreover, Determined’s scheduler is designed to support the unique requirements of common deep learning workloads like distributed training and hyperparameter tuning.
When we set out to build Determined more than three years ago, we found that most teams applying deep learning were not using Kubernetes to manage their GPUs! While in some cases they had a Kubernetes cluster somewhere in their organization, their GPU resources were often managed separately — in part because support for GPUs in Kubernetes was still in alpha at the time. Adopting Kubernetes also requires dealing with a fair bit of complexity, which can be premature for small deep learning clusters with a handful of multi-GPU nodes. Based on those considerations, we built our own cluster scheduler. That scheduler is in active production use today, managing thousands of GPUs on AWS, GCP, and on-premise clusters.
Infrastructure for deep learning is increasingly moving from an R&D project to a core part of the production environment at many companies. Over time, GPU support in Kubernetes has matured and we’ve seen more interest from customers and community members about using Determined with Kubernetes-managed resources. For platform teams that are comfortable using Kubernetes to manage their infrastructure, there are significant wins from being able to manage and monitor deep learning workloads using the same tools they use for other compute tasks.
Determined on Kubernetes works by scheduling Determined workloads, such as model training and hyperparameter tuning jobs, as a collection of Kubernetes pods. That means that native Kubernetes tools for logging, metrics, and tracing will work as expected. Determined can be installed onto a Kubernetes cluster using Helm, and is compatible with managed Kubernetes offerings like Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS). Using GKE or EKS with Determined means you can use the Kubernetes cluster autoscaler to dynamically scale your compute resources as new deep learning jobs are submitted.
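To make the "collection of pods" idea concrete, here is a minimal sketch of how a single job's GPU request might be split across pods. The function name `pods_for_job` and the parameter `max_slots_per_pod` are hypothetical names chosen for this illustration, and the rule that a multi-pod job must divide evenly across pods is an assumption about how such a limit could be enforced, not a description of Determined's internal scheduler code.

```python
from typing import List


def pods_for_job(slots_requested: int, max_slots_per_pod: int) -> List[int]:
    """Return the number of GPUs assigned to each pod for one job.

    Assumption (illustrative only): a job that fits in one pod gets exactly
    one pod, and a larger job is split into equally sized pods.
    """
    if slots_requested < 0:
        raise ValueError("slots_requested must be non-negative")
    if max_slots_per_pod <= 0:
        raise ValueError("max_slots_per_pod must be positive")

    # Small jobs (including CPU-only jobs with zero GPUs) run in a single pod.
    if slots_requested <= max_slots_per_pod:
        return [slots_requested]

    # Larger jobs must divide evenly so that every pod is the same size.
    if slots_requested % max_slots_per_pod != 0:
        raise ValueError(
            "slots_requested must be a multiple of max_slots_per_pod "
            "when the job spans more than one pod"
        )
    return [max_slots_per_pod] * (slots_requested // max_slots_per_pod)


# A 16-GPU distributed training job on nodes with 8 GPUs each.
layout = pods_for_job(16, 8)
print(layout)
```

Because each entry in the result corresponds to an ordinary pod, standard tools such as kubectl and your existing logging stack see a distributed training job as nothing more exotic than a group of pods.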
All of Determined’s features work out of the box on Kubernetes (with a few caveats noted below). Just as important, using Determined with Kubernetes doesn’t mean imposing complexity on deep learning engineers: users who just want to develop DL models don’t need to worry about the details of the underlying infrastructure to use Determined on Kubernetes. With Determined, launching a new workload doesn’t require writing Kubernetes YAML files or running kubectl commands, and DL engineers can follow the progress and performance of their training tasks using Determined’s intuitive WebUI, rather than needing to track the status of a collection of pods or containers.
For those users who would like to customize the behavior of the pods used to run their deep learning workloads, a custom pod spec can be configured on either a per-cluster or per-task level. This feature makes it easy to support requirements like assigning tasks to specific nodes or attaching additional volume mounts. Custom pod specs can also be used to seamlessly run workloads on spot instances on AWS! Determined’s built-in fault tolerance capabilities ensure that any pods that are terminated abruptly will be restarted without disrupting the progress of any deep learning training jobs.
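As a rough illustration of what a per-task custom pod spec could look like, the sketch below builds a pod spec that pins a task to labeled nodes and attaches an extra volume. The pod fields (`nodeSelector`, `volumes`, `volumeMounts`) are standard Kubernetes; the `environment.pod_spec` location in the experiment configuration and the container name `determined-container` are assumptions here, so check the documentation for your Determined version. The helper `build_pod_spec`, the node label, and the paths are hypothetical.

```python
import json
from typing import Dict


def build_pod_spec(node_selector: Dict[str, str], host_path: str, mount_path: str) -> dict:
    """Build a Kubernetes pod spec with a node selector and one extra volume."""
    volume_name = "extra-data"  # hypothetical volume name
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            # Only schedule this task onto nodes carrying these labels.
            "nodeSelector": dict(node_selector),
            "containers": [
                {
                    # Assumed name of the container that runs the workload.
                    "name": "determined-container",
                    "volumeMounts": [
                        {"name": volume_name, "mountPath": mount_path}
                    ],
                }
            ],
            "volumes": [
                {"name": volume_name, "hostPath": {"path": host_path}}
            ],
        },
    }


# Hypothetical label and paths, for illustration only.
experiment_config_fragment = {
    "environment": {
        "pod_spec": build_pod_spec(
            node_selector={"gpu-type": "v100"},
            host_path="/data",
            mount_path="/mnt/data",
        )
    }
}

# JSON is valid YAML, so this output can be pasted into a YAML config file.
print(json.dumps(experiment_config_fragment, indent=2))
```

The same node selector approach is one way to steer tasks toward a node group backed by spot instances, provided those nodes carry a label you can select on.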
Determined is a deep learning training platform: we’re focused on making it simple to develop and train high-quality models in less time. Determined is also designed to integrate smoothly with best-of-breed tools for standard ML tasks such as Pachyderm for data preparation, Kubeflow Pipelines for ML workflows, and Seldon Core for model serving. If you’re already using Kubernetes to manage the other ML tools in your stack, Determined on Kubernetes enables your entire deep learning infrastructure to be managed as a unified whole, simplifying operations and increasing agility.
In a recent blog post, we showed how Determined can be used with Kubeflow Pipelines and Seldon, and the works-with-determined repository showcases many more examples of how Determined can be integrated with other popular open-source ML tools. If you’d like to see how Determined can work with your favorite ML tools, get in touch with us!
Support for Kubernetes is included in Determined 0.13.2 and is compatible with Kubernetes >= 1.15.
Although most of Determined’s functionality works seamlessly on Kubernetes, a few features are currently not supported. In particular, Determined on Kubernetes does not currently support the scheduling policies that are available when deploying Determined on VMs, including priority scheduling, fair sharing of resources across experiments, and gang-scheduling for distributed deep learning jobs. Determined relies on Kubernetes to handle pod scheduling and the default Kubernetes scheduler does not support these policies. We have plans to address this in the future!
Determined on Kubernetes can be installed via Helm; for more details, check out the installation instructions. For general information about using Determined with Kubernetes, check out the documentation.
We would love your feedback on Determined in general, and on our support for Kubernetes in particular! If you run into any issues or have suggestions, please file an issue on GitHub or join our Slack community.