November 29, 2022
Welcome to Determined!
If you’re already a Determined user, some of this information might be old news. We wrote this introductory guide for new users who haven’t yet taken advantage of Determined’s feature set in their machine learning workflow. Enjoy!
Determined is an open-source deep learning training platform that makes building models fast and easy. To get started, you’ll need to know some core concepts and learn how to port your model to the Determined platform using one of our APIs in order to take advantage of distributed training. This guide covers an abridged version of the core concepts, points to basic setup instructions and more detailed information on our website, and shows how to get started with our Trial APIs and Core API (docs).
Determined has a master and one or more agents*. Together these are termed a Determined cluster. The Determined agents run workloads (e.g. your model training experiment), while the master stores metadata, dispatches work to agents, and hosts the WebUI. Details can be found here.
*Determined has a master and one or more agents in the most common configurations. Installations on top of Kubernetes (k8s) or Slurm will look different.
An experiment represents the basic unit of running the model training code. When you submit a job to the Determined cluster, whether you are running multiple trials with different hyperparameter search values, or just a single trial with set values, you are running a Determined experiment.
To run experiments, you need to implement one of the Training APIs (covered below).
More details here.
To run any model on the Determined cluster as an experiment, you need a configuration file that specifies parameters such as hyperparameter search settings, the hyperparameters themselves, and distributed training settings. The file is written in YAML and is passed to the Determined cluster when you launch the workload via the Determined CLI. Examples and details specific to each type of API are covered in the Training API section.
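As a rough illustration, a single-trial configuration for a Trial API experiment might look like the sketch below. The field names follow Determined's experiment configuration reference as of late 2022; the experiment name, the hyperparameter values, the file name `const.yaml`, and the entrypoint `model_def:MyTrial` are placeholders invented for this example.

```python
from pathlib import Path

# Hypothetical experiment config; every value below is a placeholder.
CONFIG_YAML = """\
name: my_first_experiment
entrypoint: model_def:MyTrial   # module:class of your Trial definition
hyperparameters:
  global_batch_size: 64
  learning_rate: 0.001
searcher:
  name: single                  # one trial with fixed hyperparameters
  metric: validation_loss
  smaller_is_better: true
  max_length:
    batches: 1000
resources:
  slots_per_trial: 1            # raise this to train on several GPUs
"""


def write_config(path: str = "const.yaml") -> Path:
    """Write the sample config to disk and return its path."""
    target = Path(path)
    target.write_text(CONFIG_YAML)
    return target


if __name__ == "__main__":
    print(f"Wrote {write_config()}")
    # Then submit it from the directory that holds your model code:
    #   det experiment create const.yaml .
```

Switching the `searcher` block to a different search method is what turns the same experiment from a single trial into a hyperparameter search over many trials.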
More details here.
The Determined master can run either on your local computer or on a remote cluster (for example, on AWS or GCP). For a hackathon, if you don’t have access to a powerful on-prem hardware cluster (if you have only your laptop GPU, for example), you’d probably be best served by an AWS or GCP cluster, where training times would be much faster.
The command line client defaults to looking for the master on localhost, but can be configured to look elsewhere using the DET_MASTER environment variable. Details can be found here. (This page also includes more in-depth information on cluster setup, master configuration, and firewalls/proxies, if any of this applies to your system.)
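The lookup described above can be sketched in a few lines of Python. This illustrates the behavior and is not the CLI's actual source; the helper name `resolve_master` and the address `my-cluster.example.com:8080` are made up for the example, and the default of `localhost:8080` is assumed from the WebUI address used in this guide.

```python
import os

DEFAULT_MASTER = "localhost:8080"  # assumed default, matching the WebUI port


def resolve_master(environ=None) -> str:
    """Return DET_MASTER if it is set and non-empty, else the local default."""
    env = os.environ if environ is None else environ
    return env.get("DET_MASTER") or DEFAULT_MASTER


if __name__ == "__main__":
    print(resolve_master({}))
    print(resolve_master({"DET_MASTER": "my-cluster.example.com:8080"}))
```

In a shell, the equivalent is to export `DET_MASTER` before running `det` commands; the CLI also accepts the address per command through its `-m` flag.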
To access the WebUI on your local host, navigate to http://localhost:8080.
The Determined CLI is used to interact with the master. Installation instructions for the CLI can be found here.
Now on to the fun stuff!
We have several Trial APIs specific to PyTorch, Keras, and TensorFlow-based libraries (full list here), as well as our Core API, which works differently from a user perspective. Which API best suits your needs depends on your code configuration, how much flexibility you want, and what features you are looking for Determined to provide.
To use the Trial APIs, you convert your existing training code by subclassing a Trial class and implementing methods that expose the components of your model (e.g., model architecture, data loader, optimizer, learning rate scheduler, and callbacks) to the Determined master. Essentially, you restructure your code to fit a class template with a fixed set of methods.
This is called the Trial definition, and by structuring your code in this way, Determined is able to run the training loop and provide advanced training and model management capabilities. You no longer have control over the training loop, but metric reporting, checkpointing, etc. on the Determined cluster is seamless.
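The division of labor can be sketched without any framework. In the toy below, the method names `build_training_data_loader`, `build_validation_data_loader`, `train_batch`, and `evaluate_batch` mirror those of Determined's PyTorchTrial, but `ToyTrial`, `LineFitTrial`, and `run_trial` are stand-ins written for this example. They are not part of the Determined library, and the real harness does far more (checkpointing, metric reporting, distribution).

```python
class ToyTrial:
    """Stand-in for a Trial base class: subclasses fill in the pieces."""

    def __init__(self, hparams):
        # The real API passes a context object; a plain dict is used here.
        self.hparams = hparams

    def build_training_data_loader(self):
        raise NotImplementedError

    def build_validation_data_loader(self):
        raise NotImplementedError

    def train_batch(self, batch, epoch_idx, batch_idx):
        raise NotImplementedError

    def evaluate_batch(self, batch):
        raise NotImplementedError


class LineFitTrial(ToyTrial):
    """Fits y = w * x by gradient descent on mean squared error."""

    def __init__(self, hparams):
        super().__init__(hparams)
        self.w = 0.0

    def build_training_data_loader(self):
        # Two batches of (x, y) pairs drawn from y = 3x.
        return [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]

    def build_validation_data_loader(self):
        return [[(5.0, 15.0)]]

    def _loss(self, batch):
        return sum((self.w * x - y) ** 2 for x, y in batch) / len(batch)

    def train_batch(self, batch, epoch_idx, batch_idx):
        grad = sum(2 * (self.w * x - y) * x for x, y in batch) / len(batch)
        self.w -= self.hparams["learning_rate"] * grad
        return {"loss": self._loss(batch)}

    def evaluate_batch(self, batch):
        return {"validation_loss": self._loss(batch)}


def run_trial(trial, epochs):
    """Toy harness: it, not the user code, owns the training loop."""
    for epoch_idx in range(epochs):
        for batch_idx, batch in enumerate(trial.build_training_data_loader()):
            trial.train_batch(batch, epoch_idx, batch_idx)
    losses = [trial.evaluate_batch(batch)["validation_loss"]
              for batch in trial.build_validation_data_loader()]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    trial = LineFitTrial({"learning_rate": 0.01})
    print("validation loss:", run_trial(trial, epochs=50))
```

Because the user class only describes the pieces, the harness is free to decide when to train, validate, or stop, which is what lets a platform add features around the loop without changes to the model code.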
The best demo of the Trial APIs is here. This tutorial shows you how the Trial APIs work, what a YAML config file consists of, and how the experiment looks on the WebUI when successful.
A more detailed walkthrough for the Trial APIs is here, and a porting checklist to ensure your code does not conflict with the Determined library is here.
The Core API was developed to give users the flexibility to keep their custom training loops and train arbitrary models easily on Determined, removing the constraints posed by Trial definitions. The Trial APIs are implemented on top of the Core API: the core functionality of metrics tracking, checkpoint tracking, preemption support, hyperparameter search, and distributed training is provided via the Core API. Using the Core API directly gives you back direct control over all these features, but they have to be configured manually rather than being available out of the box. This guide covers how to get started with the Core API using a simple script. This blog post shows an example of porting a real-world research script to the Core API.
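In outline, a Core API script keeps its own loop and calls into a context object at the points where it wants the platform involved. The call names used below (`train.report_training_metrics`, `train.report_validation_metrics`, `preempt.should_preempt`) mirror Determined's Core API, where the context would come from `det.core.init()`. So that the sketch runs without a cluster, `StubContext` and its parts are local stand-ins invented for this example, and the "training" step is a placeholder computation.

```python
class _StubTrain:
    """Records metric reports instead of sending them to a master."""

    def __init__(self):
        self.reported = []

    def report_training_metrics(self, steps_completed, metrics):
        self.reported.append(("train", steps_completed, metrics))

    def report_validation_metrics(self, steps_completed, metrics):
        self.reported.append(("validation", steps_completed, metrics))


class _StubPreempt:
    """Signals preemption on the Nth check, or never if preempt_at is None."""

    def __init__(self, preempt_at=None):
        self.preempt_at = preempt_at
        self.checks = 0

    def should_preempt(self):
        self.checks += 1
        return self.preempt_at is not None and self.checks >= self.preempt_at


class StubContext:
    """Local stand-in for the object returned by det.core.init()."""

    def __init__(self, preempt_at=None):
        self.train = _StubTrain()
        self.preempt = _StubPreempt(preempt_at)


def train(core_context, max_steps=10, report_every=5):
    """User-owned training loop; returns the number of steps completed."""
    loss = 1.0
    for step in range(1, max_steps + 1):
        loss *= 0.9  # placeholder for one real optimization step
        if step % report_every == 0:
            core_context.train.report_training_metrics(
                steps_completed=step, metrics={"loss": loss}
            )
            if core_context.preempt.should_preempt():
                # A real script would save a checkpoint here before exiting.
                return step
    core_context.train.report_validation_metrics(
        steps_completed=max_steps, metrics={"loss": loss}
    )
    return max_steps


if __name__ == "__main__":
    context = StubContext()
    print("steps completed:", train(context))
    print("reports sent:", len(context.train.reported))
```

With the real library, the loop body stays the same and the stub is replaced by a `with det.core.init() as core_context:` block around the call to `train`.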
Hopefully you found this post informative! As you’re getting started with using Determined, please feel free to ask questions in the Community Slack channel.