6 Determined APIs Explained

By Kevin Musgrave

October 16, 2023

The Determined library has multiple APIs, each one suited for a different task or style of coding. In this blog post, we’ll provide a brief overview of 6 of these APIs, and when and why to use them.

Trial API

The Trial API provides a structured way to write training and evaluation loops. For example, here is the structure of a PyTorchTrial:

from typing import Any, Dict

from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext, TorchData

class MyTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context

    def build_training_data_loader(self) -> DataLoader:
        return DataLoader()

    def build_validation_data_loader(self) -> DataLoader:
        return DataLoader()

    def train_batch(self, batch: TorchData, epoch_idx: int, batch_idx: int) -> Dict[str, Any]:
        return {}

    def evaluate_batch(self, batch: TorchData) -> Dict[str, Any]:
        return {}

With this API, all of your training logic goes in train_batch, all of your evaluation logic goes in evaluate_batch, and so on. This can be a great way to organize your code, and it’s very handy if you’re using simple training algorithms, or just getting started with deep learning in general. Of course, it also gives you access to all of Determined’s features like distributed training and hyperparameter tuning.
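To make this division of labor concrete, here is a simplified, framework-free sketch of how a Trial-style harness might drive the methods you implement. The ToyTrial class and run_trial function are hypothetical stand-ins for illustration only. They are not Determined's implementation, which also takes care of things like distributed training and hyperparameter tuning.

```python
from typing import Any, Dict, List


class ToyTrial:
    """Hypothetical stand-in for a Trial subclass (not a Determined class)."""

    def __init__(self) -> None:
        self.weight = 0.0  # a single "parameter" to train

    def build_training_data(self) -> List[float]:
        return [1.0, 2.0, 3.0]

    def build_validation_data(self) -> List[float]:
        return [4.0, 5.0]

    def train_batch(self, batch: float, epoch_idx: int, batch_idx: int) -> Dict[str, Any]:
        # Toy update: move the weight halfway toward the batch value.
        self.weight += 0.5 * (batch - self.weight)
        return {"loss": abs(batch - self.weight)}

    def evaluate_batch(self, batch: float) -> Dict[str, Any]:
        return {"val_error": abs(batch - self.weight)}


def run_trial(trial: ToyTrial, epochs: int) -> Dict[str, float]:
    """The harness owns the loops; the trial only supplies per-batch logic."""
    for epoch_idx in range(epochs):
        for batch_idx, batch in enumerate(trial.build_training_data()):
            trial.train_batch(batch, epoch_idx, batch_idx)
    # Average the per-batch validation metrics.
    results = [trial.evaluate_batch(b) for b in trial.build_validation_data()]
    return {"val_error": sum(r["val_error"] for r in results) / len(results)}


metrics = run_trial(ToyTrial(), epochs=1)
```

The point of the sketch is that you never write the outer loops yourself: you fill in the per-batch methods and the harness decides when to call them.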

In addition to PyTorchTrial for PyTorch, Determined offers TFKerasTrial and DeepSpeedTrial for the Keras and DeepSpeed libraries, respectively.

Core API

But what if you have a complex training algorithm that doesn’t fit neatly into the structure of the Trial APIs? Or what if you want to use Determined with your existing codebase, with minimal changes?

This is the purpose of the Core API, which gives you low-level, granular functions. The Core API functions are accessed via the core context:

import determined as det

with det.core.init() as core_context:
    # use core_context in here
    pass

The core_context provides all the functions you need for:

Reporting metrics

# training metrics
core_context.train.report_training_metrics(
    steps_completed=steps_completed,
    metrics={"train_loss": train_loss},
)

# validation metrics
core_context.train.report_validation_metrics(
    steps_completed=steps_completed,
    metrics={"test_loss": test_loss},
)

Checkpointing

# saving checkpoint
metadata = {"steps_completed": steps_completed}
with core_context.checkpoint.store_path(metadata) as (path, storage_id):
    torch.save(model.state_dict(), path / "checkpoint.pt")

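# restoring checkpoint (latest_checkpoint is None when there is nothing to resume from)
latest_checkpoint = det.get_cluster_info().latest_checkpoint
if latest_checkpoint is not None:
    with core_context.checkpoint.restore_path(latest_checkpoint) as path:
        with pathlib.Path(path).joinpath("checkpoint.pt").open("rb") as f:
            # the file holds a state_dict, so load it into the existing model
            model.load_state_dict(torch.load(f))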
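Saving steps_completed alongside the weights matters because a resumed job usually needs to continue from that step rather than start over. The following is a minimal sketch of that resume pattern under stated assumptions: it uses plain JSON files in a temporary directory instead of Determined's checkpoint storage, and save_state, load_state, and train are hypothetical helper names. In real Core API code, the directory would be the path provided by the checkpoint context.

```python
import json
import pathlib
import tempfile
from typing import Optional, Tuple


def save_state(directory: pathlib.Path, steps_completed: int, weight: float) -> None:
    """Write the training state to a file inside the checkpoint directory."""
    with (directory / "state.json").open("w") as f:
        json.dump({"steps_completed": steps_completed, "weight": weight}, f)


def load_state(directory: pathlib.Path) -> Tuple[int, float]:
    """Return (steps_completed, weight), or a fresh state if nothing was saved."""
    state_file = directory / "state.json"
    if not state_file.exists():
        return 0, 0.0
    with state_file.open() as f:
        state = json.load(f)
    return state["steps_completed"], state["weight"]


def train(
    directory: pathlib.Path, total_steps: int, stop_after: Optional[int] = None
) -> Tuple[int, float]:
    """Resume from any saved state and train until total_steps (or an early stop)."""
    steps_completed, weight = load_state(directory)
    for step in range(steps_completed, total_steps):
        weight += 1.0  # toy "training step"
        steps_completed = step + 1
        save_state(directory, steps_completed, weight)
        if stop_after is not None and steps_completed >= stop_after:
            break  # simulate the job being paused or preempted
    return steps_completed, weight


with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = pathlib.Path(tmp)
    first_run = train(ckpt_dir, total_steps=5, stop_after=2)  # interrupted early
    second_run = train(ckpt_dir, total_steps=5)  # picks up where it left off
```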

Hyperparameter Tuning

# get a dictionary of the current trial's hyperparameters
hparams = det.get_cluster_info().trial.hparams

# for example, optimizer learning rate as a hyperparameter
optimizer = optim.Adadelta(model.parameters(), lr=hparams["lr"])

Distributed Training

# requires torch.distributed to be initialized first (e.g. via dist.init_process_group)
distributed = det.core.DistributedContext.from_torch_distributed()
with det.core.init(distributed=distributed) as core_context:
    # use core_context in here
    pass

The Core API gives you the freedom to insert Determined functionality into your code where you see fit.
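As an illustration of what "where you see fit" can look like, here is a small sketch of an existing training loop with a single reporting call added every few steps. RecordingTrainContext is a hypothetical stand-in that only mimics the keyword arguments of report_training_metrics shown earlier, so the example runs without a cluster. The report_every parameter is likewise an assumption of this sketch, not a Determined setting.

```python
from typing import Dict, List, Tuple


class RecordingTrainContext:
    """Hypothetical stand-in that records calls instead of contacting a cluster."""

    def __init__(self) -> None:
        self.reported: List[Tuple[int, Dict[str, float]]] = []

    def report_training_metrics(self, steps_completed: int, metrics: Dict[str, float]) -> None:
        self.reported.append((steps_completed, metrics))


def train_loop(train_context: RecordingTrainContext, num_steps: int, report_every: int) -> None:
    """An existing loop with one added call: report every `report_every` steps."""
    for step in range(num_steps):
        train_loss = 1.0 / (step + 1)  # toy loss that shrinks over time
        steps_completed = step + 1
        if steps_completed % report_every == 0:
            train_context.report_training_metrics(
                steps_completed=steps_completed,
                metrics={"train_loss": train_loss},
            )


recorder = RecordingTrainContext()
train_loop(recorder, num_steps=10, report_every=5)
```

In a real script, the loop would receive core_context.train in place of the recorder, and the rest of the loop would stay unchanged.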

PyTorch Trainer

You usually want to debug and iterate on your code locally before sending it to a GPU cluster. The PyTorch Trainer class is built for this purpose:

with det.pytorch.init(hparams=hparams, exp_conf=exp_conf) as train_context:
    trial = MyTrial(train_context)
    trainer = det.pytorch.Trainer(trial, train_context)
    trainer.fit(max_length=det.pytorch.Epoch(1))

What’s great about this approach is that you can debug your code locally (python train.py), then run it on your Determined cluster (det experiment create), without making any changes.
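One way a single script could choose its hyperparameters in both settings is sketched below. It assumes that det.get_cluster_info() returns None when the script runs outside a Determined cluster. The resolve_hparams helper is a hypothetical name, and the cluster info object is simulated with SimpleNamespace so the example runs anywhere.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def resolve_hparams(info: Optional[Any], local_defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Pick hyperparameters based on where the script is running.

    `info` is meant to be the result of det.get_cluster_info(); this sketch
    assumes it is None when the script runs outside a Determined cluster.
    """
    if info is None:
        # Local run: fall back to values hard-coded for debugging.
        return dict(local_defaults)
    # Cluster run: use the hyperparameters assigned to this trial.
    return dict(info.trial.hparams)


# Simulated inputs so the sketch runs without a cluster.
local_hparams = resolve_hparams(None, {"lr": 0.1})
fake_cluster_info = SimpleNamespace(trial=SimpleNamespace(hparams={"lr": 0.01}))
cluster_hparams = resolve_hparams(fake_cluster_info, {"lr": 0.1})
```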

Batch Processing API

Want to efficiently evaluate your model or generate LLM embeddings for retrieval augmented generation? Use the Batch Processing API! This API takes care of:

distributing inference across multiple GPUs
pausing and resuming inference

Plus, you can report any inference-related metrics and view them in the Determined Web UI.

To use the Batch Processing API, first extend the TorchBatchProcessor class and implement the process_batch function:

from determined.pytorch import experimental

class EmbeddingProcessor(experimental.TorchBatchProcessor):
    def __init__(self, context):
        self.context = context
        self.model = context.prepare_model_for_inference(get_model())

    def process_batch(self, batch, batch_idx) -> None:
        predictions = self.model(batch)

Then pass your class, dataset, and other arguments to the torch_batch_process function:

experimental.torch_batch_process(
    EmbeddingProcessor,
    dataset,
    batch_size=64,
    checkpoint_interval=10,
)
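To give a feel for what distributing and resuming inference involves, here is a conceptual sketch in plain Python. It assumes a simple round-robin split of batches across workers and a resume point at the last completed checkpoint interval. The shard and resume_point functions are hypothetical, and this is not necessarily how Determined schedules batches internally.

```python
from typing import List


def shard(num_batches: int, rank: int, num_workers: int) -> List[int]:
    """Round-robin assignment: worker `rank` takes every num_workers-th batch."""
    return list(range(rank, num_batches, num_workers))


def resume_point(batches_processed: int, checkpoint_interval: int) -> int:
    """Number of this worker's batches covered by its most recent checkpoint."""
    return (batches_processed // checkpoint_interval) * checkpoint_interval


# Worker 1 of 4 is responsible for 25 of the 100 batches.
worker_batches = shard(num_batches=100, rank=1, num_workers=4)

# Suppose the worker was interrupted after 17 batches, checkpointing every 10.
# It resumes from its 10th batch and redoes the 7 that were not checkpointed.
start = resume_point(batches_processed=17, checkpoint_interval=10)
remaining = worker_batches[start:]
```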

Detached Mode

If you only need a metrics-reporting solution, or just want to get familiar with Determined’s experiment tracking and visualization features, then Detached Mode is for you. Detached Mode lets you report metrics and visualize them in the Web UI, without using Determined to manage your training job.

Detached Mode uses functions from a new version of the Core API, so its functions look quite similar to the Core API examples shown earlier. For example, here is how metrics are reported:

from determined.experimental import core_v2

# initialize
core_v2.init(
    defaults=core_v2.DefaultConfig(
        name="detached_mode_example",
    ),
)

# report metrics
core_v2.train.report_validation_metrics(
    steps_completed=steps_completed, metrics={"loss": loss}
)

Python SDK

Do you want to programmatically manage your experiments? Do you like Python?

Then you’ll love the Python SDK, which allows you to create and organize experiments, download model checkpoints, retrieve trial metrics, and more. The SDK has many of the same capabilities as the Determined CLI, but you get to write your logic in Python.

Here are some snippets to give you an idea of what the SDK looks like:

from determined.experimental import client

# create a workspace
client.create_workspace(workspace_name)

# create an experiment
exp = client.create_experiment(config=config, model_dir=model_dir)

# retrieve checkpoint metadata
checkpoint = exp.list_checkpoints(
    max_results=1,
    sort_by=client.CheckpointSortBy.SEARCHER_METRIC,
    order_by=client.OrderBy.DESCENDING,
)[0]

# download checkpoint
checkpoint.download()

# and much more...

Summary

Here’s a summary of the 6 APIs, with links to the documentation:

API	Key Use Cases
Trial API	Structured code for simple training loops. Easy to start with.
Core API	Granular control for advanced training algorithms. Integrates well into existing codebases.
PyTorch Trainer	Local debugging and iteration.
Batch Processing API	Efficient inference.
Detached Mode	For logging and visualizing metrics only.
Python SDK	Experiment creation, organization, and retrieval, using Python.

In addition to our documentation, you can read our more in-depth blog posts.

If you have any questions, please ask us on GitHub or in our Slack community!