Easier NLP training with Metaflow and Determined

By Hoang Phan, Savin Goyal

May 13, 2021

Introduction

In this blog post, we’ll discuss the benefits of using Metaflow and Determined to manage the workflow and training of your deep learning models.

Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. Currently, Metaflow supports distributed workloads with AWS Batch, which works well for general-purpose tasks but is not optimized for training deep learning models. To fill this gap, today we’d like to introduce another option for developing deep learning models: training with Determined and Metaflow.

Determined and Metaflow

Determined is an open-source deep learning training platform that aims to simplify the somewhat complex infrastructure required to run deep learning training so users can focus on building models. Combined with Metaflow, Determined can be used as part of your end-to-end data science project, from data pre-processing to running inference, all managed under Metaflow’s intuitive workflow.

While you can use Determined and Metaflow individually for their respective capabilities, there are a few key benefits to launching a Determined experiment through a Metaflow workflow, including:

Integration with other services: Metaflow integrates with a host of other services, meaning you can run parallel data processing on AWS Batch, execute arbitrary Python code in a script, and train your model in Determined, all in the same Metaflow workflow without needing to set up your own separate integrations for each service.
Cohesive control flow for your data science pipeline: Since Metaflow manages logical control flow in your pipeline, you don’t need to write additional code around your model training. Deciding whether to promote your most recently trained model can be embedded as a step in your flow that simply checks the returned validation metric from Determined against a preset value and promotes the model if the metric is higher.
Experiment tracking for all your projects: Metaflow can centrally manage the logging, variables, and execution parameters across different services in your pipeline. This means that the output data stored in object storage by one step can be used as the input data for model training in Determined, which can then produce a unique checkpoint ID that can be accessed downstream in a notebook running Metaflow.

Use case

Let’s look at a real-world example where using Metaflow and Determined could make you more productive on your data science project.

You’re working on a deep learning project to fine-tune ALBERT, a language representation model that can be used in various Natural Language Processing (NLP) tasks, on a custom dataset. To start off, you’re considering training locally, but would like the option to eventually run multi-GPU training later since you know training this model will take a large amount of time.

One way to train ALBERT is to start with the official repository. However, it’s not clear how to transition from there to multi-GPU or eventually multi-node training. Another option is to train ALBERT on Determined, where you can train locally first and later expand to distributed training with just a single configuration change (we’ll get to that later).

In this scenario, we’ll go the Determined route, with Metaflow as our overall workflow management, which gives us some nice benefits that we’ll cover as we go. The ALBERT model we’ll be using is based on this blog post (which covers training ALBERT faster and cheaper via distributed training and spot instances).

Setup

First, set up the environment to run the example:

1) Create and activate a Python 3 virtual environment:

python3 -m venv determined
source determined/bin/activate

2) Clone the Works With Determined repo and install the requirements:

git clone https://github.com/determined-ai/works-with-determined.git
cd works-with-determined/metaflow
pip install -r requirements.txt

3) Next, let’s start Determined locally:

det deploy local cluster-up # optional `--no-gpu` flag to only use CPUs

More detailed installation instructions can be found here.

Once Determined is started, you can go to localhost:8080 to view the Determined UI and confirm it’s running on your local machine.

And you can run the Metaflow Flow with:

python example-determined.py run --det-master=localhost:8080

Let’s look at what this Flow is doing:

example-determined.py contains the Flow and can be executed from the command line with Python. It also takes arguments that configure the rest of the Flow. Here, --det-master=localhost:8080 points training at your local Determined installation (localhost:8080 is also the default).
Within the Flow, we start an experiment (Determined lingo for a workload) that uses the code in the ALBERT subdirectory. Normally you would do this from the command line with the command below; here, the Flow runs it for you as a Python subprocess:

det experiment create <config-file.yaml> <model-dir/>

Metaflow then waits for the Determined experiment to complete, and reports back the best metric from the training via the log output. If you cancel the process during this time, Metaflow will cancel the experiment training on Determined as well.
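The launch step described above can be sketched roughly as follows. This is a minimal illustration rather than the code from the repository: the helper names build_create_command and launch_experiment are hypothetical, and the file and directory names in the usage line are placeholders. It assumes the det CLI is on your PATH and uses its -m flag to select the master.

```python
import subprocess
from typing import List


def build_create_command(master: str, config_file: str, model_dir: str) -> List[str]:
    """Assemble the `det experiment create` invocation as an argument list."""
    return ["det", "-m", master, "experiment", "create", config_file, model_dir]


def launch_experiment(master: str, config_file: str, model_dir: str) -> str:
    """Run the CLI as a subprocess and return whatever it printed to stdout.

    Raises subprocess.CalledProcessError if the CLI exits with a non-zero status.
    """
    cmd = build_create_command(master, config_file, model_dir)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Placeholder file and directory names, for illustration only.
print(build_create_command("localhost:8080", "local.yaml", "model-dir/"))
```

Passing the command as a list (instead of a shell string) avoids quoting problems when paths contain spaces.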

Benefit: During this whole process, Metaflow is tracking logs and key variables that can be retrieved later for downstream tasks (e.g. inference).

Depending on the metric, Metaflow decides whether to label the run as “pass” or “fail”, which influences whether we use the checkpoints for the model during inference.

Benefit: You can easily embed condition logic in your Flow, which allows you to control when certain tasks are executed as part of your pipeline.
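As a rough sketch of the kind of check such a step might perform, the function below compares a reported metric against a preset threshold. The function name label_run, the threshold, and the metric values are all hypothetical; the actual Flow in the repository may decide differently.

```python
def label_run(best_metric: float, threshold: float, higher_is_better: bool = True) -> str:
    """Return "pass" when the metric beats the threshold, otherwise "fail"."""
    if higher_is_better:
        passed = best_metric > threshold
    else:
        passed = best_metric < threshold
    return "pass" if passed else "fail"


# Hypothetical values: a score-like metric compared against a preset bar.
status = label_run(best_metric=0.87, threshold=0.80)
print(status)
```

Inside a Flow, the returned label could be stored on self so that downstream steps, or a notebook, can decide whether to use the checkpoint.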

The training will take about 10 minutes on CPU and 3 minutes on GPU, so this is a good time for a quick coffee break!

Local Inference via Jupyter Notebook

With the model trained and the Flow complete, we can now see how well the model performs. Let’s start Jupyter and open the notebook in the repository (local-inference.ipynb).

jupyter notebook

In the notebook, we first pull the latest successful run from Metaflow using the Metaflow Client API. From this run we retrieve the stored variables (det_master, experiment_id) that identify the experiment on Determined.

Benefit: Metaflow provides access to key variables throughout your Flow, regardless of which environment you’re accessing it in, allowing you more universal management across your stack.

master_url = run.data.det_master
experiment_id = run.data.experiment_id
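For context, here is one way the run object could be obtained and unpacked. This is a sketch under assumptions: the helper names are hypothetical, the flow name you pass in must match the Flow class in example-determined.py (not shown here), and a stand-in object replaces a real Metaflow run so the snippet works without a completed Flow.

```python
from types import SimpleNamespace
from typing import Tuple


def latest_successful_run(flow_name: str):
    """Look up the most recent successful run of a flow via the Metaflow Client API."""
    from metaflow import Flow  # imported lazily so the helper below works without Metaflow

    return Flow(flow_name).latest_successful_run


def experiment_handle(run) -> Tuple[str, int]:
    """Extract the Determined master URL and experiment ID stored on a run."""
    return run.data.det_master, run.data.experiment_id


# Stand-in with the same attributes as a Metaflow run, for illustration only.
fake_run = SimpleNamespace(data=SimpleNamespace(det_master="localhost:8080", experiment_id=1))
print(experiment_handle(fake_run))
```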

Using the Determined Checkpoint API, we load the best checkpoint from our training…

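import torch
from determined.experimental import Determined

# Load the best checkpoint from the experiment onto the CPU.
checkpoint = Determined(master=master_url) \
    .get_experiment(experiment_id) \
    .top_checkpoint()
model = checkpoint.load(map_location=torch.device('cpu'))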

…and use it to run inference against a sample Question and Answer set. The results are promising!

Q: "How many people live in New Zealand?"

A: “4.9 million”

Of course, model training is non-deterministic and the specific results will vary. Try increasing searcher.max_length.records in the .yaml configuration file if you’re not getting good results on the first run!

Scaling up: Multi-{GPU/Node} Distributed Training

Now that you’re seeing some good results from training, you’re ready to scale up to train on a larger dataset, and ideally in a reasonable amount of time. Typically this is done by distributing the dataset across multiple GPUs (and potentially machines), and training with a portion of the data on each GPU – a paradigm called data parallel distributed training.
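To make the idea concrete, the toy sketch below splits a dataset across workers with a simple strided scheme. It is only a conceptual illustration: the function name shard is hypothetical, and this is not how Determined implements sharding internally.

```python
from typing import List, Sequence


def shard(dataset: Sequence, num_workers: int, rank: int) -> List:
    """Return the slice of the dataset that worker `rank` trains on (strided split)."""
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be in [0, num_workers)")
    return list(dataset[rank::num_workers])


records = list(range(10))
shards = [shard(records, num_workers=4, rank=r) for r in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Each worker sees a disjoint portion of the data, and together the portions cover the whole dataset.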

From the configuration side, going from training on 1 GPU locally, to 8 GPUs (or, really, any number of GPUs) is as simple as a one-line configuration change in the Determined configuration file:

slots_per_trial: 8
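If you generate configurations programmatically, the same change can be sketched in Python. The example assumes the setting lives under the resources section of the experiment configuration, and the trimmed-down config dictionary and the helper name with_slots are hypothetical.

```python
import copy
from typing import Any, Dict


def with_slots(config: Dict[str, Any], slots: int) -> Dict[str, Any]:
    """Return a copy of an experiment config with resources.slots_per_trial set."""
    updated = copy.deepcopy(config)
    updated.setdefault("resources", {})["slots_per_trial"] = slots
    return updated


# Hypothetical, heavily trimmed config; a real one has many more fields.
local_config = {"resources": {"slots_per_trial": 1}}
distributed_config = with_slots(local_config, 8)
print(distributed_config)
```

Working on a deep copy keeps the local configuration untouched, so both variants can be used side by side.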

You probably don’t have an 8 GPU Determined cluster up and running this exact moment, but deploying one is straightforward on AWS or GCP…

det deploy <aws/gcp> up --cluster-id <your-cluster-name>

…with additional flags depending on which cloud and which configurations you want. Check out the full details for AWS and GCP.

Cloud-based GPU clusters can get expensive. As an alternative, you can run Determined on an on-premise GPU cluster. You can also consider a cost-effective hybrid model in which general-purpose compute runs in the cloud (e.g. using Metaflow’s @batch decorator) while model training happens on an on-premise GPU cluster that can be shared by multiple teams.

The nice thing is you can use the same process to train your model on this new cluster (or any Determined cluster) as you did locally, simply by directing the workload to the IP of the cluster.

python example-determined.py run --det-master <master_url> --config-file distributed.yaml

You’ll notice we are now using a distributed.yaml configuration, which has updated the slots_per_trial argument to 8. If you’re curious, feel free to diff the local.yaml and distributed.yaml files in this directory – you’ll notice the changes are all hyperparameters (the science) with slots_per_trial being the main infrastructure change.

And that’s it! You’ve gone from fine-tuning ALBERT locally to executing data parallel distributed training on a remote cluster in a few configuration changes and no model code change!

Additional Advanced Functionality:

There’s a ton of other things you can do with Determined, but as a final note we’ll highlight a few we think are really cool:

Hyperparameter Search: you can use state-of-the-art hyperparameter search algorithms on Determined simply by changing the hyperparameters in your configuration file from constant values to ranges. Try using --config-file hyper.yaml to see it in action.
Spot / Preemptible Instances: on AWS and GCP, you can set up your Determined cluster with Resource Pools, which can leverage significantly less expensive spot / preemptible instances. In fact, the ALBERT example used in this post was written to show just that (see the original blog post here for more details). Since every experiment that runs on Determined is fault-tolerant, you won’t lose your training progress if your instances get preempted; training simply continues where it left off once spot / preemptible instances are available again.
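To illustrate the "constant values to ranges" idea from the hyperparameter search item, here is a small sketch in dictionary form. The field names (type, minval, maxval) are modeled on Determined's hyperparameter configuration, but the helper name to_range, the hyperparameter name, and the values are hypothetical and are not taken from hyper.yaml.

```python
from typing import Any, Dict


def to_range(minval: float, maxval: float) -> Dict[str, Any]:
    """Describe a continuous hyperparameter range in dict form."""
    if minval >= maxval:
        raise ValueError("minval must be smaller than maxval")
    return {"type": "double", "minval": minval, "maxval": maxval}


# Hypothetical hyperparameters: a fixed learning rate ...
constant_hparams = {"learning_rate": 5e-5}
# ... versus the same hyperparameter expressed as a range to search over.
search_hparams = {"learning_rate": to_range(1e-5, 1e-4)}
print(search_hparams)
```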

Conclusion

We’ve shown that Metaflow and Determined can be used to great effect by leveraging Metaflow to orchestrate the overall data science workflow and using Determined for the parts that require training of large-scale models.

Below are links to the repository for this example and how you can keep in touch with the Metaflow and Determined teams:

GitHub: Works With Determined

If you have any trouble following the example shown above or you have questions about Metaflow or Determined, don’t hesitate to reach out to us on the following channels:
