May 13, 2021
In this blog post, we’ll discuss the benefits of using Metaflow and Determined to manage the workflow and training of your deep learning models.
Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. Currently, Metaflow supports distributed workloads with AWS Batch, which works well for general-purpose tasks but is not optimized for training deep learning models. To fill this gap, today we’d like to introduce another option for developing deep learning models: training with Determined and Metaflow.
Determined is an open-source deep learning training platform that aims to simplify the somewhat complex infrastructure required to run deep learning training so users can focus on building models. Combined with Metaflow, Determined can be used as part of your end-to-end data science project, from data pre-processing to running inference, all managed under Metaflow’s intuitive workflow.
While you can use Determined and Metaflow individually for their respective capabilities, there are a few key benefits to launching a Determined experiment through a Metaflow workflow, which we’ll highlight as we walk through the example below.
Let’s look at a real-world example where using Metaflow and Determined could make you more productive on your data science project.
You’re working on a deep learning project to fine-tune ALBERT, a language representation model that can be used in various Natural Language Processing (NLP) tasks, on a custom dataset. To start off, you’re considering training locally, but would like the option to move to multi-GPU training later, since you know training this model will take a long time.
One way you can train ALBERT is to start with the official repository - however, it’s not clear how to transition to multi-GPU or eventually multi-node training. Another option is to train ALBERT on Determined - where you can train locally first and later expand to distributed training with just a single configuration change (we’ll get to that later).
In this scenario, we’ll go the Determined route, with Metaflow as our overall workflow management, which gives us some nice benefits that we’ll cover as we go. The ALBERT model we’ll be using is based on this blog post (which covers training ALBERT faster and cheaper via distributed training and spot instances).
First, set up the environment to run the example:
1) Create and activate Python 3 virtual environment:
python3 -m venv determined
source determined/bin/activate
2) Clone the Determined repo:
git clone https://github.com/determined-ai/works-with-determined.git
cd works-with-determined/metaflow
pip install -r requirements.txt
3) Next, let’s start Determined locally:
det deploy local cluster-up # optional `--no-gpu` flag to only use CPUs
More detailed installation instructions can be found here.
Once Determined is started, you can go to localhost:8080 to view the Determined UI and confirm it’s running on your local machine.
And you can run the Metaflow Flow with:
python example-determined.py run --det-master=localhost:8080
example-determined.py contains the Flow and can be executed from the command line with Python. It can also take arguments that configure the rest of the Flow; in this case, --det-master=localhost:8080 points your training at your local Determined installation.
Within the Flow, we are starting an experiment (Determined lingo for a workload) which uses the code in the ALBERT subdirectory. The standard way to do this is the command below, but the Flow executes it for you as a Python subprocess:
det experiment create <config-file.yaml> <model-dir/>
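As a rough sketch of what that subprocess launch looks like (the helper names here are illustrative, not the example repository's actual code; the `det` CLI's `-m` flag selects the target master):

```python
import subprocess

def build_create_cmd(det_master, config_file, model_dir):
    # Mirrors `det experiment create <config-file.yaml> <model-dir/>`,
    # with the target master supplied via the CLI's -m flag.
    return ["det", "-m", det_master, "experiment", "create", config_file, model_dir]

def create_experiment(det_master, config_file, model_dir):
    # Hypothetical helper: the Flow launches the CLI as a subprocess like this,
    # raising if the CLI exits non-zero.
    subprocess.run(build_create_cmd(det_master, config_file, model_dir), check=True)
```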
Metaflow then waits for the Determined experiment to complete, and reports back the best metric from the training via the log output. If you cancel the process during this time, Metaflow will cancel the experiment training on Determined as well.
Benefit: During this whole process, Metaflow is tracking logs and key variables that can be retrieved later for downstream tasks (e.g. inference).
Depending on the metric, Metaflow decides whether to label the run as “pass” or “fail”, which influences whether we use the checkpoints for the model during inference.
Benefit: You can easily embed conditional logic in your Flow, which lets you control when certain tasks are executed as part of your pipeline.
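The pass/fail gate itself boils down to a simple predicate inside a Flow step. A minimal sketch (the threshold value and metric direction here are illustrative assumptions, not the example's actual values):

```python
def label_run(best_metric, threshold=0.8):
    # Hypothetical gate: mark the run "pass" only if the best validation
    # metric clears the threshold. Downstream steps (e.g. inference) would
    # check this label before using the run's checkpoints.
    return "pass" if best_metric >= threshold else "fail"
```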
The training will take about 10 minutes on CPU and 3 minutes on GPU - so this is a good time for a quick coffee break!
With the model trained and the Flow complete, we can now see how well the model performs. Let’s start Jupyter and open the notebook in the repository.
In the notebook, we first pull the latest successful run from Metaflow using the Metaflow Client API, and from this run we can retrieve embedded variables (e.g. experiment_id) that will be used to identify and retrieve the experiment on Determined.
Benefit: Metaflow provides access to key variables throughout your Flow, regardless of which environment you’re accessing it in, allowing you more universal management across your stack.
master_url = run.data.det_master
experiment_id = run.data.experiment_id
Using the Determined Checkpoint API, we load the best checkpoint from our training…
from determined.experimental import Determined
import torch

checkpoint = Determined(master=master_url) \
    .get_experiment(experiment_id) \
    .top_checkpoint()
model = checkpoint.load(map_location=torch.device('cpu'))
…and use it to run inference against a sample Question and Answer set. The results are promising!
Q: "How many people live in New Zealand?"
A: "4.9 million"
Of course, model training is non-deterministic and the specific results will vary. Try increasing searcher.max_length.records in the .yaml configuration file if you’re not getting good results on the first run!
Now that you’re seeing some good results from training, you’re ready to scale up to train on a larger dataset, and ideally in a reasonable amount of time. Typically this is done by distributing the dataset across multiple GPUs (and potentially machines), and training with a portion of the data on each GPU – a paradigm called data parallel distributed training.
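To make the data-parallel idea concrete: each GPU (a "slot" in Determined terms) processes an equal shard of every global batch, and gradients are averaged across slots so the effective batch size is unchanged. A toy sketch of the arithmetic:

```python
def per_slot_batch_size(global_batch_size, slots_per_trial):
    # In data-parallel training, each slot (GPU) sees 1/N of every batch.
    # Determined handles the sharding and gradient averaging for you; this
    # function only illustrates the split.
    if global_batch_size % slots_per_trial != 0:
        raise ValueError("global batch size must divide evenly across slots")
    return global_batch_size // slots_per_trial
```

For example, a global batch of 32 spread over 8 slots gives each GPU 4 records per step.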
From the configuration side, going from training on 1 GPU locally to 8 GPUs (or, really, any number of GPUs) is as simple as a one-line change in the Determined experiment configuration file.
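The line in question is the resources.slots_per_trial field of the experiment config. Based on the distributed.yaml change described below, it looks roughly like this:

```yaml
resources:
  slots_per_trial: 8  # 1 for local single-GPU training, 8 for distributed
```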
You probably don’t have an 8 GPU Determined cluster up and running this exact moment, but deploying one is straightforward on AWS or GCP…
det deploy <aws/gcp> up --cluster-id <your-cluster-name>
As an alternative to cloud-based GPU clusters, which can get expensive, you can run Determined on an on-premise GPU cluster. You can also consider a cost-effective hybrid model where general-purpose compute runs in the cloud, e.g. using Metaflow’s @batch decorator, while model training happens on an on-premise GPU cluster shared by multiple teams.
The nice thing is you can use the same process to train your model on this new cluster (or any Determined cluster) as you did locally, simply by directing the workload to the IP of the cluster.
python example-determined.py run --det-master <master_url> --config-file distributed.yaml
You’ll notice we are now using a distributed.yaml configuration, which updates the slots_per_trial argument to 8. If you’re curious, feel free to diff the .yaml files in this directory – you’ll notice the changes are all hyperparameters (the science), with slots_per_trial being the main infrastructure change.
And that’s it! You’ve gone from fine-tuning ALBERT locally to executing data parallel distributed training on a remote cluster in a few configuration changes and no model code change!
There’s a ton of other things you can do with Determined, but as a final note we’ll highlight a few we think are really cool:
For example, Determined has built-in hyperparameter search; try running the Flow with --config-file hyper.yaml to see it in action.
We’ve shown that Metaflow and Determined can be used to great effect by leveraging Metaflow to orchestrate the overall data science workflow and using Determined for the parts that require training of large-scale models.
Below are links to the repository for this example and how you can keep in touch with the Metaflow and Determined teams:
If you have any trouble following the example shown above or you have questions about Metaflow or Determined, don’t hesitate to reach out to us on the following channels: