November 10, 2020
Kubeflow is a “machine learning toolkit for Kubernetes.” As a toolkit, Kubeflow is composed of many independently developed open source tools for tasks including training models, model serving, notebook hosting, workflow coordination, and more. Because some of these tasks overlap with Determined’s capabilities as a deep learning training platform — and since both Determined and Kubeflow can run natively on Kubernetes — we are often asked how Determined compares with Kubeflow, and which tool is the best choice for different parts of the ML workflow.
Determined is focused on model training and development. Determined’s features for training deep learning models, hyperparameter tuning, and experiment management are easier to use and offer better performance than the equivalent Kubeflow components.
For other tasks such as model serving and ML pipelines, Determined integrates seamlessly with Kubeflow components that support these activities. That means that using Determined and Kubeflow together is often a great approach!
In this post, we examine the situation in more detail. We start by comparing Kubeflow and Determined for the common task of training deep learning models, and then we turn to how each system can be used to implement end-to-end ML workflows.
Kubeflow can be confusing to navigate for the uninitiated. When you install Kubeflow, what you are actually installing is a collection of loosely coupled components. The diagram below describes some of the components in Kubeflow.
Most of these components were developed independently, employ varying design philosophies, and differ in both maturity and quality. Some are built to be used by ML Engineers getting models into production (e.g., Seldon Core, Kubeflow Pipelines), while others are built to be used by data scientists directly (e.g., Feast, Notebooks, TFJob). If you are new to Kubeflow, we wouldn’t fault you for having a hard time sorting all of these pieces out and figuring out what to use.
Determined’s main focus is model training and development, providing tools to accelerate, improve, and track the training process. As such, it makes sense to first compare Determined with the model training components of Kubeflow. We can then look at how Determined fits into the broader Kubeflow ecosystem.
To understand how Determined fits in, let’s start by comparing Determined and Kubeflow’s model training functionality on a few key capabilities: distributed training, hyperparameter tuning, and experiment tracking.
Determined has equal or higher performance in distributed training, while being much easier for data scientists to configure and use.
Relevant Kubeflow Components: TFJob, PyTorchJob, MPIJob
|Performance: Determined implements best-in-class distributed training algorithms for you. We have made open-source contributions to Horovod that further improve its performance.||Performance: Depends on which component you use: best case you can use Horovod with MPIJob (which will include our contributions), but you could also use less efficient implementations in TFJob or PyTorchJob.|
|Ease of Use: The number of GPUs to use is configured with a single config setting, and no changes to your model code are required to use distributed training. No Kubernetes understanding is required to enable distributed training. Efficient fault-tolerance is provided automatically.||Ease of Use: Depending on which component you use, you will need to outfit your code with framework-specific distributed training code. You’ll then need to write a Kubernetes spec for training, which requires a good deal of understanding of Kubernetes. Model developers need to write custom code to ensure fault-tolerant training.|
Determined is built by world-experts in hyperparameter tuning, and features highly optimized implementations of state-of-the-art hyperparameter tuning algorithms. As a result, Determined’s support for hyperparameter tuning significantly outperforms Katib, while being much easier to use.
Relevant Kubeflow Components: Katib
|Performance: Determined uses the state-of-the-art ASHA algorithm, which was developed by Determined AI employees and achieves 100x better performance than random search and up to 10x better performance than comparable search algorithms.||Performance: Katib implements a range of search algorithms, from random search to Hyperband, a synchronous version of ASHA. Unfortunately their Hyperband implementation has multiple bugs and does not properly support early-stopping, substantially reducing its effectiveness.|
|Ease of Use: Hyperparameter search is configured with a simple configuration file that specifies hyperparameter ranges. Determined handles all of the complex orchestration, including checkpointing, tracking, and executing the algorithm.||Ease of Use: Katib requires you to parse hyperparameters from argparse, meaning you have to write a lot of custom code for each model you want to tune. You then need to specify a Kubernetes YAML spec for training, which requires a good deal of understanding of Kubernetes. Katib has no integrated artifact or checkpoint tracking.|
Determined automatically tracks metrics, parameters, checkpoints, and code versions. Kubeflow provides tools to manually track experiments and access artifacts.
Relevant Kubeflow Components: Kubeflow Metadata
|Capabilities: Determined automatically tracks and manages metadata and artifacts produced by model training. Determined also provides a lightweight model registry, allowing you to easily access the results of experiments and use them in production.||Capabilities: Kubeflow Metadata allows you to upload metadata and artifacts produced by model training. These artifacts can then be programmatically accessed later through the provided object storage.|
|Ease of Use: When you train a model in Determined, your experiment is automatically tracked and important artifacts are managed – no need to manually outfit your code with tracking capabilities. The APIs to access this metadata are built with production ML in mind, making it easy to use your trained models in production.||Ease of Use: In order to log metadata in Kubeflow, you need to manually add a good deal of code to upload metadata to an endpoint. Once you do this, that metadata can be retrieved via APIs, however you’ll need to manually restore your model from source code and trained weights.|
Kubeflow has some really impressive components – in particular, Kubeflow Pipelines and some of the Kubeflow serving tools (like Seldon Core) are best-in-class. Surprisingly, Determined is often easier to integrate with these Kubeflow components than the Kubeflow training components like TFJob and Katib! Determined exposes clean, easy to use APIs, making it simple to kick off training runs from a Kubeflow Pipeline or serve a model in Seldon Core. Check out our example integrations with Kubeflow Pipelines and Seldon Core.
This means that using Determined and Kubeflow together can often be a great approach – Determined for best-in-class model training and development, along with Kubeflow for tasks like model serving and ML pipelines:
Determined can simply be dropped in to improve the tooling provided by Kubeflow. You’ll have access to easy to use, state-of-the-art model training and hyperparameter tuning, as well as all of the excellent components of Kubeflow that will enable more end-to-end workflows. Best of all, Determined and Kubeflow both run natively on Kubernetes, giving you a uniform way to manage your entire ML infrastructure stack.
Determined’s model training capabilities outclass what the Kubeflow model training components can offer. Determined then goes the extra mile to help you productionalize the models you train, by tracking your experiments and making them available for use in your production systems. We think you’ll be best served by using best-in-class systems that work well together, like Determined and Kubeflow Pipelines. Install Determined on Kubernetes today, and join our Slack community if you have any questions!