August 05, 2020
At Determined AI, we enable deep learning engineers to train better models more quickly and to focus on data science rather than managing infrastructure. One of the major pain points we have observed in model training is the process of loading data. In a previous blog post, we described how tf.data.Dataset's focus on sequential rather than random access leads to challenges in supporting common deep learning tasks such as shuffling data, sharding data for distributed training, and efficiently restoring workloads after failures.
Today, we are excited to announce that we are open-sourcing YogaDL under the Apache 2.0 license. YogaDL provides a better approach to data loading and API-transparent caching to local storage, AWS S3, and Google Cloud Storage.
YogaDL is designed to be two things: a standalone caching layer to imbue existing data loaders with the properties that come from a random-access layer, and a better interface for defining data loaders in general.
YogaDL provides both a random-access layer and a sequential-access layer. As we argued recently, supporting efficient random access is critical for good training infrastructure. Direct random access to any record enables:

- High-quality shuffling of the dataset each epoch, without loading the entire dataset into memory.
- Sharding the dataset for distributed training, with each worker reading only its own records.
- Pausing and resuming training mid-epoch, skipping directly to the first record that has not yet been seen.

The sequential-access layer enables efficient streaming of records to the training loop, with transformations such as data augmentation applied after the cache.
YogaDL enables random access by caching datasets. A dataset is cached by iterating over it before the start of training and storing the output to an LMDB file. The caching, which can be done on a local file system, S3, or GCS, enables random access, dataset versioning, and efficient data access.
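The core idea can be sketched in a few lines of plain Python. This is an illustrative toy, not YogaDL's actual LMDB-backed implementation: iterate over the dataset once, store each record under its index, and afterwards serve records in any order.

```python
def build_cache(dataset):
    """One sequential pass over the dataset yields a random-access cache.

    Illustrative stand-in for YogaDL's LMDB file: here the "cache" is
    just an in-memory dict keyed by record index.
    """
    return {i: record for i, record in enumerate(dataset)}

# A "dataset" that only supports sequential iteration.
records = (x * x for x in range(10))

cache = build_cache(records)

# Any record can now be fetched directly, in any order.
print(cache[7], cache[2])  # -> 49 4
```

Because the cache is keyed by index, shuffling, sharding, and resuming mid-epoch all reduce to choosing which indices to read, and in what order.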
Once the dataset is cached, YogaDL provides a random-access layer followed by a sequential-access layer. It does this by introducing the YogaDL.DataRef interface, which creates an explicit boundary between the random- and sequential-access layers.
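To illustrate that boundary, here is a minimal sketch using hypothetical names (not YogaDL's real classes): a toy data reference wraps a random-access store, and its `stream()` method fixes the shuffle order, shard, and starting offset before sequential iteration begins.

```python
import random

class ToyDataRef:
    """Random-access side of the boundary: holds indexable records."""

    def __init__(self, records):
        self.records = list(records)

    def stream(self, start_offset=0, shuffle=False, shuffle_seed=None,
               shard_rank=0, num_shards=1):
        """Sequential-access side: yields records in a fixed order.

        Shuffling and sharding are decided over *indices*, so each
        worker knows exactly which records it will read.
        """
        indices = list(range(len(self.records)))
        if shuffle:
            random.Random(shuffle_seed).shuffle(indices)
        # Each shard takes every num_shards-th index.
        indices = indices[shard_rank::num_shards]
        for i in indices[start_offset:]:
            yield self.records[i]

dataref = ToyDataRef(range(10))
print(list(dataref.stream(shuffle=True, shuffle_seed=23,
                          shard_rank=2, num_shards=4)))
```

Because the shuffle is seeded, every worker computes the same permutation independently, and resuming from `start_offset` is just slicing into a known index order.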
Currently, YogaDL accepts a tf.data.Dataset as input and returns a YogaDL.Stream, which can output either a tf.data.Dataset or a Python generator. Support for additional data frameworks, such as tf.keras Sequences and PyTorch DataLoaders, is on our near-term roadmap.
```python
import yogadl
import tensorflow as tf

# Initialize YogaDL Storage.
config = yogadl.storage.LFSConfigurations(
    storage_dir_path="/tmp/yogadl_cache"
)
storage = yogadl.storage.LFSStorage(config)

# Cache the dataset.
@storage.cacheable(dataset_id="example-dataset", dataset_version="1")
def make_records():
    # Dataset that gets cached.
    return tf.data.Dataset.range(10)

# Create the random-access layer.
dataref = make_records()

# Create the sequential-access layer.
stream = dataref.stream(
    start_offset=3,
    shuffle=True,
    shuffle_seed=23,
    shard_rank=2,
    num_shards=4,
)

# Convert to a tf.data.Dataset.
tf_dataset = yogadl.tensorflow.make_tf_dataset(stream)

# Read the dataset.
batches = tf_dataset.repeat(3).batch(5)
for batch in batches:
    print(batch)
```
This code snippet shows how YogaDL can be used with tf.data.Dataset. Compared to using just a tf.data.Dataset, YogaDL enables users to:

- Shuffle the dataset each epoch without maintaining a large in-memory shuffle buffer.
- Pause and resume training mid-epoch by skipping directly to the first unseen record via `start_offset`.
- Shard the dataset efficiently for distributed training. With a plain tf.data.Dataset, to shard the dataset among workers, oftentimes every worker needs to iterate through the entire dataset. With YogaDL, every worker reads only the data associated with its shard, which is much more efficient.
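The difference in read volume is easy to quantify. With sequential-only access, each of the N workers scans all R records and discards everything outside its shard (N·R reads total); with random access, each worker reads only its own R/N records (R reads total). A quick back-of-the-envelope sketch, with illustrative numbers:

```python
num_records = 1_000_000
num_workers = 8

# Sequential-only sharding: every worker iterates the whole dataset
# and keeps just the records in its own shard.
sequential_reads = num_workers * num_records

# Random-access sharding: each worker reads only its shard's indices.
random_access_reads = num_workers * (num_records // num_workers)

print(sequential_reads)     # 8000000
print(random_access_reads)  # 1000000
```

In other words, sequential-only sharding multiplies total I/O by the number of workers, while random-access sharding keeps it constant regardless of cluster size.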
YogaDL is not a data manipulation API: the world has more than enough of those. Instead, YogaDL seeks to be API-transparent, so that you can continue to use your existing data loading code but with all the benefits of a high-performance, random-access cache. If you have data augmentation steps which cannot be cached, that code should continue to work without any modifications.
YogaDL offers basic dataset versioning, but it is not yet a full-blown version control system for datasets. Richer version control for datasets is on the roadmap as well.
In addition to offering YogaDL as a standalone library, we have integrated it into our open-source training platform, Determined. To get started with using YogaDL in Determined, take a look at our documentation and examples.