Software Engineer, Systems

About this role

As a Software Engineer focused on Systems, you will tackle challenging problems at the cutting edge of deep learning research and development, and collaborate with leading machine learning engineers. You will have the opportunity (and responsibility!) to define major aspects of our product: you’ll be expected to take on a difficult problem without a clear solution, and to design, build, and iterate until we’ve reached an elegant solution that delights our customers. You will work on problems such as efficient cluster scheduling over heterogeneous GPUs, implementing cutting-edge algorithms for hyperparameter optimization, and designing systems for managing ETL pipelines and automated deployment of deep models.

Requirements

  • Strong problem solving and analytical skills
  • Excellent communication skills, both written and verbal
  • An exceptional track record of designing, implementing and shipping scalable, reliable production-quality software
  • Experience with distributed and/or concurrent software development
  • Familiarity with machine learning is NOT required

Preferred

  • M.Sc. or PhD in Computer Science, or equivalent industry experience
  • Experience building systems for large-scale data management, analytics, cluster scheduling, stream processing, or machine learning
  • Familiarity with modern container-based cluster managers (e.g., Kubernetes, DC/OS)
  • Experience doing operations and being on-call for production systems
  • Familiarity with hardware performance, HPC and/or scientific computing

Possible projects

  • Implement state-of-the-art communication strategies for distributed training of deep learning models
  • Research and implement novel resource-aware model optimization strategies to help customers deploy models to resource-constrained environments
  • Work with customers to understand their workloads, help them find improved performance using our platform, and champion and implement new product features to improve their experience
  • Explore interesting new data visualizations in our web UI to help customers understand their experiments and workloads

Teams & Process

We are building a team of world-class engineers. Join us! We have one product and one team, where everyone is a worker-leader. We combine input from customers, engineers and company leadership to prioritize our work, and we work hard to make decisions transparent. We believe in tight feedback loops with customers, and in minimum valuable products.

We believe in just enough (but not too much) process; currently we run scrum with two-week sprints. We use GitHub to manage our work; we require code review, lint, and tests to pass for all our PRs. We run an extensive continuous integration pipeline to test our GPU features. We use Slack and G Suite, and have provisioned a video conferencing system for our remote workers.

Technical Challenges

We have implemented, from scratch, a distributed, fault-tolerant GPU cluster manager and scheduler, purpose-built for DL and ML workloads. We have invented, published, and implemented state-of-the-art hyperparameter optimization algorithms in our platform. We have numerous other research ideas ready to turn into product features that will differentiate us from our competitors.

Technical Stack

  • Go
  • Python
  • Docker
  • TensorFlow
  • PyTorch
  • Keras
  • Elm
  • Kubernetes
  • Mesos
  • PostgreSQL