Tackling GPU Scarcity & Complexity Head-On

In recent discussions about cloud-based GPU workloads, I was struck by these two recurring challenges:

Organizations are still having trouble getting GPUs.
The barrier to entry for running GPU-enabled workloads often comes down to a lack of deployment expertise.

As the Head of Customer Success for a platform that confronts both of these problems, I wanted to take a moment to talk about where they come from and what teams can do to mitigate them, then finish with a brief look at how Cycle might help.

A Problem Faced Today

One major issue with running GPU-based workloads is the availability of GPU-powered infrastructure. That might not seem like an issue in a global sense. There are plenty of IaaS providers that offer GPU infrastructure as part of their SKU lineup, so what could possibly be the issue?

Imagine this scenario. You've deployed to AWS and you're running a multi-AZ setup in a single region. You need a p2.xlarge or p3.2xlarge (the smallest machines on the current SKU), so your queuing program queues up the work to be done and the rest is history… Right?

Turns out, more and more frequently, those machines aren't available. As in, AWS is simply out of that type of machine in the given region. When you inspect the queue to figure out why your jobs never made it out, that's what you find. So you adjust the parameter to allow the next size up for p3, which is 8xlarge. Given the ephemeral nature of the program you're running, you have it set up to use on-demand instances, and at the end of the month you find that:

Only about 50% more of your scheduled work is getting done.
The price for 8xlarge is roughly 4x that of 2xlarge.

So you haven't solved your problem and you've gone over budget in one awesome swoop! Yay (sarcastically)...
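
To put rough numbers on that outcome, here is a minimal Python sketch. It uses the two figures from the scenario (about 50% more work completed, at about 4x the hourly price) and assumes, for simplicity, that the same number of instance hours is billed before and after the change. The function name is illustrative, and the result is a ratio, not real billing data.

```python
def relative_cost_per_job(price_multiplier: float, work_multiplier: float) -> float:
    """Cost per completed job relative to the original setup.

    Assumes total spend scales with the hourly price (same billed hours).
    """
    if work_multiplier <= 0:
        raise ValueError("work_multiplier must be positive")
    return price_multiplier / work_multiplier


# Figures taken from the scenario above: ~4x the price, ~1.5x the completed work.
relative_cost = relative_cost_per_job(price_multiplier=4.0, work_multiplier=1.5)
print(f"Each completed job costs about {relative_cost:.2f}x what it did before")
```

Under those assumptions, every job that does finish costs well over twice as much as it did on the smaller machine, which is why simply sizing up rarely fixes a capacity problem.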

The Other Elephant

Let's address the other major challenge. A single developer can only specialize in so many things, and deploying GPU-powered workloads to a platform isn't always easy.

This problem belongs in the same article because, even if you manage to set up a single provider to run your GPU workloads, having that setup fail, go over budget, or both can feel like a major roadblock. For specialized teams, the choice is between hiring another team member to advance core work or hiring an Ops expert to manage workloads more efficiently.

Now, I'll never argue against the importance of ops. It's essential to every team's core strategy: understanding the how, when, why, and what of their production deployments. All I'm positing is that when a team is up against resource constraints and pushing a project forward as quickly and efficiently as possible, it's important to weigh that hiring decision against the best possible outcomes.

Raising the Standard

Undoubtedly, numerous organizations are striving to improve the GPU workload experience. The real question is becoming: what platform will give our team the highest likelihood of success?

For Cycle users, deploying GPU infrastructure is the same process as deploying any other server (VM, bare metal, etc.). The standardization doesn't stop there, though. Our users also have a similar experience deploying their containers, scaling, migrating, and configuring. The only noticeable difference is the need to set a couple of environment variables that tell the underlying platform that you expect the GPU driver to be mounted.

The other advantage Cycle users maintain is that Cycle is multi-cloud native. That means you can pick GPU-powered devices from any of the supported providers that have them (currently GCP, AWS, and Vultr). Being multi-cloud native all but eliminates the chances that your workloads will go unprocessed, and with automated provisioning, networking, and deployment (which can be further enhanced through the API), you won't need an expert on your team to get things online.
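
The general idea of falling back across providers when one is out of capacity can be sketched in a few lines of Python. This illustrates the concept only and is not Cycle's API: `first_available`, the `has_capacity` callback, the provider keys, and the `gpu-small` machine name are all hypothetical placeholders, and the capacity check is stubbed rather than calling any real cloud.

```python
from typing import Callable, Iterable, Optional, Tuple

Offer = Tuple[str, str]  # (provider, machine type), both hypothetical labels


def first_available(
    offers: Iterable[Offer],
    has_capacity: Callable[[str, str], bool],
) -> Optional[Offer]:
    """Return the first offer, in preference order, that reports capacity."""
    for provider, machine in offers:
        if has_capacity(provider, machine):
            return provider, machine
    return None  # nothing available anywhere; the caller decides whether to retry


# Preference order across the providers mentioned above (machine names are made up).
offers = [("aws", "gpu-small"), ("gcp", "gpu-small"), ("vultr", "gpu-small")]

# Stubbed capacity check: pretend the first provider is out of stock.
out_of_stock = {("aws", "gpu-small")}
choice = first_available(offers, lambda p, m: (p, m) not in out_of_stock)
print(choice)
```

Keeping the capacity check as a callback means the selection logic stays the same no matter how availability is actually discovered for each provider.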

What's Next?

With several platforms in development that leverage and extend Cycle to manage GPU-enabled workloads, we're more excited than ever about this space. There's even a project in the works that could hook directly into other orchestrators, like Nomad or Kubernetes, as an extension.
