In recent discussions about cloud-based GPU workloads, I was struck by two recurring challenges: the availability of GPU-powered infrastructure, and the operational expertise it takes to deploy and manage GPU workloads.
As the Head of Customer Success for a platform that confronts both of these problems, I wanted to take a moment to talk about where they come from and what teams can do to mitigate them, then finish up with a brief look at how Cycle might help.
One major issue with running GPU-based workloads is the availability of GPU-powered infrastructure. Now, that might not seem like an issue in a global sense. There are plenty of IaaS providers with GPU instances in their catalogs, so what could possibly be the issue?
Imagine this scenario: you've deployed to AWS and you're running a multi-AZ setup in a single region. You need a p2.xlarge or p3.2xlarge (the smallest GPU instances in their respective families), so your queuing program queues up the work to be done and the rest is history… right?
Turns out, more and more frequently, those machines aren't available. As in: AWS, in that region, is simply out of that type of machine. In an effort to figure out why your jobs never made it out of their queue, you dig in and find the capacity errors. Adjusting the parameter, you start allowing the next size up in the p3 family, the p3.8xlarge. Given the ephemeral nature of the program you're running, you have it set up to use on-demand instances, and at the end of the month the bill tells the story.

So you haven't solved your problem, and you've gone over budget in one fell swoop! Yay (sarcastically)…
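To make that failure mode concrete, here's a minimal sketch of the kind of fallback logic teams end up writing by hand. This isn't Cycle code, just boto3 talking to EC2, and the region, AMI, and subnet values are placeholders; the real point is that the only automatic remedy is stepping up to a p3.8xlarge, which runs at roughly four times the on-demand price of a p3.2xlarge.

```python
import boto3
from botocore.exceptions import ClientError

# Instance types to try, smallest (and cheapest) first.
FALLBACK_TYPES = ["p3.2xlarge", "p3.8xlarge"]

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region


def launch_gpu_instance(image_id: str, subnet_id: str) -> str | None:
    """Try each instance type in order; return the instance ID, or None."""
    for instance_type in FALLBACK_TYPES:
        try:
            response = ec2.run_instances(
                ImageId=image_id,    # placeholder AMI
                SubnetId=subnet_id,  # placeholder subnet
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
            )
            return response["Instances"][0]["InstanceId"]
        except ClientError as err:
            # AWS signals a regional capacity shortage with this error code.
            if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                continue  # fall back to the next (larger, pricier) size
            raise
    return None  # nothing available; the job stays in the queue
```

Even with the fallback, the last line is the one that matters: when the whole family is exhausted in your region, the work just sits in the queue.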
Let's address the other major challenge. A single developer can only specialize in so many things, and deploying GPU-powered workloads to a platform isn't always easy.
This problem is worth raising here because, even if you do manage to set up a single provider to run your GPU workloads, having that setup fail, go over budget, or both can feel like a major roadblock. For specialized teams, it becomes a choice between hiring another team member to advance the core work or an ops expert to manage workloads more efficiently.
Now, I'll never argue against the importance of ops. Understanding the how, when, why, and what of production deployments is essential to every team's core strategy. All I'm positing is that when a team is up against resource caps and trying to push a project forward as fast and efficiently as possible, it's important to weigh that hiring decision against the best possible outcomes.
Plenty of organizations are working to improve the GPU workload experience. The real question is becoming: which platform will give our team the highest likelihood of success?
For Cycle users, deploying GPU infrastructure is the same process as deploying any other server (VM, bare metal, etc.). The standardization doesn't stop there, though: our users also get the same experience deploying their containers, scaling, migrating, configuring… the only noticeable difference is setting a couple of environment variables that tell the underlying platform you expect the GPU driver to be mounted.
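I won't reproduce Cycle's exact variable names here, but the effect is easy to verify from inside the workload once the driver is mounted. Here's a minimal, platform-agnostic sketch, assuming the NVIDIA driver (which ships with the `nvidia-smi` utility) is what gets mounted into the container:

```python
import shutil
import subprocess


def gpu_driver_available() -> bool:
    """Check that the NVIDIA driver is visible from inside the container."""
    # nvidia-smi ships with the driver, so if the platform mounted the
    # driver into the container, the binary should be on the PATH.
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("GPU driver mounted:", gpu_driver_available())
```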
The other advantage Cycle users have is that Cycle is multi-cloud native. That means you can pick GPU-powered servers from any of the supported providers that offer them (currently GCP, AWS, and Vultr). Being multi-cloud native all but eliminates the chance that your workloads will go unprocessed, and with automated provisioning, networking, and deployment (which can be further enhanced through the API), you won't need an expert on your team to get things online.
With several platforms in development that leverage and extend Cycle to manage GPU-enabled workloads, we're more excited than ever about this space. There's even a project in the works that could hook directly into other orchestrators, like Nomad or Kubernetes, as an extension.
💡 Interested in trying the Cycle platform? Create your account today! Want to drop in and have a chat with the Cycle team? We'd love to have you join our public Cycle Slack community!