July 3rd, 2025 - Chris Aubuchon, Head of Customer Success

Best Ways to Find Troublesome Containers and Virtual Machines Using Cycle's Portal

The best problems are the ones you never have to deal with. That's why smart teams catch issues early, before they impact production. Cycle gives you the visibility to spot troublesome workloads, control resource usage, and take action before things go sideways.

What Makes A Workload Troublesome?

Not every spike in resource usage is a problem! However, if you're seeing sustained or erratic behavior, that can be an indicator of something deeper.

Here are the most common symptoms to look for (a rough scripted check follows the list):

  • High CPU Usage
    A workload consuming too much CPU can starve other workloads, raise server load averages, and even cause the node to crash.
  • Memory Overuse
    Workloads that slowly balloon in memory can push a host into swap, trigger OOM kills, or degrade performance for others.
  • Disk Space Abuse
    Whether it's aggressive logging or runaway temp files, workloads can fill up storage quickly. This is one of the easiest ways to knock a node offline.
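
If you want a rough, Cycle-agnostic way to spot these same three symptoms on a single host, a minimal Python sketch might look like the following. It assumes the third-party psutil package is installed, and the thresholds are arbitrary examples you'd tune for your own workloads.

    # Rough host-level symptom check; thresholds are illustrative only.
    import psutil

    def spot_symptoms(cpu_pct_limit=85.0, ram_pct_limit=90.0, disk_pct_limit=85.0):
        symptoms = []
        if psutil.cpu_percent(interval=1) > cpu_pct_limit:    # sample CPU over 1 second
            symptoms.append("high CPU usage")
        if psutil.virtual_memory().percent > ram_pct_limit:   # overall RAM pressure
            symptoms.append("memory overuse")
        if psutil.disk_usage("/").percent > disk_pct_limit:   # root filesystem filling up
            symptoms.append("disk nearly full")
        return symptoms

    if __name__ == "__main__":
        print(spot_symptoms() or "nothing obviously troublesome right now")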

Cycle has preventative measures that help keep things running smoothly, even in the most demanding conditions. Alongside these helpful features, you can also get compelling views of many different metrics at different aggregation levels: cluster, environment, and server!

Let's take a look at how the platform helps keep things sane, and then dive into these monitoring views.

Helpful Platform Guardrails

The platform boasts many features that help keep things running smoothly even under extreme loads, high RAM usage, and runaway disk writes.

Server Volumes

Server Volumes dashboard

Agent volumes and logs volumes are part of each server. The goal here is twofold:

  1. The agent volume gives the Cycle agent a place where it can always write its logs, even when the disk is filling up. If you have a runaway service that's crushing resources and possibly filling the disk quickly, the last thing you want is for the agent to be unable to work or write its logs.
  2. The logs volume is a fixed-size volume for all logs on a given server. The big win here is that, again, if you have a runaway service producing a huge amount of logs, the disk isn't going to fill up.

Reserved Service Resources

Along with the agent volume, the compute and agent services also get a small amount of reserved CPU and RAM so they can keep communicating and executing work during times of high resource usage.

Container Resource Configurations

The most concrete way to keep things from getting out of hand is to set hard limits on container instance resource usage via the container configuration. This lets you set the amount of CPU and RAM each container instance has access to. These constraints pair perfectly with auto-scaling thresholds for even smoother operation; a minimal, illustrative sketch of that pairing is shown below.
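
To make the relationship between hard limits and scaling thresholds concrete, here is a small Python sketch. It is illustrative only: the field names are hypothetical and do not reflect Cycle's actual configuration schema or API.

    # Illustrative only: field names are hypothetical, not Cycle's actual config schema.
    instance_limits = {
        "cpu_cores": 1.0,   # hard CPU cap per container instance
        "ram_mb": 512,      # hard RAM cap per container instance
    }

    scale_thresholds = {
        "cpu_pct_of_limit": 80,   # scale out above this sustained share of the CPU cap
        "ram_pct_of_limit": 80,   # scale out above this share of the RAM cap
    }

    def should_scale_out(cpu_cores_used: float, ram_mb_used: float) -> bool:
        """Decide whether to add an instance based on usage relative to the hard limits."""
        cpu_pct = 100.0 * cpu_cores_used / instance_limits["cpu_cores"]
        ram_pct = 100.0 * ram_mb_used / instance_limits["ram_mb"]
        return (cpu_pct >= scale_thresholds["cpu_pct_of_limit"]
                or ram_pct >= scale_thresholds["ram_pct_of_limit"])

    print(should_scale_out(cpu_cores_used=0.9, ram_mb_used=300))  # True: CPU at 90% of its cap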

While these guardrails help protect and steer your services toward success, there are even more granular views to build a complete picture about your infrastructure and workloads.

First let's take a look at a server dashboard.

Server Dashboard and Telemetry

Server telemetry dashboard

I've purposely added load to this server over the past few days and even pushed it into an "unhealthy" state to highlight this view. What you can see above is that the server is reporting its state as unhealthy, which is currently based on the load averages (shown in yellow). The server in question has 2 threads. The generally accepted rule of thumb for server load is 1.0 per thread, so the healthy top end for this node would be 2.0.
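
If you want to apply that same rule of thumb directly on a host, a tiny Python sketch might look like this. It uses only the standard library and works on Unix-like systems, where os.getloadavg is available.

    # Compare the 5-minute load average to the 1.0-per-thread rule of thumb.
    import os

    threads = os.cpu_count() or 1   # the example node above has 2 threads
    healthy_max = 1.0 * threads     # so its comfortable ceiling is 2.0

    load_1m, load_5m, load_15m = os.getloadavg()
    status = "unhealthy" if load_5m > healthy_max else "healthy"
    print(f"5m load {load_5m:.2f} vs ceiling {healthy_max:.2f} -> {status}")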

Server telemetry zoomed in on load cpu and ram

Scrolling down a bit, we can see that the load average, CPU usage, and RAM usage are all available in the server telemetry section, with graphs for each. Using the gear icon, the timeframe can be set shorter or longer depending on the view you're trying to get. In this case, we can see that the server is consistently running at about 50% total CPU usage and consuming about 2GB of RAM.

In order to dive into the network and disk usage, scroll down further.

Server telemetry zoomed in on network and disk usage

Finally, we can see a more granular view of that disk in the compute storage graphs. Here we get aggregates in the top view (images, logs, instances, etc.) and a drilled-down look at base volumes in the bottom graph. We can use this to see whether any containers are producing excessive logs or rogue output (think files written outside of volumes) that could become problematic for the disk on a given node.

Server telemetry zoomed in on compute storage graphs
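
If you'd like to confirm a suspicion from the host side as well, a quick way to see which service is writing the most data is to total file sizes per subdirectory. Here's a minimal Python sketch; the /var/log/containers path is purely a hypothetical stand-in for wherever your logs actually live.

    # Sum file sizes per subdirectory to find the heaviest log producers.
    from pathlib import Path

    LOG_ROOT = Path("/var/log/containers")  # hypothetical path; adjust for your setup

    def bytes_per_subdir(root: Path) -> dict[str, int]:
        totals = {}
        for child in root.iterdir():
            if child.is_dir():
                totals[child.name] = sum(
                    f.stat().st_size for f in child.rglob("*") if f.is_file()
                )
        return totals

    for name, size in sorted(bytes_per_subdir(LOG_ROOT).items(),
                             key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1e6:10.1f} MB  {name}")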

This is a great way to get a quick, high-level picture of what's going on with a given server. Next we'll look at the different monitoring views and their differing levels of aggregation.

Hub Monitoring

Each hub has monitoring views at the cluster, server, and environment level. Each monitoring view shows the same types of charts and lists, but at a different level of aggregation.

I've set up the views so that we can see what a high CPU usage container instance looks like at each layer.

Cluster Level Monitoring

Cluster monitoring view full

In this view we can see that the current usage for a container instance named services-api has been consistently high across the cluster. Another interesting piece is that there are 3 other instances of the same container on different nodes with very similar resource usage profiles.

The cluster monitoring view provides a great, high-level look at usage across a cluster and gets more and more valuable as the number of containers in a cluster grows.

Server Level Monitoring

By clicking on the top server from the list and then going to that server's Monitoring tab, we can see the server monitoring view.

Server monitoring view full

Here again the services-api instance sticks out, but you'll notice that only a single instance appears, since only one instance with high CPU usage is running on this node.

The main differences between the two views we've looked at here are:

  1. The cluster view shows all container instances across all servers in a given cluster, while the server view is instances on that server only.
  2. The cluster view links to instance and server, while the server view links to instance and environment.

It's interesting to think about overlaying the CPU usage graph on this page with the server telemetry graphs for CPU and load from the previous pages. Do you see the similarities in the usage patterns?

Environment Level Monitoring

Environment monitoring view full

As we move into the environment level monitoring view, we can again see all 4 instances of services-api with high CPU consumption.

In the examples shown, this view and the cluster view probably seem very similar. But as organizational usage scales, clusters may be home to hundreds or even thousands of containers. At that scale, this narrower, environment-scoped view becomes incredibly useful!

The most fine-tuned look at this services-api container comes from instance telemetry. Let's take a look at that now.

Instance Telemetry

Each container instance has an instance telemetry chart. The instances use the container deployment's telemetry configuration settings to dictate telemetry report retention and sample sizes. These telemetry metrics are different from monitoring metrics: you can control how long they are stored and the sampling interval, and you can even disable telemetry reporting altogether (though this isn't recommended unless there's a good reason).

Instance telemetry graphs

The image above shows the instance telemetry stream for the container that I've purposely been pushing to higher CPU usage. By default, you'll see the stream view. If you'd like to change it to the report view, just click the gear in the top right of the section and select the setting you'd like.

Here we can see several really nice pieces of info:

  1. The CPU usage is oscillating, but is also very consistent.
  2. The RAM usage correlates closely with CPU usage, also staying within a corridor of peaks and valleys (see the sketch after this list).
  3. The network usage doesn't seem to be anything to be concerned about for this instance.
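
If you want to put a number on how tightly the CPU and RAM series track each other, you can export a handful of samples and compute a simple Pearson correlation. The values below are made up for illustration; this sketch uses only the Python standard library (3.10+).

    # Pearson correlation between two telemetry series; the sample data is invented.
    from statistics import correlation  # available in Python 3.10+

    cpu_pct = [42, 55, 61, 48, 70, 66, 52, 58]              # illustrative CPU samples
    ram_mb = [810, 930, 990, 860, 1080, 1040, 890, 950]     # illustrative RAM samples

    r = correlation(cpu_pct, ram_mb)
    print(f"CPU/RAM correlation: {r:.2f}")  # values near 1.0 rise and fall together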

Bringing It All Together

So far today we've looked at:

  • The server dashboard and telemetry
  • The monitoring dashboards
  • The instance telemetry

With these resources, you can quickly catch problems before they end up affecting your users. On Cycle, you have a clear view across environments, clusters, servers, and even down to individual container instances. Follow the thread and you'll figure out what's going on FAST.

It could be anything from a container hogging CPU to a memory leak or a disk creeping toward full. Use these techniques to spot the pattern and take action!

💡 Interested in trying the Cycle platform? Create your account today! Want to drop in and have a chat with the Cycle team? We'd love to have you join our public Cycle Slack community!