July 3rd, 2025 - Chris Aubuchon, Head of Customer Success

Best Ways to Find Troublesome Containers and Virtual Machines Using Cycle's Portal

The best problems are the ones you never have to deal with. That's why smart teams catch issues early, before they impact production. Cycle gives you the visibility to spot troublesome workloads, control resource usage, and take action before things go sideways.

What Makes A Workload Troublesome?

Not every spike in resource usage is a problem! However, if you're seeing sustained or erratic behavior, that can be an indicator of something deeper.

Here are the most common symptoms to look for (a rough scripted check follows the list):

  • High CPU Usage
    A workload consuming too much CPU can starve other workloads, raise server load averages, and even cause the node to crash.
  • Memory Overuse
    Workloads that slowly balloon in memory can push a host into swap, trigger OOM kills, or degrade performance for others.
  • Disk Space Abuse
    Whether it's aggressive logging or runaway temp files, workloads can fill up storage quickly. This is one of the easiest ways to knock a node offline.
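
If you want a rough, Cycle-agnostic way to spot these same three symptoms on a single host, a minimal Python sketch might look like the following. It assumes the third-party psutil package is installed, and the thresholds are arbitrary examples you'd tune for your own workloads.

    # Rough host-level symptom check; thresholds are illustrative only.
    import psutil

    def spot_symptoms(cpu_pct_limit=85.0, ram_pct_limit=90.0, disk_pct_limit=85.0):
        symptoms = []
        if psutil.cpu_percent(interval=1) > cpu_pct_limit:    # sample CPU over 1 second
            symptoms.append("high CPU usage")
        if psutil.virtual_memory().percent > ram_pct_limit:   # overall RAM pressure
            symptoms.append("memory overuse")
        if psutil.disk_usage("/").percent > disk_pct_limit:   # root filesystem filling up
            symptoms.append("disk nearly full")
        return symptoms

    if __name__ == "__main__":
        print(spot_symptoms() or "nothing obviously troublesome right now")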

Cycle has preventative measures that help keep things running smoothly, even in the most demanding conditions. Alongside these helpful features, you can also get compelling views of many different metrics at different aggregation levels: cluster, environment, and server!

Let's take a look at how the platform helps keep things sane, and then dive into these monitoring views.

Helpful Platform Guardrails

The platform boasts many features that help keep things running smoothly even under extreme loads, high RAM usage, and runaway disk writes.

Server Volumes

Server Volumes dashboard

Agent volumes and logs volumes are part of each server. The goal here is twofold:

  1. The agent volume gives the Cycle agent a place where it can always write its logs, even when the disk is filling up. If you have a runaway service that's crushing resources and possibly filling the disk quickly, the last thing you want is for the agent to be unable to work or write its logs.
  2. The logs volume is a fixed-size volume for all logs on a given server. The big win here is that, again, if you have a runaway service producing a huge amount of logs, the disk isn't going to fill up.

Reserved Service Resources

Along with the agent volume, the compute and agent services also get a small amount of reserved CPU and RAM so they can keep communicating and executing work during times of high resource usage.

Container Resource Configurations

The most concrete way to keep things from getting out of hand is to set hard limits on container instance resource usage via the container configuration. This lets you set the amount of CPU and RAM each container instance has access to. These constraints pair perfectly with auto-scaling thresholds for even smoother operation; a minimal, illustrative sketch of that pairing is shown below.
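
To make the relationship between hard limits and scaling thresholds concrete, here is a small Python sketch. It is illustrative only: the field names are hypothetical and do not reflect Cycle's actual configuration schema or API.

    # Illustrative only: field names are hypothetical, not Cycle's actual config schema.
    instance_limits = {
        "cpu_cores": 1.0,   # hard CPU cap per container instance
        "ram_mb": 512,      # hard RAM cap per container instance
    }

    scale_thresholds = {
        "cpu_pct_of_limit": 80,   # scale out above this sustained share of the CPU cap
        "ram_pct_of_limit": 80,   # scale out above this share of the RAM cap
    }

    def should_scale_out(cpu_cores_used: float, ram_mb_used: float) -> bool:
        """Decide whether to add an instance based on usage relative to the hard limits."""
        cpu_pct = 100.0 * cpu_cores_used / instance_limits["cpu_cores"]
        ram_pct = 100.0 * ram_mb_used / instance_limits["ram_mb"]
        return (cpu_pct >= scale_thresholds["cpu_pct_of_limit"]
                or ram_pct >= scale_thresholds["ram_pct_of_limit"])

    print(should_scale_out(cpu_cores_used=0.9, ram_mb_used=300))  # True: CPU at 90% of its cap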

While these guardrails help protect and steer your services toward success, there are even more granular views to build a complete picture about your infrastructure and workloads.

First let's take a look at a server dashboard.

Server Dashboard and Telemetry

Server telemetry dashboard

I've purposely added load to this server over the past few days and even pushed it into an "unhealthy" state to highlight this view. What you can see above is that the server is reporting its state as unhealthy, which is currently based on the load averages (shown in yellow). The server in question has 2 threads. The generally accepted rule of thumb for server load is 1.0 per thread, so the healthy top end for this node would be 2.0.
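
If you want to apply that same rule of thumb directly on a host, a tiny Python sketch might look like this. It uses only the standard library and works on Unix-like systems, where os.getloadavg is available.

    # Compare the 5-minute load average to the 1.0-per-thread rule of thumb.
    import os

    threads = os.cpu_count() or 1   # the example node above has 2 threads
    healthy_max = 1.0 * threads     # so its comfortable ceiling is 2.0

    load_1m, load_5m, load_15m = os.getloadavg()
    status = "unhealthy" if load_5m > healthy_max else "healthy"
    print(f"5m load {load_5m:.2f} vs ceiling {healthy_max:.2f} -> {status}")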

Server telemetry zoomed in on load cpu and ram

Scrolling down a bit, we can see that the load average, CPU usage, and RAM usage are all available in the server telemetry section, with graphs for each. Using the gear icon, the timeframe can be set shorter or longer depending on the view you're trying to get. In this case, we can see that the server is consistently running at about 50% total CPU usage and consuming about 2GB of RAM.

In order to dive into the network and disk usage, scroll down further.

Server telemetry zoomed in on network and disk usage

Finally, we can see a more granular view of that disk in the compute storage graphs. Here we get aggregates in the top view (images, logs, instances, etc.) and a drilled-down look at base volumes in the bottom graph. We can use this to see whether any containers are producing excessive logs or rogue output (think files written outside of volumes) that could become problematic for the disk on a given node.

Server telemetry zoomed in on compute storage graphs
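
If you'd like to confirm a suspicion from the host side as well, a quick way to see which service is writing the most data is to total file sizes per subdirectory. Here's a minimal Python sketch; the /var/log/containers path is purely a hypothetical stand-in for wherever your logs actually live.

    # Sum file sizes per subdirectory to find the heaviest log producers.
    from pathlib import Path

    LOG_ROOT = Path("/var/log/containers")  # hypothetical path; adjust for your setup

    def bytes_per_subdir(root: Path) -> dict[str, int]:
        totals = {}
        for child in root.iterdir():
            if child.is_dir():
                totals[child.name] = sum(
                    f.stat().st_size for f in child.rglob("*") if f.is_file()
                )
        return totals

    for name, size in sorted(bytes_per_subdir(LOG_ROOT).items(),
                             key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1e6:10.1f} MB  {name}")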

This is a great way to get a quick, high-level picture of what's going on with a given server. Next we'll look at the different monitoring views and their differing levels of aggregation.

Hub Monitoring

Each hub has monitoring views at the cluster, server, and environment level. Each monitoring view shows the same types of charts and lists, but at a different level of aggregation.

I've set up the views so that we can see what a high CPU usage container instance looks like at each layer.

Cluster Level Monitoring

Cluster monitoring view full

In this view we can see that the current usage for a container instance named services-api has been consistently high across the cluster. Another interesting piece is that there are 3 other instances of the same container on different nodes with very similar resource usage profiles.

The cluster monitoring view provides a great, high-level look at usage across a cluster and gets more and more valuable as the number of containers in a cluster grows.

Server Level Monitoring

By clicking on the top server from the list and then going to that server's Monitoring tab, we can see the server monitoring view.

Server monitoring view full

Here again the services-api instance sticks out, but you'll notice that only a single instance appears, since only one instance with high CPU usage is running on this node.

The main differences between the two views we've looked at here are:

  1. The cluster view shows all container instances across all servers in a given cluster, while the server view is instances on that server only.
  2. The cluster view links to instance and server, while the server view links to instance and environment.

It's interesting to think about overlaying the CPU usage graph on this page with the server telemetry graphs for CPU and load from the previous pages. Do you see the similarities in the usage patterns?

Environment Level Monitoring

Environment monitoring view full

As we move into the environment level monitoring view, we can again see all 4 instances of services-api with high CPU consumption.

In the examples shown, this view and the cluster view probably seem very similar. But as organizational usage scales, clusters may be home to hundreds or even thousands of containers. At that scale, this narrower, environment-scoped view becomes incredibly useful!

The most fine-tuned look at this services-api container comes from instance telemetry. Let's take a look at that now.

Instance Telemetry

Each container instance has an instance telemetry chart. The instances use the container deployment's telemetry configuration settings to dictate telemetry report retention and sample sizes. These telemetry metrics are different from monitoring metrics: you can control how long they are stored and the sampling interval, and you can even disable telemetry reporting altogether (though this isn't recommended unless there's a good reason).

Instance telemetry graphs

The image above shows the instance telemetry stream for the container that I've purposely been pushing to higher CPU usage. By default, you'll see the stream view. If you'd like to change it to the report view, just click the gear in the top right of the section and select the setting you'd like.

Here we can see several really nice pieces of info:

  1. The CPU usage is oscillating, but is also very consistent.
  2. The RAM usage correlates closely with CPU usage, also staying within a corridor of peaks and valleys (see the sketch after this list).
  3. The network usage doesn't seem to be anything to be concerned about for this instance.
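
If you want to put a number on how tightly the CPU and RAM series track each other, you can export a handful of samples and compute a simple Pearson correlation. The values below are made up for illustration; this sketch uses only the Python standard library (3.10+).

    # Pearson correlation between two telemetry series; the sample data is invented.
    from statistics import correlation  # available in Python 3.10+

    cpu_pct = [42, 55, 61, 48, 70, 66, 52, 58]              # illustrative CPU samples
    ram_mb = [810, 930, 990, 860, 1080, 1040, 890, 950]     # illustrative RAM samples

    r = correlation(cpu_pct, ram_mb)
    print(f"CPU/RAM correlation: {r:.2f}")  # values near 1.0 rise and fall together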

Bringing It All Together

So far today we've looked at:

  • The server dashboard and telemetry
  • The monitoring dashboards
  • The instance telemetry

With these resources, you can quickly catch problems before they end up affecting your users. On Cycle, you have a clear view across environments, clusters, servers, and even down to individual container instances. Follow the thread and you'll figure out what's going on FAST.

It could be anything from a container hogging CPU to a memory leak or a disk creeping toward full. Use these techniques to spot the pattern and take action!

💡 Interested in trying the Cycle platform? Create your account today! Want to drop in and have a chat with the Cycle team? We'd love to have you join our public Cycle Slack community!