We all know that keeping a server clock "on time" is an ongoing problem that computer scientists have wrestled with for decades. Nowadays, most servers keep time with a hardware real-time clock: a quartz crystal oscillator kept powered by a CMOS battery, even when the machine is off. The crystal vibrates at a very stable frequency, and that's how the clock maintains its accuracy (cool, right?).
Events like rebooting and power loss can cause the server to abandon its primary clock source and fall back to an OS-level timekeeping system. This fallback, while functional, is less precise and can cause much higher drift than the quartz oscillator.
So, what happens when server clocks drift out of sync? In distributed systems, even minor discrepancies can cause big issues. When first learning about this topic, I didn't quite understand how big a deal keeping these clocks synchronized was. So the clocks are off by a dozen milliseconds here or even a few seconds there… will that really make a major impact?
The answer was a resounding "yes". Time drift introduces a multitude of subtle problems that can become major issues if they're not correctly managed. Data inconsistency emerges when timestamps no longer align. Logs get messy when a log's timestamp is out of line with its event. Security protocols fail when time-sensitive tokens are rejected because they were… created in the future. 😅
The genesis of this blog post was actually a conversation I had with a Cycle user named Thomas. He noticed that one of his org's applications was issuing JSON Web Tokens whose nbf ("not before") claim carried a timestamp that was in the future!
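To make that failure mode concrete, here's a minimal sketch using Python and the PyJWT library (an assumption for illustration; the post doesn't say what Thomas's stack used). An issuer whose clock runs a couple of seconds fast stamps an nbf in the verifier's future, and verification fails:

```python
import time

import jwt  # PyJWT

SECRET = "demo-secret"  # hypothetical signing key
CLOCK_SKEW = 2.0        # seconds the issuer's clock runs fast

# Issuer: a drifted clock stamps an nbf that sits in the verifier's future.
token = jwt.encode(
    {"sub": "thomas", "nbf": time.time() + CLOCK_SKEW},
    SECRET,
    algorithm="HS256",
)

# Verifier: its correct clock says the token isn't valid yet.
try:
    jwt.decode(token, SECRET, algorithms=["HS256"])
except jwt.ImmatureSignatureError as err:
    print(f"token rejected: {err}")
```

PyJWT's leeway argument to jwt.decode() exists to absorb exactly this kind of small skew, but it only helps up to whatever allowance you grant it.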
He decided to run a set of containers on the hosts that would execute an ntpdate script and output the time, to measure the drift. What he found is that the servers had offsets ranging from +0.99 to -0.74 seconds, which is quite a large gap.
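If you want to spot-check a host's offset without installing anything, a minimal SNTP query gets you roughly the number ntpdate reports. This is an illustrative sketch under the assumption that an NTP pool host (here, pool.ntp.org) is reachable; it isn't the script Thomas ran:

```python
import socket
import struct
import time

NTP_PORT = 123
NTP_EPOCH_OFFSET = 2208988800  # seconds between the NTP epoch (1900) and the Unix epoch (1970)

def query_ntp_offset(server: str = "pool.ntp.org") -> float:
    """Return the approximate offset, in seconds, between this host's clock
    and the server's. Positive means the local clock is running behind."""
    # SNTP request header: LI=0, VN=3, Mode=3 (client) -> first byte 0x1B
    packet = b"\x1b" + 47 * b"\0"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(5)
        t_send = time.time()
        sock.sendto(packet, (server, NTP_PORT))
        data, _ = sock.recvfrom(512)
        t_recv = time.time()
    # Transmit timestamp lives at bytes 40-47: 32-bit seconds + 32-bit fraction
    secs, frac = struct.unpack("!II", data[40:48])
    server_time = secs - NTP_EPOCH_OFFSET + frac / 2**32
    # Split the round trip in half, assuming a symmetric network path
    return server_time - (t_send + t_recv) / 2

if __name__ == "__main__":
    print(f"clock offset: {query_ntp_offset():+.3f} s")
```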
The root cause of the drift was most likely a dying CMOS battery, which forced a few of his servers to fall back from the quartz crystal to OS timekeeping, but with those servers in a distant datacenter it was difficult to diagnose for certain. So we opened a conversation about what we could do to make sure his servers were synced more aggressively.
The Cycle platform was using ntpd behind the scenes on a 12-hour interval to sync all the nodes. Originally, we opted for the 12-hour sync thinking that the NTP time pool might block customer servers if they synced too aggressively. After our conversation with Thomas, we dove a little deeper and learned that ntpd, by default, uses a sync interval of around 10 minutes.
After learning this, we updated Cycle to use the same 10-minute interval. This approach allows us to better account for the variability of systems, especially when you consider that any node's CMOS battery can fail at any time, which, without syncing on a shorter timeframe, could lead to much greater drift.
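As a rough illustration of what the tighter cadence buys, here's a sketch of a 10-minute watch loop built on the query_ntp_offset() helper from the SNTP sketch above. The threshold and the alerting are hypothetical stand-ins, not how Cycle's platform actually reacts:

```python
import time

SYNC_INTERVAL = 10 * 60  # seconds; mirrors ntpd's roughly 10-minute default
DRIFT_THRESHOLD = 0.5    # seconds; an arbitrary alert threshold for illustration

def watch_drift() -> None:
    # Assumes query_ntp_offset() from the SNTP sketch above is in scope.
    while True:
        offset = query_ntp_offset()
        if abs(offset) > DRIFT_THRESHOLD:
            # A real agent would step or slew the clock here; checking every
            # 10 minutes bounds how far a failing node can drift unnoticed.
            print(f"WARNING: clock offset {offset:+.3f}s exceeds threshold")
        time.sleep(SYNC_INTERVAL)
```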
When we make an update to the platform, one of the most gratifying things (at least for Customer Success nerds like myself) is seeing that update get applied to all user infrastructure at once. Automatic updates are one of my favorite parts of Cycle, and at times (no pun intended) like this, where we were able to locate a very specific edge case that could have caused an issue but hadn't yet… seeing the fix go out to everyone at once just makes my day.
💡 Interested in trying the Cycle platform? Create your account today! Want to drop in and have a chat with the Cycle team? We'd love to have you join our public Cycle Slack community!