We all know that keeping a server clock "on time" is an ongoing problem that computer scientists have wrestled with for decades. Nowadays, most servers keep time with a hardware real-time clock: a quartz crystal oscillator kept powered by a CMOS battery, even when the machine is off. The crystal vibrates at a very stable frequency, and that's how the clock maintains its accuracy (cool, right?).
Events like rebooting and power loss can cause the server to abandon its primary clock source and fall back to an OS-level timekeeping system. This fallback, while functional, is less precise and can cause much higher drift than the quartz oscillator.
So, what happens when server clocks drift out of sync? In distributed systems, even minor discrepancies can cause big issues. When first learning about this topic, I didn't quite understand how big a deal keeping these clocks synchronized was. So the clocks are off by a dozen milliseconds here or even a few seconds there… will that really make a major impact?
The answer was a resounding "yes". Time drift introduces a multitude of subtle problems that can become major issues if they're not correctly managed. Data inconsistency emerges when timestamps no longer align. Logs get messy when a log's timestamp is out of line with its event. Security protocols fail when time-sensitive tokens are rejected because they were… created in the future. 😅
The genesis of this blog post was actually a conversation I had with a Cycle user named Thomas. He noticed that one of his org's applications was issuing JSON Web Tokens whose nbf ("not before") claim carried a timestamp that was in the future!
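To make that failure mode concrete, here's a minimal sketch using Python and the PyJWT library (an assumption for illustration; the post doesn't say what Thomas's stack used). An issuer whose clock runs a couple of seconds fast stamps an nbf in the verifier's future, and verification fails:

```python
import time

import jwt  # PyJWT

SECRET = "demo-secret"  # hypothetical signing key
CLOCK_SKEW = 2.0        # seconds the issuer's clock runs fast

# Issuer: a drifted clock stamps an nbf that sits in the verifier's future.
token = jwt.encode(
    {"sub": "thomas", "nbf": time.time() + CLOCK_SKEW},
    SECRET,
    algorithm="HS256",
)

# Verifier: its correct clock says the token isn't valid yet.
try:
    jwt.decode(token, SECRET, algorithms=["HS256"])
except jwt.ImmatureSignatureError as err:
    print(f"token rejected: {err}")
```

PyJWT's leeway argument to jwt.decode() exists to absorb exactly this kind of small skew, but it only helps up to whatever allowance you grant it.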
He decided to run a set of containers on the hosts that would execute an ntpdate script and output the time, to measure the drift. What he found is that the servers had offsets ranging from +0.99 to -0.74 seconds, which is quite a large gap.
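If you want to spot-check a host's offset without installing anything, a minimal SNTP query gets you roughly the number ntpdate reports. This is an illustrative sketch under the assumption that an NTP pool host (here, pool.ntp.org) is reachable; it isn't the script Thomas ran:

```python
import socket
import struct
import time

NTP_PORT = 123
NTP_EPOCH_OFFSET = 2208988800  # seconds between the NTP epoch (1900) and the Unix epoch (1970)

def query_ntp_offset(server: str = "pool.ntp.org") -> float:
    """Return the approximate offset, in seconds, between this host's clock
    and the server's. Positive means the local clock is running behind."""
    # SNTP request header: LI=0, VN=3, Mode=3 (client) -> first byte 0x1B
    packet = b"\x1b" + 47 * b"\0"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(5)
        t_send = time.time()
        sock.sendto(packet, (server, NTP_PORT))
        data, _ = sock.recvfrom(512)
        t_recv = time.time()
    # Transmit timestamp lives at bytes 40-47: 32-bit seconds + 32-bit fraction
    secs, frac = struct.unpack("!II", data[40:48])
    server_time = secs - NTP_EPOCH_OFFSET + frac / 2**32
    # Split the round trip in half, assuming a symmetric network path
    return server_time - (t_send + t_recv) / 2

if __name__ == "__main__":
    print(f"clock offset: {query_ntp_offset():+.3f} s")
```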
The root cause of the drift was most likely a dying CMOS battery, which forced a few of his servers to fall back from the quartz crystal to OS timekeeping, but with those servers in a distant datacenter it was difficult to diagnose for certain. So we opened a conversation about what we could do to make sure his servers were synced more aggressively.
The Cycle platform was using ntpd behind the scenes on a 12-hour interval to sync all the nodes. Originally, we opted for the 12-hour sync thinking that the NTP time pool might block customer servers if they synced too aggressively. After our conversation with Thomas, we dove a little deeper and learned that ntpd, by default, uses a sync interval of around 10 minutes.
After learning this, we updated Cycle to use the same 10-minute interval. This approach allows us to better account for the variability of systems, especially when you consider that any node's CMOS battery can fail at any time, which, without syncing on a shorter timeframe, could lead to much greater drift.
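As a rough illustration of what the tighter cadence buys, here's a sketch of a 10-minute watch loop built on the query_ntp_offset() helper from the SNTP sketch above. The threshold and the alerting are hypothetical stand-ins, not how Cycle's platform actually reacts:

```python
import time

SYNC_INTERVAL = 10 * 60  # seconds; mirrors ntpd's roughly 10-minute default
DRIFT_THRESHOLD = 0.5    # seconds; an arbitrary alert threshold for illustration

def watch_drift() -> None:
    # Assumes query_ntp_offset() from the SNTP sketch above is in scope.
    while True:
        offset = query_ntp_offset()
        if abs(offset) > DRIFT_THRESHOLD:
            # A real agent would step or slew the clock here; checking every
            # 10 minutes bounds how far a failing node can drift unnoticed.
            print(f"WARNING: clock offset {offset:+.3f}s exceeds threshold")
        time.sleep(SYNC_INTERVAL)
```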
When we make an update to the platform, one of the most gratifying things (at least for Customer Success nerds like myself) is seeing that update get applied to all user infrastructure at once. Automatic updates are one of my favorite parts of Cycle, and at times (no pun intended) like this, where we were able to locate a very specific edge case that could have caused an issue but hadn't yet… seeing the fix go out to everyone at once just makes my day.
💡 Interested in trying the Cycle platform? Create your account today! Want to drop in and have a chat with the Cycle team? We'd love to have you join our public Cycle Slack community!