Timing is everything. Okay, there are maybe a few ways you can interpret that statement, but I’m talking about network timing here. Remember when you just set clocks locally on network devices, so they had a rough idea what day it was? And when it wasn’t the end of the world when they rebooted, came back up thinking it was 1992, and nobody bothered to reset them?

Of course, that isn’t good enough now. We need devices to have accurate times in their logs so we can correlate events across them. Routing protocols can rotate authentication keys on a schedule, access-lists can allow or restrict traffic based on the day of the week or time of day, and the certificates we use to secure the likes of VPNs are only valid between specific dates, so the devices checking them need to know what time it is. And they all need to agree.

NTP, the Network Time Protocol, was designed to let us synchronise our servers, routers, switches and practically anything else you care to mention to a ‘master’ clock source. If you don’t happen to have your own atomic clock handy, you can get a timing feed from your Service Provider or, if you’re stuck, use the clock on something like a Cisco Catalyst 6500, which won’t reset when it reboots. It’s not as accurate, of course, but at least everything in the network will agree, which is half the battle.

You would think, since NTP has been around in one form or another for 20 years, that we’d be able to get it right by now. Depressingly that’s not always the case.

As a quick recap, NTP starts from an authoritative time source and assigns each device a stratum level that indicates how far away from that source it is. NTP messages can be cascaded throughout your network, and the stratum levels let each device take its time from the ‘best’ (i.e. lowest-stratum) source available, or at least from the updates that have taken the most direct path and are therefore considered the most accurate.
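
To make the stratum idea concrete, here is a minimal Python sketch under a deliberately simplified model: each candidate source advertises a stratum, the device follows the lowest one, and its own stratum becomes that value plus one. The TimeSource class, the choose_source function and the source names are hypothetical; real NTP implementations (the RFC 5905 selection and clustering algorithms) also weigh delay, dispersion and jitter rather than looking at stratum alone.

```python
from dataclasses import dataclass

UNSYNCHRONISED = 16  # NTPv4 uses stratum 16 to mean "not synchronised"


@dataclass
class TimeSource:
    name: str     # hypothetical label for the peer
    stratum: int  # stratum level the peer advertises


def choose_source(sources):
    """Return (best_source, our_stratum), or (None, UNSYNCHRONISED) if nothing is usable.

    Simplified model: pick purely on the lowest advertised stratum.
    """
    usable = [s for s in sources if s.stratum < UNSYNCHRONISED]
    if not usable:
        return None, UNSYNCHRONISED
    best = min(usable, key=lambda s: s.stratum)
    # We sit one hop further from the reference clock than the source we follow.
    return best, min(best.stratum + 1, UNSYNCHRONISED)


sources = [
    TimeSource("provider-feed", 2),
    TimeSource("core-switch", 3),
    TimeSource("broken-box", 16),
]
best, ours = choose_source(sources)
print(best.name, ours)  # provider-feed 3
```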

And that’s where you might run into problems. NTP works either by broadcasting timing information out or, more securely, by forming a relationship between NTP clients and servers, whereby the clients send requests to specific time sources (and those exchanges can be authenticated to stop someone feeding erroneous timing information into the network). Clients know not to accept information from sources with a worse stratum level than their own, and sources know not to bother sending updates to clients with a better stratum level than theirs.
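
The stratum rule boils down to a single comparison. The Python sketch below assumes a literal reading of it: a source is only useful if its stratum is strictly lower than the client's, so an equal stratum doesn't count as better. The function names and the {name: stratum} map are illustrative only, not taken from any NTP implementation.

```python
def is_acceptable(client_stratum, source_stratum):
    """Simplified stratum check: only follow a source closer to the reference clock.

    The same comparison covers both directions of the rule: the client ignores
    worse sources, and a source has nothing to offer a client that already
    claims a better (lower) stratum.
    """
    return source_stratum < client_stratum


def candidate_sources(client_stratum, advertised):
    """Filter a hypothetical {name: stratum} map down to the sources this client would use."""
    return [name for name, stratum in advertised.items()
            if is_acceptable(client_stratum, stratum)]


advertised = {"provider-feed": 1, "core-switch": 2, "access-switch": 3}
print(candidate_sources(3, advertised))  # ['provider-feed', 'core-switch']
print(candidate_sources(0, advertised))  # []
```

The second call is the trap: a host that insists it is already at stratum 0 has no source it considers better, so it never takes an update from anyone.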

Some Unix/Linux servers get delusions of grandeur and default to NTP stratum level 0 if they’re not set up properly, which, let’s be honest, is as authoritative as you can get. Why anyone thought that was a good default, I have no idea, but no NTP source is going to try to update them, so NTP just won’t work. Not a lot of people know this, including a fair few server administrators I had dealings with a little while ago, who were complaining loudly that the NTP source wasn’t working, that the network wasn’t routing properly or that their traffic was being blocked by the network. In other words, that it was the network’s fault. A judicious capture of a few packets coming out of the servers let us identify the issue quickly. And I did try not to be too smug about it.
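
In a capture, the stratum is easy to spot because it sits in the second byte of the NTP header (RFC 5905 layout: leap indicator, version and mode packed into the first byte, followed by stratum, poll and precision). Here is a minimal Python sketch that decodes those fields from a raw UDP payload you have already pulled out of a capture; the parse_ntp_header function and the sample packet are made up for illustration. For what it’s worth, RFC 5905 defines a stratum of 0 on the wire as "unspecified or invalid", so the sketch flags it.

```python
import struct

# Mode values from the NTP header (RFC 5905); 0 is reserved.
MODES = {1: "symmetric active", 2: "symmetric passive", 3: "client",
         4: "server", 5: "broadcast", 6: "control", 7: "private"}


def parse_ntp_header(payload):
    """Decode the leading fields of an NTP packet taken from a capture."""
    if len(payload) < 48:
        raise ValueError("an NTP packet has at least a 48-byte header")
    # Byte 0: LI (2 bits) | VN (3 bits) | Mode (3 bits); then stratum (unsigned),
    # poll and precision (both signed, expressed as log2 seconds).
    first, stratum, poll, precision = struct.unpack("!BBbb", payload[:4])
    return {
        "leap": first >> 6,
        "version": (first >> 3) & 0x07,
        "mode": MODES.get(first & 0x07, "reserved"),
        "stratum": stratum,
        "poll_log2": poll,
        "precision_log2": precision,
    }


# Hypothetical client request: version 4, mode 3, claiming stratum 0.
request = bytes([0x23, 0, 6, 0xEC]) + bytes(44)
header = parse_ntp_header(request)
print(header["version"], header["mode"], header["stratum"])  # 4 client 0
if header["stratum"] == 0:
    print("stratum 0 on the wire: unspecified/invalid per RFC 5905")
```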

By the way, NTP can synchronise network devices to within a millisecond, typically needing just a packet or two per minute to do so. But if that’s not good enough for your fast-moving environment, fear not—the IEEE has recently revamped its standard for a “Precision Clock Synchronization Protocol for Networked Measurement and Control Systems”, or PTP (Precision Time Protocol) to you and me, which will allow nanosecond accuracy. Network kit is now starting to support this. You have to wonder just how accurate we need to be.
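
A packet or two goes a long way because each client/server exchange carries four timestamps: T1 when the client sends its request, T2 when the server receives it, T3 when the server replies and T4 when the reply arrives back at the client. RFC 5905 turns these into a clock offset and a round-trip delay. The Python sketch below applies those two formulas to made-up millisecond timestamps; it assumes the network path is symmetric, and any difference between the outbound and return legs turns directly into offset error.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Clock offset and round-trip delay from one NTP exchange (RFC 5905 formulas).

    t1: client transmit, t2: server receive, t3: server transmit, t4: client receive.
    A positive offset means the client clock is behind the server.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Hypothetical exchange in milliseconds: the client runs 250 ms slow,
# each leg of the path takes 10 ms and the server spends 1 ms replying.
print(offset_and_delay(100_000, 100_260, 100_261, 100_021))  # (250.0, 20)
```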