A company I was working with recently ran into a problem with one of their remote branches, connected back to head office via the Internet. They’d set it up properly, with IPSec to ensure security, and everything was working fine.

Then one day an application stopped working. Or at least parts of it did. Most things continued to work normally, but certain operations within one application started failing. The application was accessing a server back at head office, and all the other branches, using the same desktops and services, were carrying on as normal.

This was a new application, written in-house, so the developers should have had a pretty good idea where problems might lie. Their opinion, though, was that since it looked to be something specific to that branch, it had to be a network problem.

Connectivity looked fine, routing tables were normal and communication between the desktops and that same server worked for other parts of the application. So out with the sniffers.

Which is when it was found (after a fun-packed time looking through network traces) that the packets between server and client for this part of the application, which transferred large amounts of data back to the client, were larger than those for the other functions, which were primarily single-record data inputs.

And for a reason nobody, least of all the application developers, could explain, the Don’t Fragment bit had been set in the IP header.
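
To give an idea of the kind of check involved, here is a rough Python sketch that tallies packet sizes with and without the Don’t Fragment flag in a capture. It assumes the scapy library is available, and the capture file name is made up for illustration:

    # Illustrative only: tally packet sizes with and without the Don't
    # Fragment flag in a capture. Assumes scapy is installed; the capture
    # file name 'branch_trace.pcap' is made up.
    from collections import Counter
    from scapy.all import IP, rdpcap

    sizes_df, sizes_nodf = Counter(), Counter()
    for pkt in rdpcap("branch_trace.pcap"):
        if IP in pkt:
            ip = pkt[IP]
            if ip.flags.DF:                    # Don't Fragment bit set
                sizes_df[ip.len] += 1
            else:
                sizes_nodf[ip.len] += 1

    print("Sizes with DF set:   ", sizes_df.most_common(5))
    print("Sizes without DF set:", sizes_nodf.most_common(5))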

So why had this worked before and not now? What should happen is that the two hosts that want to communicate negotiate a TCP maximum segment size (MSS) to use. If it turns out that the MTU of a link along the path between them isn’t large enough, the router on that link will fragment the packet transparently.
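
As a rough illustration of that transparent fragmentation (Python again, using scapy; the address and sizes are made up), scapy’s fragment() does in software much what a router does to an oversized packet when the Don’t Fragment bit is clear:

    # Made-up example: a 3000-byte payload split as a router with a
    # 1500-byte MTU would split it (1480 bytes of IP payload per fragment,
    # leaving room for the 20-byte IP header).
    from scapy.all import IP, UDP, Raw, fragment

    big = IP(dst="192.0.2.10") / UDP(dport=5000) / Raw(b"x" * 3000)
    for f in fragment(big, fragsize=1480):
        print(f"length={len(f)}  offset={f.frag * 8}  more_fragments={bool(f.flags.MF)}")

The receiving host reassembles the fragments, and neither end of the TCP connection is any the wiser.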

If, however, the packet has the Don’t Fragment bit set, the router should (and most do, even on the Internet) send back an ICMP Type 3 ‘Destination Unreachable’, Code 4 ‘Fragmentation Needed and Don’t Fragment Set’ error to the sending host, at which point most systems will reduce the MSS and everything will spring into life again.

In this particular instance, however, this couldn’t happen, as the router at head office was filtering out all ICMP traffic, so that error never got through. And the reason the problem had suddenly appeared was that the path through the Internet from one site to the other had changed due to some routing change, presumably taking in a link with a smaller MTU - completely invisible to the network ops team, as they just monitored the state of the VPN tunnel, not its hop-by-hop progress through the Internet.
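
A quick way to confirm that sort of black hole from the branch end is to probe the path by hand. Here is a rough sketch (Python, wrapping the Linux iputils ping, whose -M do option forbids fragmentation; the target address is made up) that searches for the largest packet that gets across with the Don’t Fragment bit set - if the ‘Fragmentation Needed’ errors are being filtered, oversized probes simply time out rather than failing straight away:

    # Rough path-MTU probe: binary-search the largest ICMP echo payload
    # that crosses the path with the Don't Fragment bit set. Relies on the
    # Linux iputils ping; the target address is made up.
    import subprocess

    TARGET = "203.0.113.5"          # hypothetical head-office server

    def ping_df(payload: int) -> bool:
        """One ping with DF set and the given payload size; True if a reply came back."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), TARGET],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0

    low, high = 0, 1472             # 1472-byte payload + 28 bytes of headers = 1500
    while low < high:
        mid = (low + high + 1) // 2
        if ping_df(mid):
            low = mid
        else:
            high = mid - 1

    print(f"Largest DF payload delivered: {low} bytes (path MTU about {low + 28})")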

If this had happened over a private WAN, it might have been easier to spot, since a routing topology change would have been noticed to coincide with the problem, giving some pointers on where to start looking. As it was, the application’s MSS was lowered and the problem was cured.
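
For what it’s worth, that sort of fix can be a one-liner at the socket level. A minimal sketch (Python, using the Linux-specific TCP_MAXSEG socket option; the server address and port are made up) of an application clamping the MSS it advertises:

    # Minimal sketch of clamping TCP MSS from the application, as a
    # workaround for a path that black-holes large DF packets. TCP_MAXSEG
    # is Linux-specific; the address and port are made up.
    import socket

    SERVER = ("203.0.113.5", 8443)   # hypothetical head-office service
    SAFE_MSS = 1200                  # comfortably below any plausible path MTU

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before connect() so the reduced MSS is advertised in the SYN.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, SAFE_MSS)
    sock.connect(SERVER)
    print("Effective MSS:", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG))
    sock.close()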

The company is still deciding whether to rewrite some access-list rules to let specific ICMP messages through. Even if they do, there is still the chance that they will one day come up against a router that doesn’t issue these ICMP messages at all, since it’s not mandatory that they do - and as security tightens, who knows what helpful error messages service providers may remove next.