I’ve just had a bit of a run-in with some application developers. I know; it’s happened before, and will probably happen again. But do they have to keep breaking my network?
Couple of weeks ago, we got a call from one of the boss-men’s PAs. The network was so slow that nothing was working. Priority call, obviously—the boss-men may be important, but they’re nothing compared to the PAs.
When we had a look at the switch, the CPU utilisation was way above normal. It’s a Catalyst 6509 with Sup720—it barely draws breath most of the time, but it really wasn’t happy today. A bit more digging and we were seeing higher interface usage than usual too, but that in itself shouldn’t have been causing a problem.
Until we had a look at the traffic in a bit more detail. And found an awful lot of multicast traffic that hadn’t been there the week before. Strange—nobody had mentioned any new applications being rolled out over the weekend that would account for this, but again, the network should be able to cope with multicast traffic no problem.
Except that this multicast traffic was destined to 224.0.0.5. And there was lots of it. And since that’s the destination address for ‘All OSPF Routers’ and our Catalysts are Layer 3 devices running OSPF, it meant that the switch itself was having to process every one of these packets. Hence the CPU going through the roof.
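For anyone who wants to sanity-check a multicast group before pointing a test application at it, here is a minimal Python sketch. The function name `check_group` and the lookup table are my own illustration, not anything from the developer's application; the table lists only a handful of well-known entries from the reserved 224.0.0.0/24 block, not the full registry.

```python
import ipaddress

# 224.0.0.0/24 is the block reserved for local network control traffic.
LOCAL_NETWORK_CONTROL = ipaddress.ip_network("224.0.0.0/24")

# A few well-known groups in that block (illustrative, not exhaustive).
WELL_KNOWN = {
    "224.0.0.1": "All Hosts",
    "224.0.0.2": "All Routers",
    "224.0.0.5": "All OSPF Routers",
    "224.0.0.6": "All OSPF Designated Routers",
}


def check_group(addr: str) -> str:
    """Say whether addr falls inside the reserved control block."""
    ip = ipaddress.ip_address(addr)
    if not ip.is_multicast:
        return "not multicast"
    if ip in LOCAL_NETWORK_CONTROL:
        name = WELL_KNOWN.get(addr, "reserved control address")
        return f"reserved: {name}"
    return "outside 224.0.0.0/24"


if __name__ == "__main__":
    for candidate in ("224.0.0.1", "224.0.0.5", "239.1.1.1"):
        print(candidate, "->", check_group(candidate))
```

Passing this check does not make an address safe to use on a live network; it only rules out the block that routers and switches listen to themselves.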
But the network was stable (apart from this problem)—there should have been very little OSPF traffic. So what was this?
Everything was coming from one source address, so it wasn’t difficult to find out where it was plugged in and disable the port. Traffic and switch CPU returned to normal. I went for a walk.
When I got to the floor point that connected to that port, I found a group of... yes, application developers. And one of them was looking particularly unhappy and staring at a screen that didn’t seem to be doing much. Bingo.
After an open and frank discussion, I found out that he was having a look at a new application they’d been playing with, and since it was pretty trivial and wasn’t going to cause any problems, he had decided it wasn’t worth going all the way down one flight of stairs to the test lab they had, but he would just run it on his own machine.
He was sending out a multicast stream, but it was okay, he knew what he was doing so didn’t pick anything stupid like 224.0.0.1 or 224.0.0.2 since he knew they were for all hosts and all routers. So he used 224.0.0.5.
He wasn’t sending out much traffic anyway, as he had set his application to send out one of these packets every three seconds. Oh hang on; maybe that parameter was milliseconds rather than seconds.
There’s not a jury in the world would convict me, is there? The annoying thing is that there’s actually not much I could have done to stop this problem happening. Port security wouldn’t help—we limit the number of MAC addresses allowed on a port to stop hubs being plugged in or CAM table attacks, but that wouldn’t have done any good here. Access Lists? Theoretically, but where do you draw the line at what you block?
Broadcast suppression (okay, multicast in this case) would have limited his traffic, and could even have shut the port down, so maybe we should start rolling that out more. We’ve never really thought we needed it before. In a way it’s lucky he did mess up his transmit time, and send packets out every three milliseconds instead of the three seconds he’d wanted, or we might not have noticed the traffic at all, and we’d still have a load of guys who thought it was okay to run test traffic over the live network.
They don’t any more.
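For the record, the seconds-versus-milliseconds slip is easy to put a number on. The sketch below is just the arithmetic, plus a toy threshold check in the spirit of multicast suppression; the 100 packets-per-second limit is a made-up figure for illustration, not a value we run or one Cisco recommends.

```python
def packets_per_second(interval_seconds: float) -> float:
    """Packet rate for a sender emitting one packet per interval."""
    return 1.0 / interval_seconds


def exceeds_limit(rate_pps: float, limit_pps: float) -> bool:
    """Toy stand-in for a suppression threshold on a port."""
    return rate_pps > limit_pps


intended = packets_per_second(3.0)    # one packet every three seconds
actual = packets_per_second(0.003)    # one packet every three milliseconds

# Hypothetical per-port limit, chosen only to illustrate the check.
HYPOTHETICAL_LIMIT_PPS = 100.0

if __name__ == "__main__":
    print(f"intended: {intended:.2f} pps")
    print(f"actual:   {actual:.2f} pps")
    print(f"ratio:    {actual / intended:.0f}x")
    print("intended trips limit:", exceeds_limit(intended, HYPOTHETICAL_LIMIT_PPS))
    print("actual trips limit:  ", exceeds_limit(actual, HYPOTHETICAL_LIMIT_PPS))
```

At the intended rate the stream would have stayed under any sensible threshold, which is exactly why we might never have spotted it.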