Everyone knows that a switched (or bridged) Ethernet network must not contain any "loops" - instances where a packet could find itself being sent round and round between the same set of connecting devices. In business-critical networks, though, we would often like to be able to build our networks such that should a switch fail, the packets can still find a path from A to B until we fix the broken device - a desire which, by definition, we can't service unless we introduce a loop into the network.

For any given network (not necessarily a computer network - it could be a network of roads, for instance) a Spanning Tree is defined as the set of connections that provide paths, either directly or indirectly, from every node in the network to every other node in the network without there being a loop in the network. The Spanning Tree Protocol is the mechanism by which suitably enabled devices on a network (in our case probably an Ethernet LAN of some description) negotiate with each other in order to ensure that the properties of a spanning tree are satisfied. Note, at this point, that the means by which loops are avoided is for one or more devices on the network to disable one or more ports in order to prevent a loop forming.

Spanning Tree - the basics
Let's assume we have a selection of switches in the network, all of which support the Spanning Tree Protocol (STP). Whenever a device is connected, or disconnected, to a switch port, the device where the change occurs kicks off the STP negotiation across the network. The negotiation is achieved by the sending of Bridge Protocol Data Unit (BPDU) messages between the various devices - more about BPDUs later. Step one is for the devices to elect a "root" node, which will be regarded as the focal point for the purposes of the spanning tree. The root node will normally be the switch with the lowest ID number. The ID is generally an arbitrary number with the device's primary MAC address appended. All devices of the same make and model will have the same arbitrary ID number but, obviously, different MAC addresses, and so the node that gets chosen will simply be down to the (random) MAC address. You can, however, change the arbitrary ID and thus influence the decision in order to ensure that (say) the big, meaty Cisco 6500 at the middle of the network becomes the point of focus rather than some piddly little gadget in a remote data closet.

Once all the devices are clear on who's the root, each device will flag one of its ports as a "root port" - this will be the port that has the "least cost" to the root node. This "cost" is decided by examining the data in the BPDUs as they come in, because it's the BPDU that tells them the costs of the various paths.

An example This is best explained by a worked example. Imagine we have the following network:

A1------------1B 2 1Gbit/s 21Gbit/s | | |1Gbit/s | 1 1 C2------------2D 100Mbit/s

Let's imagine node A has been elected as the root node. Before we start discussing what happens in the BPDUs, we need to understand the concept of "cost" for a network link. Although in a WAN sense there is a tangible financial cost to any link, we're more interested in a "time cost" - a faster link has a lower cost than a slower one, because it takes less time to transfer a packet across the faster link. Taking this a step further, the cost of a 10Mbit/s link would be ten times the cost of a 100Mbit/s one, as it takes ten times as long to transfer data over the former than over the latter.

Although some devices use vendor-invented figures, the IEEE has, in fact, produced a table of link costs for all the various connection types we might encounter. The ones we're interested in are a 1Gbit/s link (cost 4) and a 100Mbit/s link (cost 19).

Okay, back to the spanning tree process. Node A sends BPDUs to B (from port 1) and C (from port 2), both with a cost of 0 (the receiver adds any cost values, not the sender). When each of B and C receives these messages, it adds the "path cost" of the incoming connection (in this case 4, as the connections are both 1Gbit/sec) to the "root path cost" (the total cost of getting to the root), which in this case is 0 as this is the first place each BPDU has arrived. Each time a BPDU arrives on a port in a switch, the switch adds the path cost for the connection it came in on. It passes a new BPDU out of each port except the one on which it came in.

Making the decision
Once each switch has received BPDUs on all ports, it means that each device now knows the cost of getting to the root node on each port, and can start making decisions about the spanning tree structure itself. This is done by flagging ports as "designated" ports based on a sequence of criteria. (A designated port is one that will be active in the eventual spanning tree). First, all ports on the root switch become designated ports (unless, for some reason, you've decided to link two ports on the root switch with a crossover cable, but we'll come back to that). Next, all root ports on non-root switches become designated ports.

Let's take stock of where we are so far. At present, we've designated ports A1 and A2 (because they're on the root switch) and ports B1 and C1 (because they're root ports). We've also designated D1, because it’s a root port. At present, though, because port B2 isn't designated, D1 doesn't have anything to communicate with at the other end of its link, so switch D is isolated at present.

This is where the third step comes in. For each network segment (and a segment in this case is a link between two switches), we make one port a "designated" port. The rules are simple: assuming all switches are agreed on where the root node is, the designated port for a segment is the one that has the lowest root path cost. In our example, B2 has a cost of 4 while D1 has a cost of 8, so B2 is flagged as "designated". Incidentally, had the costs been identical, the decision would have been made based on which was the lower of the IDs of the switches switches. If the switches' IDs were the same (i.e. it's the same switch - think back to our self-connected crossover cable) it'd have been based on the port number.

So our network is complete. The designated ports are enabled for packet forwarding and all others are disabled (except that they will continue to listen for BPDUs in case the STP starts over again). All that's left to do is to watch for changes (at which point a variation of the STP kicks in to re-evaluate the structure of the network) and for the root switch to send "hello" BPDUs around every few seconds to inform the network that it's still there and everything's as it was last time it checked.

Getting around delays
The main drawback of using STP is that the negotiation takes a while to complete. Every time a device connects to the network, the time for the STP calculation (bearing in mind that the system has to figure out whether the connection of the new device will introduce a loop) to take place is measured in tens of seconds. Most vendors include the ability to enable "fast start" on one or more ports in each switch as a get-out for these delays. If you can guarantee that a port will never be presented with a device that could cause a loop (e.g. you know that it's only ever going to have end-user PCs plugged into it) you'd flag this in the device configuration. The switch would then know not to go through the STP evaluation process should that port's status change. Generally speaking, you should only enable this feature when you can absolutely guarantee that the port won't be part of a loop.