When users start to complain that they cannot reach their services and applications, you need to work out quickly what is causing the issue. You can speed this up dramatically if you get good information about the fault: how many users are affected, can they get to another application on the same server, have there been any changes to the network? But it’s also vital that you go about your troubleshooting in an orderly manner, to help you pinpoint the problem area fast. This article concentrates on IP-specific troubleshooting, with reference to some easy-to-use tools, such as ping and traceroute.

In an IP-based network, the sensible thing to do is start at the physical layer and work your way up and out. Don’t waste your time testing higher-layer application connectivity across a WAN to a server until you’re sure it’s physically reachable from its local LAN.

Test Methodology

  • Check physical connectivity
  • Verify connectivity from local LAN
  • Carry out layer 3 network testing
  • Prove name resolution
  • Test application layers

Be systematic: ping the server from the LAN it’s on. If that doesn’t work, look at the local setup. If it does, try pinging it from the LAN the complaining users are on. This will indicate whether the problem is local or lies in the network in between. Can you ping by both IP address and name? If the address works but the name doesn’t, you may have a DNS problem. If all your pings work, then you’ll need to move higher and look at access lists and filters, MTU sizes and application errors. Each step is detailed further below.
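
The ordering above can be sketched in a few lines of Python. This is only an illustration of the "lowest layer first" idea: the check names and the `first_failure` helper are made up for this example, and the pass/fail results are assumed to come from whatever tests you have actually run.

```python
from typing import Optional

# Ordered from the lowest layer upwards, mirroring the methodology list.
CHECK_ORDER = [
    "physical",
    "local_lan",
    "network",
    "name_resolution",
    "application",
]


def first_failure(results: dict[str, bool]) -> Optional[str]:
    """Return the lowest check that has failed, or None if all passed.

    A check with no recorded result counts as failed, so the function
    never skips past a layer that has not been tested yet.
    """
    for check in CHECK_ORDER:
        if not results.get(check, False):
            return check
    return None
```

Calling `first_failure({"physical": True, "local_lan": True, "network": False})` points you at the network layer, whatever the higher-layer results say.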

Physical Connectivity
Users are complaining they cannot get to an application residing on a server. Look at the switch port that the server is connected to. Is there a physical connection? Is the port up? Are you seeing a high error rate on that port? Speed or duplex mismatches and faulty NICs should be easy to spot.

Local LAN
Verify that the local IP configuration is correct. Is the address statically or dynamically defined? If static, make sure it is the correct address, and has the right subnet mask and default gateway assigned. Remember that you need to prove connectivity in both directions—you may find that all your users’ PCs can get to the server address fine, but it has the wrong router address configured and doesn’t know how to get back.
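
As a rough illustration of the static-address checks, the sketch below uses Python's standard `ipaddress` module to confirm that a default gateway sits inside the host's own subnet. The function name and the example addresses are assumptions for this example only.

```python
import ipaddress


def check_static_config(address: str, netmask: str, gateway: str) -> list[str]:
    """Return a list of problems found in a static IPv4 configuration."""
    problems = []
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    network = iface.network
    gw = ipaddress.ip_address(gateway)

    # The default gateway must be reachable directly, i.e. on the same subnet.
    if gw not in network:
        problems.append(f"gateway {gw} is outside {network}")
    if gw == iface.ip:
        problems.append("gateway is the same as the host address")
    # On ordinary subnets the first and last addresses are not usable by hosts.
    if network.prefixlen < 31 and iface.ip in (
        network.network_address,
        network.broadcast_address,
    ):
        problems.append(f"{iface.ip} is the network or broadcast address of {network}")
    return problems
```

An empty list means the three values are at least consistent with each other. It does not prove they are the right values for that host.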

Winipcfg, or ipconfig from a DOS prompt, depending on the platform, will tell you what you want to know, and let you release and renew leases to make sure things are working properly.

If you’re using DHCP, check that the leases are valid. If users can’t get DHCP addresses, are the scopes exhausted? If the DHCP server is on the other side of a router, you’ll need an IP helper address configured into your routers to allow the requests through. Make sure you don’t have duplicate IP addresses for your server, router or client machines: either switch off or unplug the device you’re interested in, and if you can still ping its address, then something else is using it. If you can’t disconnect the device, check ARP caches to make sure that the MAC address is the one it should be.
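
The ARP check can be sketched as a comparison between what a cache currently holds and what you expect. The sketch assumes you have already copied the cache entries into a dictionary of IP address to MAC address. The function names and the addresses in the usage note are illustrative.

```python
def normalise_mac(mac: str) -> str:
    """Reduce a MAC address to 12 lowercase hex digits, whatever the separator."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def arp_mismatches(
    arp_cache: dict[str, str], expected: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return {ip: (expected_mac, seen_mac)} for entries that disagree.

    Addresses missing from the cache are ignored: no entry is not
    evidence of a duplicate.
    """
    mismatches = {}
    for ip, want in expected.items():
        seen = arp_cache.get(ip)
        if seen is not None and normalise_mac(seen) != normalise_mac(want):
            mismatches[ip] = (want, seen)
    return mismatches
```

Normalising first means that `00-11-22-AA-BB-CC`, `00:11:22:aa:bb:cc` and `0011.22aa.bbcc` are treated as the same address, which is useful when the cache comes from one platform and your records from another.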

Make sure that the host can get to its default router. If you’re running VRRP or HSRP, is it functioning properly? Try both the virtual and real IP addresses. Make sure both routers are set up consistently; it wouldn’t be the first time that a standby router has come online only for it to emerge that a config change made to the primary router was never applied to the standby.

Network Testing
If everything locally is fine, then you may have a routing issue. Start at one end and make sure that you have routes to both source and destination. Check CPU utilisation on the routers for rogue processes that indicate a problem and are impacting traffic.
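
To make the "routes in both directions" point concrete, here is a small sketch of a longest-prefix-match lookup using Python's `ipaddress` module. The routing tables are assumed to have been copied by hand into dictionaries of prefix to next hop. The function names and all addresses are illustrative, not taken from any real router.

```python
import ipaddress
from typing import Optional


def lookup(routes: dict[str, str], destination: str) -> Optional[str]:
    """Return the next hop of the most specific route covering destination."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        # Longest prefix wins, as it does in a real routing table.
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None


def has_return_path(
    routes_near_source: dict[str, str],
    routes_near_dest: dict[str, str],
    source: str,
    destination: str,
) -> bool:
    """True only if each end has some route towards the other."""
    forward = lookup(routes_near_source, destination)
    reverse = lookup(routes_near_dest, source)
    return forward is not None and reverse is not None
```

A router with a default route will always return something for the forward lookup, so a missing reverse route at the far end is the case this check is most likely to catch.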

This is probably where you’ll start to use traceroute, rather than ping, as this will help you highlight the problem area faster than doing a step-by-step check, particularly if you have a relatively large or complex routed environment. Then use ping to zoom in specifically on that part of the network.

A common symptom of a routing loop or blackhole is that every second ping to a destination succeeds while the alternate ones fail. This generally indicates that the router believes it has two paths to the destination and is trying to load balance across them. If only one of those paths is valid, you’ll lose half your traffic.
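
If you record ping results as a list of successes and failures, the alternating pattern is easy to test for. This is a minimal sketch: the function name and the minimum of six probes are arbitrary choices for this example.

```python
def looks_like_alternating_loss(replies: list[bool], min_probes: int = 6) -> bool:
    """True if every reply differs from the one before it.

    replies holds one entry per ping, True for a reply and False for a
    timeout. Short runs are rejected because two or three probes can
    alternate by chance.
    """
    if len(replies) < min_probes:
        return False
    return all(a != b for a, b in zip(replies, replies[1:]))
```

A strict alternation is only a hint. Per-flow load balancing or random loss will not produce such a clean pattern, so treat a True result as a reason to look at the routing table, not as proof.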

Name Resolution
For every connection you can make using an IP address, confirm that it still works if you use the host name. Use nslookup to resolve names—this also allows you to choose the name servers, so you can test them all. It may be that your router is using one, while client PCs use another, which could give inconsistent test results. If you can’t resolve a name, make sure the DNS server address you’re using is correct and reachable before you rush off to tell the server admins that they’ve not configured names properly.
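
One way to spot name servers that disagree is to note the addresses each one returns for each name and compare them. The sketch below assumes you have gathered those answers yourself, for example with nslookup against each server in turn. The function name, host names and server labels are hypothetical.

```python
def inconsistent_names(
    answers: dict[str, dict[str, set[str]]]
) -> dict[str, dict[str, set[str]]]:
    """Return only the names for which the servers gave different answers.

    answers maps host name -> {name server: set of addresses returned}.
    An empty set stands for "could not resolve".
    """
    report = {}
    for name, per_server in answers.items():
        distinct = {frozenset(addrs) for addrs in per_server.values()}
        if len(distinct) > 1:
            report[name] = per_server
    return report
```

A name that resolves on one server and not on another appears in the report, which is the situation where a router and a client PC would give you conflicting test results.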

Application Testing
If you’re satisfied that you have IP connectivity across your infrastructure, then it’s time to look at the application itself. This may now require the use of a packet analyser to actually see what the application is sending out.

Find out the port numbers used and make sure there aren’t any filters in place blocking those specific ports. If the traffic has to cross different media types, chances are that different MTU sizes will be enforced along the path. Check whether the application sets the Don’t Fragment (DF) bit: if a router interface further down the path has a lower MTU than the originating LAN, packets larger than that MTU will be discarded rather than fragmented. The application then seems to start up, but as soon as it has to pass a large screenful of data, for instance, it sends packets that are too big to be carried end to end without fragmentation, and it appears to hang as the router throws that data away. Pings with large payloads won’t necessarily reveal this, since the DF bit is typically not set on them, so the routers fragment the packets and the destination host reassembles them quite happily.
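
The arithmetic behind the MTU problem is simple enough to sketch. The figures assume IPv4 with a 20-byte header (no IP options) and the 8-byte ICMP header. The function names and the list of link MTUs are made up for the example.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def path_mtu(link_mtus: list[int]) -> int:
    """The smallest MTU along the path limits the whole path."""
    return min(link_mtus)


def will_be_dropped(packet_size: int, link_mtus: list[int], df_set: bool) -> bool:
    """A packet is discarded only if it is too big AND may not be fragmented."""
    return df_set and packet_size > path_mtu(link_mtus)
```

On a 1500-byte Ethernet MTU, `max_ping_payload(1500)` gives 1472, so a ping with the DF bit set and a 1472-byte payload is the largest that should get through. If one hop in the path has an MTU of, say, 1476, the limit falls to 1448 and anything larger sent with DF set is lost.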

You will also have to watch out for applications that embed IP addresses within the data as well as using them in the IP header. If you’re running Network Address Translation (NAT), chances are this won’t work, because NAT rewrites the header but normally leaves the data alone. Some apps use hard-coded addresses in the data, so if you change a server address, there’s a mismatch even though the source and destination addresses look right in the header. Be aware that there’s unlikely to be a quick fix for this, since the application code will need to be rewritten. If the problem has only arisen after a host re-addressing change, the only workaround may be to back the change out completely.
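
If you have captured a packet with an analyser, a crude way to look for an embedded address is to search the payload for it, both as dotted-decimal text and as four raw bytes. This is a sketch only: the function name is hypothetical, and it assumes you have already extracted the application payload as bytes.

```python
import ipaddress


def find_embedded_address(payload: bytes, address: str) -> list[tuple[int, str]]:
    """Return (offset, encoding) for each place the IPv4 address appears."""
    ip = ipaddress.IPv4Address(address)
    patterns = {
        "text": str(ip).encode("ascii"),  # e.g. b"192.0.2.10"
        "binary": ip.packed,              # the same address as 4 raw bytes
    }
    hits = []
    for encoding, needle in patterns.items():
        start = payload.find(needle)
        while start != -1:
            hits.append((start, encoding))
            start = payload.find(needle, start + 1)
    return sorted(hits)
```

Expect false positives: four arbitrary bytes can match by coincidence, and the text form of 192.0.2.10 also matches inside 192.0.2.100. A hit is a prompt to look more closely at that part of the capture, not a diagnosis.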

IP tools
The main troubleshooting tools you will use are ping and traceroute. When initiating a ping, by default, if the host has multiple interfaces, the ping will be sent from the one ‘nearest’ the destination. Be careful when sending a ping from a router because your network may be set up to treat router-originated traffic differently from user traffic, with access lists, for instance.

Extended ping commands, depending on router vendor, allow you various options, including the ability to select your origin IP address (within reason), change packet sizes, set ToS values and set the DF bit, as shown in the output below, taken from a Cisco MSFC.

Protocol [ip]:
Target IP address:
Repeat count [5]:
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface:
Type of service [0]:
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to, timeout is 2 seconds:

If you are using a Windows client, you get various options too. By default, the ICMP packet size that is sent out is small (32 bytes). Using the -l (size) option you can send out larger packets when you need to. This can be useful in determining if you are having a packet size problem across routers. The available options for the Ping utility with Windows are as follows:

-t : Ping the specified host until interrupted.
-a : Resolve addresses to hostnames.
-n count : Number of echo requests to send.
-l size : Send buffer size.
-f : Set Don't Fragment flag in packet.
-i TTL : Time To Live.
-v TOS : Type Of Service.
-r count : Record route for count hops.
-s count : Timestamp for count hops.
-w timeout : Timeout in milliseconds.
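
Rather than guessing at sizes, you can search for the largest payload that gets through with the Don't Fragment flag set. The sketch below is a plain binary search. The `probe` argument is an assumption of this example: a function you supply that sends one ping of the given size (for instance by running ping with -f and -l) and returns True if a reply came back.

```python
from typing import Callable, Optional


def largest_passing_size(
    probe: Callable[[int], bool], low: int = 32, high: int = 1472
) -> Optional[int]:
    """Binary-search the largest payload size for which probe(size) succeeds.

    Assumes that if one size passes, every smaller size passes too.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

The defaults run from the 32-byte Windows default up to 1472 bytes, the largest payload that fits a 1500-byte MTU once the IP and ICMP headers are added. Adjust `high` if your LAN uses a different MTU.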

Traceroute shows you the hop-by-hop path taken by a data packet, by sending out UDP messages or ICMP echo requests with incrementing TTL values, starting at 1. The first packet will be discarded at the first hop along the path, since its TTL will have expired there, and the first-hop router will return an error message to that effect. By increasing TTL values, traceroute learns the path along which data will flow to get to a particular destination address. Typically multiple packets will be sent for each TTL value, to learn about multiple paths in cases where routing protocols are performing load balancing.
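
The TTL mechanism can be illustrated with a small simulation that sends nothing on the wire. It assumes a single fixed path in which every router answers. A real traceroute sends several probes per TTL and has to cope with hops that stay silent. The function name and addresses are illustrative.

```python
def simulate_traceroute(path: list[str], max_hops: int = 30) -> list[tuple[int, str, str]]:
    """Walk a known path the way traceroute discovers an unknown one.

    path lists the router addresses in order, ending with the destination.
    Returns one (ttl, responder, result) tuple per probe sent.
    """
    results = []
    for ttl in range(1, min(max_hops, len(path)) + 1):
        # Each hop decrements the TTL by one, so a probe sent with this
        # TTL expires at (or arrives at) the ttl-th address on the path.
        responder = path[ttl - 1]
        reached = ttl == len(path)
        results.append(
            (ttl, responder, "destination reached" if reached else "time exceeded")
        )
    return results
```

With a three-entry path, the probes with TTL 1 and 2 draw "time exceeded" from the two routers and the probe with TTL 3 reaches the destination, which is the sequence traceroute prints as its hop list.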

The DOS version is tracert, and again you get options to help customize what you actually want to test for.

The syntax is:

tracert [-d] [-h MaximumHops] [-j HostList] [-w Timeout] [TargetName]

With the available options being:

-d : Prevents tracert from attempting to resolve the IP addresses of intermediate routers to their names. This can speed up the display of tracert results.
-h MaximumHops : Specifies the maximum number of hops in the path to search for the target (destination). The default is 30 hops.
-j HostList : Specifies that Echo Request messages use the Loose Source Route option in the IP header with the set of intermediate destinations specified in HostList. With loose source routing, successive intermediate destinations can be separated by one or multiple routers. The maximum number of addresses or names in the host list is 9. The HostList is a series of IP addresses (in dotted decimal notation) separated by spaces.
-w Timeout : Specifies the amount of time in milliseconds to wait for the ICMP Time Exceeded or Echo Reply message corresponding to a given Echo Request message to be received. If not received within the time-out, an asterisk (*) is displayed. The default time-out is 4000 (4 seconds).
TargetName : Specifies the destination, identified either by IP address or host name.
-? : Displays help at the command prompt.
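
If you run tracert from a script, it helps to build the command from named settings instead of remembering the switches. This is a minimal sketch covering only the -d, -h and -w options listed above. The function name is hypothetical and the -j option is deliberately left out.

```python
from typing import Optional


def build_tracert_command(
    target: str,
    resolve_names: bool = True,
    max_hops: Optional[int] = None,
    timeout_ms: Optional[int] = None,
) -> list[str]:
    """Return a tracert command line as an argument list."""
    cmd = ["tracert"]
    if not resolve_names:
        cmd.append("-d")  # skip reverse lookups, which speeds up the display
    if max_hops is not None:
        cmd += ["-h", str(max_hops)]
    if timeout_ms is not None:
        cmd += ["-w", str(timeout_ms)]
    cmd.append(target)
    return cmd
```

The returned list can be passed to `subprocess.run` on a Windows machine. Leaving `max_hops` and `timeout_ms` unset keeps tracert's own defaults of 30 hops and 4000 milliseconds.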