Exorcising network demons

2017-05-01 00:00:00 -0700

So for the last three weeks or thereabouts I’ve been getting weird networking issues on my Clevo P650RS laptop while plugged into Ethernet - occasional dropped packets and whatnot. Specifically, there was a peculiar series of symptoms I kept observing while making video calls over Ethernet from this computer: every now and then downlink would fall to a trickle, while upload would persist for a few seconds before dropping off as well.

For weeks I dismissed this as Comcast (the provider on the other end) being Comcast, even though people on the other end reported no connection issues with other tasks like video streaming.

Earlier this weekend I was working on something completely unrelated - hosting a server application generating heavy UDP traffic, using a VM running on my laptop - and I noticed something strange. While communicating with this server from my desktop computer, I noticed that the connection would cut out occasionally, sometimes for a few seconds, sometimes longer.

This seems oddly familiar, I thought.

I quickly ruled out excessive load locking up the VM by keeping an eye on it in virt-manager on the laptop during these outage events, and verified by monitoring top over ssh that it wasn’t related to the specific application, but actually the connection between VM and desktop. Watching ping from the laptop ruled out the VM itself and pointed at the laptop. Swapping Ethernet ports and cables between laptop and desktop ruled out the upstream switch.

At this point, signs were pointing to the laptop itself.

While I was fully prepared to test again at a different physical location, the whole situation was getting pretty unacceptable (ping from the laptop regularly reported >10% packet loss), so I dug deeper.
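For anyone who wants to track a loss figure like that over time, here is a minimal sketch that computes it from ping's summary line. It assumes the iputils-style wording ("N packets transmitted, M received"); the helper name `loss_percent` and the sample line are made up for illustration, not captured from the laptop in question.

```python
import re

# Matches the summary line printed by iputils ping, e.g.
# "20 packets transmitted, 17 received, 15% packet loss, time 19031ms".
# Assumption: other ping implementations may word this line differently.
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def loss_percent(ping_output: str) -> float:
    """Return packet loss as a percentage, computed from the summary counts."""
    match = SUMMARY_RE.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


if __name__ == "__main__":
    # Illustrative sample only, not real output from the laptop.
    sample = "20 packets transmitted, 17 received, 15% packet loss, time 19031ms"
    print(f"{loss_percent(sample):.1f}% packet loss")
```

The function recomputes the percentage from the raw counts rather than trusting the rounded figure that ping prints.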

The Clevo P650RS uses some variant of the Realtek RTL8111/8168 Ethernet adapter built into the system mainboard. (The device reports it is revision 12.) Historically, Linux support for this adapter has been spotty: the r8169 driver has been implicated in adapter problems for years, and the standard advice is to pull the r8168 driver from Realtek and install that. (It’s possible that this advice has also been outdated for years.)

So I dutifully installed r8168-dkms from universe and gave that a try.

Imagine my disappointment when this yielded no results whatsoever.

I combed the Internet for more evidence. Most of what I found was stuff I’d already tried. One post suggested that use_dac=1 is the magic ticket (it wasn’t). Another post suggested that connection renegotiation is to blame (it wasn’t).

Nearly out of ideas, I decided to gather more evidence. Since at this point packet loss events were reliably happening once every few minutes, I set up a ping and opened up Wireshark, my Ethernet monitor of choice. Then, I waited for a packet loss event.

Then I spotted something that turned everything I’d chased so far into a red herring - a curious annotation on otherwise-normal ARP traffic sent from my laptop.

Duplicate IP address detected for [address A] ([redacted]) - also in use by [redacted]

Wat.

One of the earliest networking troubleshooting tricks I learned as a child is refreshing the computer’s DHCP lease. In a nutshell, DHCP is a protocol by which network addresses are assigned - a central server provides “leases” to computers that request them, entitling them to use of a certain address for a certain period of time. One way to clear up some networking issues is to have your computer “release” its current lease and then request a new one - the central server may or may not give you the same address you had before, but the address you receive should be “clean”.

This laptop had received a lease for address A in the distant past. One curious property of DHCP is that, in addition to requesting any available address, computers can also ask for a specific address, and then receive it directly. So my laptop had been asking for address A - and apparently receiving it without controversy - for the last several weeks, and all the while, some other machine on the network had also been using address A.
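The conflict itself is easy to state: one IP address is being claimed by more than one hardware (MAC) address. As a rough sketch of that idea (not how Wireshark is actually implemented), the snippet below takes (sender IP, sender MAC) pairs, as might be read from ARP packets, and reports any IP claimed by more than one MAC. The function name `find_duplicate_ips` and all addresses are hypothetical, taken from documentation ranges.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_duplicate_ips(observations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each IP address claimed by more than one MAC to the MACs claiming it."""
    claims: Dict[str, Set[str]] = defaultdict(set)
    for ip, mac in observations:
        claims[ip].add(mac.lower())  # normalise case so one NIC is not counted twice
    return {ip: macs for ip, macs in claims.items() if len(macs) > 1}


if __name__ == "__main__":
    # Made-up (sender IP, sender MAC) pairs standing in for observed ARP traffic.
    seen = [
        ("192.0.2.10", "02:00:00:00:00:01"),
        ("192.0.2.20", "02:00:00:00:00:02"),
        ("192.0.2.10", "02:00:00:00:00:03"),  # second machine claiming the same IP
    ]
    for ip, macs in find_duplicate_ips(seen).items():
        print(f"Duplicate IP address detected for {ip}: {sorted(macs)}")
```

With two machines answering for the same address, traffic meant for one can end up at the other, which is consistent with the intermittent loss described above.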

Why? One can speculate. There are many ways this situation can arise, not all of which are accidental.

That aside, one DHCP renewal later, I had a nonconflicting IP address, address B, and my packet loss problems promptly vanished.

Why didn’t I try this before? Because the possibility of there being a DHCP issue hadn’t occurred to me in literally years. These days, DHCP usually does The Right Thing.

The moral: sometimes the simplest solutions are really the most useful.