A Weird Server Problem

At St. John’s, we’ve been limping along with a creaky set of infrastructure, workstations, and a server for a long time. Long enough that the server was still running Fedora 2 (!), which is very old. But it worked, and I understood it, and could maintain it with little trouble. Every once in a while a component would fail and be replaced, and the software would get updated.

Recently, there was a spate of problems that had to be addressed. I was in Muskogee a couple of weeks ago, and Internet access for St. John’s was very bad. I worked with the server and Cox (our ISP), and we determined that the cable modem was having problems. We bought a new one (see related post), and Ian and I got it replaced, and all seemed well. We ran about 10,000 pings in flood mode, and the link from the server to the modem was stable with no packet loss, so we went home. Until the next day…

I was on the way to SLC, and the school called, and there was still a problem with Internet access. From Dallas, I pinged the modem gateway and the server, and all seemed fine. When I got to SLC, I remoted in and noted a LOT of lost packets, 30-40%. Ian went up there that evening and found another network card and installed it, ran a ping test, had no lost packets, and he went home. All was OK for a week or so.

Then we started having another spate of problems; server reboots would cure them for a couple of hours or more. Finally, last Thursday, it got to the point where we were suffering 80-90% packet loss. I started some serious troubleshooting that evening. First I connected my laptop directly to the cable modem and pinged the hell out of it, then started downloading huge files. No problems found, so the fault was certainly on the St. John’s server side. The network cable was OK as well. I figured I had another bad or failing network card, so I replaced it with a donor card. Still lost packets after about 10 minutes of good operation. So I bought a new card (these are PCI). Same behavior. Now I figured I had a bad PCI slot, so I moved the card to another slot. Same behavior. WTH? So now I figured I had a failing motherboard. I had another machine that is a twin to the server, so I pulled the Fedora hard drive out and moved it to the other machine. It booted just fine and worked for network access, and I thought we were in good shape. Then, after about 10 minutes, I started seeing lost packets again!
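In hindsight, the short ping tests kept passing because the loss only showed up after roughly 10 minutes of good operation. Here is a minimal sketch of a longer-running check that pings in batches and reports the loss for each one. It assumes Python 3.7 or later and a `ping` that accepts `-c` and prints a Linux- or macOS-style summary line. The names `parse_loss` and `monitor` and the batch sizes are made up for illustration; this is not something that was run on the St. John’s server.

```python
import re
import subprocess
import time

# Matches the summary line printed by ping, for example:
#   "100 packets transmitted, 97 received, 3% packet loss, time 99123ms"
# The optional "packets " covers the macOS/BSD wording.
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_loss(ping_output):
    """Return packet loss as a percentage, or None if no summary line is found."""
    match = SUMMARY_RE.search(ping_output)
    if match is None:
        return None
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        return None
    return 100.0 * (sent - received) / sent


def monitor(host, rounds=30, count=60):
    """Ping `host` in batches of `count` and print the loss for each batch.

    At ping's default one-second interval, the defaults cover about
    30 minutes, long enough to get past a 10-minute "good" window.
    """
    for _ in range(rounds):
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            capture_output=True,
            text=True,
        )
        loss = parse_loss(result.stdout)
        stamp = time.strftime("%H:%M:%S")
        print(stamp, "no summary" if loss is None else f"{loss:.1f}% loss")
```

Calling something like `monitor("192.168.100.1")` (a placeholder address, substitute the modem or gateway) and leaving it running would show the point at which the loss starts, instead of relying on a single short test that ends before the problem appears.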

The only thing in common with all these problems was the hard drive with Fedora.

I still don’t know what the problem was. I wonder if there is some overflow related to the long time the server has been in service, or an intermittent disk error. A disk error would have to be hitting the logging system, since packet movement is entirely in memory. I will look at that later, maybe.

I’m in the process of building a new server. Most of it is working, and I will finish the rest of it today. But that’s the subject of another post…

