Strange server failures explained

Strange server service problems

This was one of the more bizarre failures I've seen in a while, worthy of its own article. It takes a strange and diverse collection of things that just don't work right all the time and gives them a single, unified explanation.

1: Over the last few years, I stopped using TCPView on my servers because it took too long to plot and had too many lines to display, which I always put down to the characteristics of a server with all the chatting it is doing everywhere all the time.

2: Every now and then a service or two that a particular server is providing just doesn't work right, and a reboot fixes the problem.

3: Recently, I had this funny feeling that my phone wasn't dinging me every time an email came in like it used to. It still received email when I asked it to, I just didn't remember it dinging on a regular basis like it did in the past. Didn't bother me enough to even think about looking into it.

There have been many other times when things just didn't start up on the server the way they should have. One time, for example, I couldn't VPN into my server from the outside, though I could get to the firewall via its VPN just fine. A reboot always fixed things, so I put it down to some random act of Microsoft server software flakiness and moved on to more pressing items.


Everything changed today!

I was going after a totally unrelated issue: publishing a web site that uses FrontPage Server Extensions with Microsoft Office SharePoint Designer 2007. I was having trouble getting the whole site to publish (which might become a whole different article!), so I checked into the server to see what the problem might be.

And I tripped on something ... something very interesting!

Event logs

I always check the event logs first. Remember, I was looking for something about FrontPage and IIS and publishing a web site to the server, and this is what I found:

Event Type: Error
Event Source: Server ActiveSync
Event Category: None
Event ID: 3015
Date: 10/26/2012
Time: 5:08:32 AM
User: N/A
Computer: SBSDUAL833A
Description:
IP-based AUTD failed to initialize because the processing of notifications could not be setup. Error code [0x80004005]. Verify that no other applications are currently bound to UDP port [2883], or try specifying a different port number.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Event Type: Error
Event Source: Server ActiveSync
Event Category: None
Event ID: 3024
Date: 10/26/2012
Time: 5:08:35 AM
User: N/A
Computer: SBSDUAL833A
Description:
IP-based AUTD failed to initialize. Error code: [0x80004005].

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Both event IDs 3015 and 3024 were repeating every 5 or 10 minutes, over and over again... all day, all night long.

Server ActiveSync is how Exchange communicates with mobile devices to tell them email has arrived.

The first event copied above makes WHAT is happening fairly obvious - some other service is sitting on UDP port 2883, which happens to be the magic UDP port that Server ActiveSync wants to grab onto, so it can't. Which is why my phone wasn't telling me when an email came in. But what could possibly grab it?

I'd completely forgotten that I can't really use TCPView on the servers because of how many lines it wants to display and how long the screen takes to update... and I was reminded of this problem when I tried it. It plotted so many lines that it couldn't keep up with the updates. I know this sounds unrelated, but you'll see in just a minute how it is all connected.

So back to the old stand-by, netstat!
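For anyone who wants to script this kind of lookup, here is a minimal Python sketch. It is only an illustration: the function name is made up, and the sample text is hypothetical, written in the column layout that "netstat -ano" prints on Windows (only PID 1692 on UDP 2883 comes from my server).

```python
def find_udp_owner(netstat_text, port):
    """Return the PID bound to the given UDP port, or None if nobody holds it."""
    for line in netstat_text.splitlines():
        fields = line.split()
        # UDP rows have four columns: UDP, local addr:port, *:*, PID
        if len(fields) == 4 and fields[0] == "UDP":
            local_port = fields[1].rsplit(":", 1)[1]
            if local_port == str(port):
                return int(fields[3])
    return None


# Hypothetical sample in the shape of "netstat -ano" output.
SAMPLE = """
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:53             0.0.0.0:0              LISTENING       1692
  UDP    0.0.0.0:53             *:*                                    1692
  UDP    0.0.0.0:2883           *:*                                    1692
"""

print(find_udp_owner(SAMPLE, 2883))
```

On a live server the text would come from running netstat itself rather than from a pasted sample.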

So now I knew the process ID #1692 was sitting on port 2883. Task manager showed that process ID 1692 was DNS.EXE!

(The keenly observant among you will notice that dns.exe has a process ID (PID) of 8052 and not 1692. That is because I grabbed this print screen after I'd already stopped and started the DNS server service. Little did I know I'd write up an article on this, so I neglected to get the print screen prior to the restart. I could have fudged the screen, but ... )

Port 2883? Since when does DNS use port 2883? It is supposed to sit around on TCP 53 and UDP 53... And maybe chat with other DNS servers to keep things in sync. At least that's what I'd read in the past or seen when I've sniffed data.

Curious, I asked netstat what other ports process #1692, my DNS.EXE server service, was listening on, and was shocked:

This continued on for pages and pages until we got to the bottom:

Yes, it was all over the place.
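To put a number on "all over the place", the same netstat text can be tallied per process. Again, this is a rough Python sketch with a made-up function name and a hypothetical, heavily shortened sample; the real listing ran to thousands of rows for DNS.EXE.

```python
from collections import defaultdict


def udp_ports_by_pid(netstat_text):
    """Group bound UDP ports by owning PID, from 'netstat -ano' style text."""
    ports = defaultdict(list)
    for line in netstat_text.splitlines():
        fields = line.split()
        # UDP rows have four columns: UDP, local addr:port, *:*, PID
        if len(fields) == 4 and fields[0] == "UDP":
            port = int(fields[1].rsplit(":", 1)[1])
            ports[int(fields[3])].append(port)
    return ports


# Hypothetical, shortened sample; PID 456 stands in for some other service.
SAMPLE = """
  Proto  Local Address          Foreign Address        State           PID
  UDP    0.0.0.0:53             *:*                                    1692
  UDP    0.0.0.0:2883           *:*                                    1692
  UDP    0.0.0.0:2884           *:*                                    1692
  UDP    0.0.0.0:2885           *:*                                    1692
  UDP    0.0.0.0:500            *:*                                    456
"""

for pid, held in sorted(udp_ports_by_pid(SAMPLE).items()):
    print(pid, len(held), "UDP ports, lowest", min(held), "highest", max(held))
```

A process that holds thousands of UDP ports stands out immediately in a summary like this, which is far easier to read than paging through the raw list.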

I stopped the DNS Server service, told Server ActiveSync it could start up again (for the curious, on SBS 2003 and Server 2003 it is serviced by w3wp.exe on UDP port 2883), and all of a sudden my phone dinged again when an email arrived. Quick look in the event viewer and:

Yes, the process did initialize itself!

Drilling into the root cause

Sometimes just fixing a problem now is good enough. When there is a fire burning, you put it out NOW! But if you fix the symptom and don't find the root cause, chances are the problem will happen again. If you don't fix the reason the fire started, chances are it will just start again sometime in the future. I've met a lot of "system administrators" whose fix-it tool of choice is to reboot and hope it doesn't happen again.

I'm not one of them! In fact, I have a saying - if something very unlikely and strange happens once, it is a "one-off" and you can usually ignore it. Twice, it could be a coincidence. Three times, it is a pattern. Pattern failures that take a long time to recur are incredibly frustrating as you can't really say if something you did fixed the problem or not until a lot of time has passed without the problem recurring. They are puzzles, and I like puzzles - and this looked like it was going to be a good one. So searching for the root cause was the next task! I've seen this enough and now I have a nice clue - the question is why and what to do about it.

Some Google searches for "server 2003 dns listening" gave pointers to various articles:

http://technet.microsoft.com/en-us/security/bulletin/ms08-037
http://support.microsoft.com/kb/953230
http://support.microsoft.com/kb/956188/en-us

The 3rd link had gold, in the Detailed Cause section:

"The implementation of the DNS server security update reserves a set of ports when randomizing queries. This design decision was made to address performance concerns for DNS servers that handle and originate a significantly larger number of queries compared to Windows-based clients. The set of reserved ports by the DNS Server is referred to from here onward as a "socket pool." The default size of the socket pool on Windows-based servers is 2,500 sockets."

Ok, so DNS.EXE is going to latch onto 2500 random port numbers between 1025 and 65535... hmmm.. And what prevents it from stepping on some other service's port?

Some quick math (2500 / (65535-1024)) * 100 = 3.875% chance that on any given boot DNS will step on a port some other service wants, assuming it only wants one port and that DNS started before that other service did.
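The same arithmetic as a small Python sketch, for anyone who wants to plug in other pool sizes or port ranges. The function name is made up, and it assumes DNS picks distinct ports uniformly at random from the range, which is my simplification and not something the Microsoft articles spell out.

```python
def chance_port_taken(pool_size, low_port, high_port):
    """Chance that one specific port in [low_port, high_port] lands in the pool.

    Assumes the pool is pool_size distinct ports drawn uniformly at random.
    """
    candidates = high_port - low_port + 1
    return pool_size / candidates


# Default pool of 2,500 sockets drawn from ports 1025 through 65535.
default_chance = chance_port_taken(2500, 1025, 65535)
print(f"{default_chance * 100:.3f}%")  # prints 3.875%
```

Shrinking the pool or narrowing the range changes the odds proportionally.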

As "Mr. Magical" Marshall Brodien used to say, "It's always easy, once you know the secret!"

Curious, I looked and another 2003 server (not SBS) was doing the same thing, only his port started way higher:


The other server's random port usage started higher - around 49,000 or so.

Later, I also found a link to an article discussing what ports needed to be reserved on an SBS 2003 server:

http://support.microsoft.com/kb/956189

Problem Summary:

By design, DNS grabs a bunch of randomly numbered ports. This can step on ports that other services are pre-programmed to use. If DNS starts before one of those other services, there is a 3.875% chance that the other service will not run correctly because DNS hijacked its port.

What to do about this...

So my real questions:

1) How can I prevent the random port grabbing from stepping on my server's toes?
2) Can I make TCPView usable again?

Reading more of the articles, I can tweak this:

The default size of the socket pool on Windows-based servers is 2,500 sockets. This size is configurable by modifying the SocketPoolSize registry entry in the following subkey in the registry: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters\SocketPoolSize

That would make TCPView more usable. The lower I make it, though, the more I'm sacrificing the added randomness Microsoft implemented to prevent the DNS cache poisoning vulnerability.

The other thing I can do is tweak the ephemeral port allocation range... and still more reading:

"After you install security update 953230 on Windows Server 2003 and down-level platforms, the following conditions are true:
* If the value of the MaxUserPort registry entry is set, the ports are allocated randomly from the [1024, MaxUserPort] range.
* If the value of the MaxUserPort registry entry is not set, the ports are allocated randomly from the [49152, 65535] range."

The MaxUserPort entry is at:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\MaxUserPort

Server 2008 uses the range of 49152-65535 whether or not the update is installed. You can view the current range with the command below (the matching "set dynamicport" command changes it):

netsh int ipv4 show dynamicport [tcp|udp]

So I'm guessing I do have MaxUserPort defined on the SBS 2003 server and no MaxUserPort defined on the stand alone 2003 server.

I did not have SocketPoolSize defined on either server - so I created the DWORD, and set it to 50. Now TCPView works again!
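If the same change has to go onto several servers, a .reg file saves some clicking. The Python sketch below only builds the text of such a file for the SocketPoolSize value named in the Microsoft article; the function name is made up, and I'm assuming the DNS Server service has to be restarted before it picks up the new value.

```python
def socket_pool_reg(size):
    """Build .reg file text that sets the DNS SocketPoolSize DWORD."""
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters"
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{key}]\n"
        # .reg files store DWORD values as eight hex digits (50 -> 00000032)
        f'"SocketPoolSize"=dword:{size:08x}\n'
    )


print(socket_pool_reg(50))
```

Save the output as a .reg file and review it before importing; nothing here touches the registry on its own.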

As for the MaxUserPort parameter ....

My SBS 2003 server did have the parameter defined:

which explains why he would sometimes (3.8% chance) step on ActiveSync.

My stand alone Server 2003 box did not have it defined:

So he was OK.
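The difference between the two servers follows directly from the two rules quoted above, and can be sketched in Python. The function names are made up, and the MaxUserPort value of 60000 is purely hypothetical, standing in for whatever was actually set on the SBS box.

```python
def dns_pool_range(max_user_port=None):
    """Port range DNS draws from after update 953230, per the quoted rules."""
    if max_user_port is not None:
        return (1024, max_user_port)
    return (49152, 65535)


def can_collide(port, max_user_port=None):
    """True if the given port falls inside the range DNS may grab from."""
    low, high = dns_pool_range(max_user_port)
    return low <= port <= high


ACTIVESYNC_AUTD_PORT = 2883

# SBS 2003 box: MaxUserPort defined (60000 is a hypothetical value).
print(can_collide(ACTIVESYNC_AUTD_PORT, max_user_port=60000))  # prints True

# Stand alone 2003 box: MaxUserPort not defined.
print(can_collide(ACTIVESYNC_AUTD_PORT))  # prints False
```

With MaxUserPort defined, the range starts at 1024 and so includes port 2883; without it, the range starts at 49152 and ActiveSync's port is never in play.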

Why would SBS 2003 define that parameter?

Setting up the system to have more than 16,000 ephemeral ports? Not sure. I found references that say I shouldn't delete the key, but no real reason why. I really respect Susan Bradley, the SBS Diva, and her knowledge, but this one I'm going to disagree with for now. I might change my mind later, but right now it is "Delete the key." None of my clients run their SBS Server with more than 20 people, and if that isn't enough ports to use then I'm thinking they've been hacked. So for now, I'm going to delete the key from the SBS server and see what happens.

Stay tuned! I might have to change my mind if this proves fatal!

More to come...

UPDATE 8/2014:

It has now been 22 months since I fixed this. Not once has the problem recurred. I've seen no bad effects from the changes. This exact fix has been applied to numerous client computers, none of which have reported any ill effects. I'd have to say this problem is now solved!

If you found this helpful, please send me a brief email -- one line will more than do. If I see that people need, want, and/or use this kind of information, that will encourage me to keep creating this kind of content. Whereas if I never hear from anyone, then why bother?

I can be reached at:
das (at-sign) dascomputerconsultants (dot) com

Enjoy!

David Soussan