Soussan DAS Computer Consultants


 


Another interesting network problem
 

If I ever do another webcast for Microsoft on debugging with a network sniffer, this problem will be included!

Warning: This is some deep technical content - deeper than I normally go. I'll do some explaining, but I'm going to assume you've got some familiarity with data flowing through networks at a protocol level.

So strap on your propeller hat and get it spinning at a high speed!


Here a jack, there a jack, every where a jack, jack

This particular site has both a factory area and an office area, and both are networked. When a new assembly line is opened up, the new machines all have network jacks and can be configured, monitored, and set to send alerts, all via the network. This is great because what used to be isolated stations, where you had to walk around to monitor them or listen for a buzzer when there was a problem, now report their status automatically.

When the factory needed more network drops, another switch was hoisted onto a support beam and wires run to the various devices. When they needed a wireless point for the floor, they had a switch right there to plug into and now your iPhone gets status from your factory machines.

 
As the factory needed jacks, they added switches - now the whole factory is networked along with a collection of offices and cubicles each with their own PC, various wireless network access points, a security camera system with remote monitoring capabilities, access card readers on various doors, a network circuit to another location many states away, and racks of servers in their server room. It is a moderately complex network with under 500 devices total.

The factory folks have their own IP address range, with every address in that block statically assigned. Same with the servers, WAPs, doors, cameras, etc. The PCs all use DHCP. They've kept pretty good documentation of what devices are at what IP and MAC addresses, and poor documentation of the factory's switches, wiring, and equipment distribution on the floor - that is "their area", though IT gets called when something doesn't just plug in and work.

That is a brief background to launch into....

The problem:

I got the call one day: "Our door access card system keeps going off-line. Plus, people randomly get timeouts when going to websites when we know the web is working fine, though AT&T says we need to upgrade to fiber." (they currently have two T1s bonded). "Sometimes we can't get onto the wireless from the factory floor. The other day, someone couldn't get onto the network and they had some wild IP address, and we don't know where that came from. One of the factory machines sometimes can't print to its network printer, and resetting the machine fixes it. Things are just flakey - help!"

I've told the Laffy Taffy joke before ... "How do you eat an elephant?"

"One bite at a time."

So I started with the most reliable and repeatable failure, which was the door access system. Here is a snip from the door access system's log screen:

If you try to get into a door when it is disconnected, you don't get in. Being locked out is bad! Especially when it is snowing!

This happens all the time, at random times throughout the day. Sometimes continuously. Sometimes it works just fine.

Visual analysis

From the computer that generated the screen above, I opened up a command prompt and started a 'ping -t' command to the door controller's IP address. Lots of missed pings, lots of high echo times.
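To put numbers on "lots of missed pings", the ping output can be saved to a file and tallied. This is a minimal sketch, assuming the usual Windows ping line formats ("Reply from ... time=12ms" and "Request timed out."). The function name `summarize_ping`, the 100 ms "slow" threshold, and the sample lines (including the address 10.10.1.50) are made up for illustration and are not taken from the site.

```python
import re

# Matches the round-trip time in a Windows-style reply line,
# e.g. "time=12ms" or "time<1ms".
REPLY_TIME = re.compile(r"time[=<](\d+)ms")


def summarize_ping(lines, slow_ms=100):
    """Count replies, timeouts, and slow replies in captured ping output."""
    replies = timeouts = slow = 0
    for line in lines:
        match = REPLY_TIME.search(line)
        if match:
            replies += 1
            if int(match.group(1)) >= slow_ms:
                slow += 1
        elif "Request timed out" in line:
            timeouts += 1
    sent = replies + timeouts
    loss_pct = 100.0 * timeouts / sent if sent else 0.0
    return {"sent": sent, "replies": replies, "timeouts": timeouts,
            "slow": slow, "loss_pct": loss_pct}


# Hypothetical sample output, not captured from the actual door controller.
sample = [
    "Reply from 10.10.1.50: bytes=32 time=3ms TTL=64",
    "Request timed out.",
    "Reply from 10.10.1.50: bytes=32 time=412ms TTL=64",
    "Request timed out.",
]
print(summarize_ping(sample))
```

Running the same tally before and after a fix gives a quick before/after comparison without staring at a scrolling window.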

I opened up the cabinet for the controller and saw LEDs for link and activity. Link was solid on (good!) and activity was solid on (Hmmm...)

Some devices like older 3Com switches (3300) have an activity LED that is on when traffic is present, off when it isn't. Solid on means solid traffic, not lit means quiet as a mouse. Other devices (some of the older Cisco switches, the 2900 series comes to mind) have LEDs that blink at a fixed frequency whenever any traffic is seen and give no other indication how much traffic is present.

This door system was solid on, and unfortunately I didn't know enough about it to know how its LED behaved.

Yet.

If you've read the botnet article, or watched the webcast, or talked to me about your database performance, you know I'm a big fan of looking at the network data to figure out where to go look for a problem that can hide in hundreds of different places. So I did what has been my "big problem, where do I look?" diagnostic step yet again.

Pardon me while I whip this out:

Transition Networks Pocket Hub-8 Front

Front view, with obligatory blinking LEDs and ...

Transition Networks Pocket Hub-8 Rear

Rear view. This is my network hub. There are many like it, but this one is mine. My network hub is my best friend. It is my life. I must master it as I master my life...

You can read about hubs and switches and stuff in the supplemental information for the webcast here. But this little "2 for $10 on closeout" hub and I have been together for so many years I had to give it a little spotlight. In fact, I bought 4 of them.

Anyway, hooking the hub in at the door system along with a notebook running Wireshark to capture a few packets, this is what I saw right away:

First sniff, multicasts

If that doesn't scream at you what the problem is, then you probably don't own your own propeller hat, and that is OK - I'll loan you mine. Look carefully down the column "Destination" and look for a pattern. The first number of the destination address is 239, which puts it in the IPv4 multicast range (224.0.0.0 through 239.255.255.255). The sources are all over the place - 10.10.200.x is the range IT gave to the factory to use for their equipment.
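If you would rather not eyeball the Destination column, the same check can be scripted. A small sketch using Python's standard `ipaddress` module; the helper names `is_multicast` and `multicast_share` and the sample addresses are illustrative assumptions, not values from the capture.

```python
import ipaddress


def is_multicast(address):
    """True if the address falls in the multicast range (224.0.0.0/4 for IPv4)."""
    return ipaddress.ip_address(address).is_multicast


def multicast_share(destinations):
    """Fraction of destination addresses that are multicast."""
    if not destinations:
        return 0.0
    hits = sum(1 for dest in destinations if is_multicast(dest))
    return hits / len(destinations)


# Hypothetical destination column exported from a capture.
destinations = ["239.255.0.1", "239.192.1.5", "10.10.200.14", "255.255.255.255"]
print(multicast_share(destinations))
```

Note that the all-ones broadcast address 255.255.255.255 is not counted as multicast, which matches the unicast / broadcast / multicast split described in the next section.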

So lots of devices on the factory floor are multicasting stuff. And that stuff is hitting the door access controller. The door access controller doesn't give a darn about what is happening with the factory's machinery. It is as if the door access controller is a 90 year old man at an Aerosmith concert and he can't stand all the noise. So he closes his ears and can't hear the knock on the door either.

Multicasting 100

There is a lot you can read about multicast on the web. I'm going to oversimplify and analogize a lot here, so you network gurus please realize that before flaming me with email corrections.

Packets fall into 3 different types - unicast, broadcast, and multicast.

Unicast packets are sent from one device to another device and nobody else should need to hear them. When you are copying a file from your workstation to your server, that data flows between the two systems and is sent via unicast.

Broadcast packets are supposed to go everywhere on that network segment. These are used when you don't know who the recipient should be and you are hoping they are there, listening, and will answer you. For example, when your system is powering up and first connecting to the network before it even has an IP address, broadcast packets are used to locate a server that can give you a specific IP address.

Multicast packets are like radio stations. They put their music out into the airwaves and whoever is interested in hearing their music tunes into that station. Except in order to tune in, the radio waves have to make it to your radio. In the old days, hubs treated multicast just like broadcast. Today, stupid switches still treat multicast like broadcast, sending the data everywhere and letting each system decide if it is going to listen or ignore that data.

Smarter switches, routers, and firewalls can watch for receivers that say they are tuning into a station and only send that data to the receivers that are interested in hearing that multicast channel, so the traffic flows only where someone is listening. There are special network packets (IGMP membership reports) that interested parties send out to tell the rest of the network they are tuning into one of those stations.
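For the curious, this is roughly what "tuning in" looks like from a program's point of view. A minimal sketch using Python's standard `socket` module: setting the `IP_ADD_MEMBERSHIP` option asks the operating system to join the group, and the OS is what actually sends the membership report onto the wire. The function names and any group address or port you pass in are illustrative assumptions, not anything used at this site.

```python
import socket
import struct


def build_membership_request(group, interface="0.0.0.0"):
    """Pack the group address followed by the local interface address.

    This 8-byte structure is what IP_ADD_MEMBERSHIP expects; "0.0.0.0"
    lets the operating system pick the interface.
    """
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def join_group(group, port):
    """Open a UDP socket and subscribe it to a multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Joining the group prompts the OS to announce the membership.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_membership_request(group))
    return sock
```

On a machine with a working network interface, something like `join_group("239.1.2.3", 5000)` (both values made up) would return a socket whose `recvfrom()` receives whatever is sent to that group and port.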

Multicasting 100.1

Skip this if you aren't interested in behind the scenes network stuff.

"So what is multicasting used for?"

Let's say I have a network connected camera pointed at the front door. All cameras are continuously recorded on a DVR (Feed #1) so if there is any issue security can review the video feed. That data is also sent to a second location that also records the video feed (#2) as a safeguard in case bad people break in and steal the DVR to hide their faces. Mr. Finch at the security station (#3) is interested in who is there, so he tunes into the camera and data flows between camera and security station. A man in a suit buzzes the door, and says "I'm Reese, and I'm here to see the CEO."

Mr. Finch in security looks and doesn't recognize Mr. Reese, so he buzzes the CEO and says "Look at the front door - do you know him?" and the CEO connects up and sees video on feed #4.

Now this poor camera is sending the same video to 4 different places - the exact same video - and is taxing its own capabilities and the network infrastructure it is talking over.

With multicast, it puts out the data once on a multicast channel. Anyone interested "tunes in" to that address and hears / sees the network data. You can have a hundred viewers of the video and the camera doesn't send more data than when one person is watching.
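The arithmetic behind the camera example is simple enough to write down. A sketch, assuming a made-up 2 Mb/s video stream and at least one viewer; the function name `camera_load_mbps` is illustrative.

```python
def camera_load_mbps(stream_mbps, viewers, multicast):
    """Traffic the camera itself has to send, in Mb/s, for one or more viewers."""
    if multicast:
        # One copy goes out no matter how many receivers tune in.
        return stream_mbps
    # With unicast, every viewer gets its own copy of the same stream.
    return stream_mbps * viewers


STREAM_MBPS = 2.0  # hypothetical bitrate for one video feed

print(camera_load_mbps(STREAM_MBPS, 4, multicast=False))
print(camera_load_mbps(STREAM_MBPS, 4, multicast=True))
print(camera_load_mbps(STREAM_MBPS, 100, multicast=True))
```

With the four feeds in the story, the unicast camera sends four times the data of the multicast one, and the multicast figure stays flat even at a hundred viewers.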

Multicasting isn't just for video, but it is a great example. It can be used anytime multiple receivers might need to receive the exact same data at the same time.

Back to the problem

All this multicast data from the factory floor was hammering the door system. Being a very low traffic door system, it was actually built with 10 Mb/s hardware - not even Fast Ethernet. So all this multicast traffic was flooding the port and it couldn't talk to the rest of its system to say "Yes, open the pod bay doors please, HAL".

The solution is to get the multicast traffic to flow only where it needs to flow and not to every single device. In other words, stop treating multicast like it was a broadcast packet.

The factory devices were hooked together into a NetGear GS724T switch, and looking inside I found a place to configure it to intelligently handle multicast:

GS724T Multicast Configuration

IGMP Snooping means "Listen to clients and only send multicast packets out when someone tunes into that multicast channel"

At first, all three were set to Disabled. When disabled, multicast packets are treated like broadcast and sent everywhere all the time. I enabled two of the three options and we set about looking for the next bit of elephant to bite on.

(Side note: We weren't sure if their network equipment on the factory floor could properly handle multicast. Plan B was to wire the factory floor into an unused port of the firewall and do some filtering there, but we didn't need to as this was a smarter switch. I point this out as there are often multiple ways to solve these kinds of problems.)
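The difference the IGMP Snooping setting makes can be modeled in a few lines. This is a toy sketch, not how the GS724T is implemented: real snooping switches also deal with router ports, membership timeouts, and groups nobody has joined. The port numbers, the group address, and the function name `egress_ports` are all made up for illustration.

```python
def egress_ports(all_ports, ingress_port, group, memberships, snooping_enabled):
    """Return the set of ports a multicast frame is forwarded out of."""
    others = set(all_ports) - {ingress_port}
    if not snooping_enabled:
        # No snooping: the frame is flooded like a broadcast.
        return others
    # Snooping: only ports where a receiver joined this group get the frame.
    return others & memberships.get(group, set())


PORTS = [1, 2, 3, 4, 5, 6]             # hypothetical switch ports
DOOR_CONTROLLER_PORT = 6               # hypothetical: the 10 Mb/s door system
memberships = {"239.1.1.1": {2, 3}}    # hypothetical: receivers on ports 2 and 3

flooded = egress_ports(PORTS, 1, "239.1.1.1", memberships, snooping_enabled=False)
snooped = egress_ports(PORTS, 1, "239.1.1.1", memberships, snooping_enabled=True)
print(DOOR_CONTROLLER_PORT in flooded, DOOR_CONTROLLER_PORT in snooped)
```

In the flooded case the door controller's port is hit by every frame from the factory machines; with snooping on, it only sees traffic it asked for.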

More traffic!

Another thing the first network sniff revealed can be seen in these two screen shots:

Ping Packet

This shows what looks like the device 10.10.5.25 continuously sending broadcast pings everywhere.

Broadcast pings are bad. I've seen only a few legitimate uses for a broadcast ping - asking every single device on the network to answer back - and even then only in limited instances. On the far right of each packet, the TTL (Time To Live) is decrementing, so these look like they are bouncing off some network device that is decrementing the hop count and repeating the packet. The first packet has a MAC address that claims it is from a Cisco Linksys device, whereas subsequent packets:

have a MAC address that says it is from the Sonicwall firewall. This one had lots of question marks dancing over my head. I wasn't sure if this was a funky reaction a device was having to the multicast traffic, if something was wrong with the device, if someone on the other end of the wireless was trying to view the multicast traffic and it was choking the wireless, or if the Sonicwall had a bug where it was reflecting traffic in a strange way.
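The "same packet, falling TTL" pattern is something a script can hunt for in a larger capture. A rough sketch, assuming each repeated copy keeps the same IP identification field (this write-up does not confirm that for these packets) and that the ID and TTL have already been exported as pairs. The function name `find_repeated_packets` and the sample values are hypothetical.

```python
from collections import defaultdict


def find_repeated_packets(packets):
    """Return IP IDs that were seen more than once with strictly falling TTLs.

    `packets` is an iterable of (ip_id, ttl) pairs in capture order.
    """
    ttls_by_id = defaultdict(list)
    for ip_id, ttl in packets:
        ttls_by_id[ip_id].append(ttl)

    suspects = {}
    for ip_id, ttls in ttls_by_id.items():
        falling = all(a > b for a, b in zip(ttls, ttls[1:]))
        if len(ttls) > 1 and falling:
            suspects[ip_id] = ttls
    return suspects


# Hypothetical (ip_id, ttl) pairs, not taken from the actual capture.
observed = [(0x1A2B, 128), (0x1A2B, 127), (0x1A2B, 126), (0x3C4D, 64)]
print(find_repeated_packets(observed))
```

A packet that shows up once is ignored; only IDs seen several times with the TTL stepping down are flagged as likely being repeated by some device in the path.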

So we took the access point down and looked at its traffic in isolation and nothing looked out of the ordinary. So we put it back up into its home and took another short sniff from the door system's port.

This was way less noisy, so I ran a Statistics->Conversations report which looked like this:

The top speaker was 10.10.200.126 (on the factory floor!) - and it is sending broadcast packets everywhere.
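Finding the top speaker boils down to summing bytes per sender. A simplified sketch: Wireshark's Conversations report groups traffic by pairs of endpoints, while this version only totals by source address. The function name `top_talkers`, the packet lengths, and the address 10.10.200.40 are invented for illustration.

```python
from collections import Counter


def top_talkers(packets, count=3):
    """Sum bytes per source address and return the heaviest senders first."""
    totals = Counter()
    for source, length in packets:
        totals[source] += length
    return totals.most_common(count)


# Hypothetical (source address, frame length in bytes) pairs.
capture = [
    ("10.10.200.126", 1500),
    ("10.10.5.25", 98),
    ("10.10.200.126", 1500),
    ("10.10.200.40", 600),
]
print(top_talkers(capture, count=2))
```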

What is it? Looking at one of the packets:

So there is some kind of USB Plug and Play sound device manufactured by a company called C-Media Electronics Inc., and it might be called Net2VGA or DisplayLink. The amount of traffic this was generating was well below the rest of the traffic, so it might not matter for this particular problem. Not that this can't cause other issues - there is no good reason to be broadcasting data at this rate all over the network - it is either poor design or sloppy coding on someone's part.

We looked around for a rogue DHCP server and couldn't find any, so that problem might have to wait until it recurs.

Years ago, I'd instrumented some of their network equipment to report statistics to a central agent. Time and time again, when faced with unknowns, this tool and its data have been key in seeing at least where to go looking for a particular problem. In this case, it not only showed the global problem but that the global problem was indeed gone:

These show the network traffic on various ports of the factory's switch before 10:45 AM when the multicast traffic was quieted down and the Cisco WAP was brought down to see what it was doing, then put back up.

I left the site around Noon and monitored the network traffic graphs remotely for the next two weeks.

This concludes part 1. But that implies a part 2. Yes, there is more!

The half eaten elephant gets reanimated - Click Here!

 

 
