As the factory needed jacks, they added switches - now the whole factory is
networked along with a collection of
offices and cubicles each with their own PC, various wireless network
access points, a security camera system with remote monitoring
capabilities, access card readers on various doors, a network circuit to
another location many states away, and racks of servers in their server
room. It is a moderately complex network with under 500 devices total.
The factory folks have their own IP address range, and every device in
that block has a static address. Same with the servers, WAPs, doors,
cameras, etc. The PCs all use DHCP. They've kept pretty good documentation of what devices
are at what IP and MAC addresses, and poor documentation of the
factory's switches, wiring, and equipment distribution on the floor -
that is "their area", though IT gets called when something doesn't just
plug in and work.
That is a brief background to launch into....
The problem:
I got the call one day: "Our door access card system keeps going off-line. Plus, people
randomly get timeouts when going to websites when we know the web is
working fine, though AT&T says we need to upgrade to fiber." (they
currently have two T1s bonded). "Sometimes we can't get onto the
wireless from the factory floor. The other day, someone couldn't get
onto the network and they had some wild IP address, and we don't know
where that came from. One of the factory machines sometimes can't print
to its network printer, and resetting the machine fixes it. Things are
just flakey - help!"
I've told the Laffy Taffy joke before ... "How do you eat an
elephant?"
"One bite at a time."
So I started with the most reliable and repeatable failure, which was
the door access system. Here is a snip from the door access system's log
screen:
If you try to get into a door when it is disconnected, you don't get
in. Being locked out is bad! Especially when it is snowing!
This happens all the time, at random times throughout the day.
Sometimes continuously. Sometimes it works just fine.
Visual analysis
From the computer that generated the screen above, I opened up a
command prompt and started a 'ping -t' to the door controller's IP
address. Lots of missed pings, lots of high echo times.
I opened up the cabinet for the controller and saw LEDs for link and
activity. Link was solid on (good!) and activity was solid on (Hmmm...)
Some devices like older 3Com switches (3300) have an activity LED
that is on when traffic is present, off when it isn't. Solid on means
solid traffic, not lit means quiet as a mouse. Other devices (some of the older Cisco switches, the 2900
series comes to mind) have LEDs that blink at a fixed frequency whenever
any traffic is seen and give no other indication how much traffic is
present.
This door system was solid on, and unfortunately I didn't know enough
about it to know how its LED behaved.
Yet.
If you've read the botnet article, or watched the webcast, or talked
to me about your database performance, you know I'm a big fan of looking
at the network data to figure out where to go look for a problem that
can hide in hundreds of different places. So I did what has been my "big
problem, where do I look?" diagnostic step yet again.
Pardon me while I whip this out:
Front view, with obligatory blinking LEDs and ...
Rear view. This is my network hub. There are many like it, but this
one is mine. My network hub is my best friend. It is my life. I must
master it as I master my life...
You can read about hubs and switches and stuff in the
supplemental information for the webcast here.
But this little "2 for $10 on closeout" hub and I have been together for
so many years I had to give it a little spotlight. In fact, I bought 4
of them.
Anyway, hooking the hub in at the door system along with a notebook running Wireshark to
capture a few packets, this is what I saw right away:
If that doesn't scream at you what the problem is, then you
probably don't own your own propeller hat, and that is OK - I'll loan
you mine. Look carefully down the column "Destination" and look for a
pattern. The first number of the destination address is 239, which makes
it a multicast address (anything starting with 224 through 239 is). The
sources are all over the place - 10.10.200.x is the range
IT gave to the factory to use for their equipment.
So lots of devices on the factory floor are multicasting stuff. And
that stuff is hitting the door access controller. The door access
controller doesn't give a darn about what is happening with the
factory's machinery. It is as if the door access controller is a 90
year old man at an Aerosmith concert and he can't stand all the noise.
So he closes his ears and can't hear the knock on the door either.
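If you'd rather not eyeball the Destination column, a few lines of script can do the sorting for you. This is only a sketch: the addresses in `sample` are made up for illustration (they are not copied from the capture), and `count_multicast` is a name I invented. The only thing it leans on is Python's standard `ipaddress` module, which flags anything from 224.0.0.0 through 239.255.255.255 as multicast.

```python
import ipaddress


def count_multicast(destinations):
    """Return (multicast_count, total_count) for a list of IPv4 strings."""
    multicast = sum(
        1 for dst in destinations
        if ipaddress.ip_address(dst).is_multicast
    )
    return multicast, len(destinations)


# Hypothetical destination column, for illustration only.
sample = ["239.192.1.10", "239.192.1.11", "10.10.5.25", "239.255.255.250"]
multicast, total = count_multicast(sample)
print(f"{multicast} of {total} destinations are multicast")
```

If most of what arrives at a port is multicast that the device on that port never asked for, you have found your Aerosmith concert.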
Multicasting 100
There is lots you can read about multicast on the web. I'm going to
oversimplify and analogize a lot here, so you network gurus, please keep
that in mind before flaming me with email corrections.
Packets fall into 3 different types - unicast, broadcast, and
multicast.
Unicast packets are sent from one device to another device and nobody
else should need to hear them. When you are copying a file from your
workstation to your server, that data flows between the two systems and
is sent via unicast.
Broadcast packets are supposed to go everywhere on that network
segment. These are used when you don't know who the recipient should be
and you are hoping they are there, listening, and will answer you. For
example, when your system is powering up and first connecting to the
network before it even has an IP address, broadcast packets are used to
locate a server that can give you an IP address. That is a DHCP
"Discover". Another everyday example is "ARP"ing, or Address Resolution
Protocol: shouting "who has this IP address?" at everyone to learn the
MAC address behind it.
Multicast packets are like radio stations. They put their music out
into the airwaves and whoever is interested in hearing their music tunes
into that station. Except in order to tune in, the radio waves have to
make it to your radio. In the old days, hubs treated multicast just like
broadcast. Today, stupid switches still treat multicast like broadcast,
sending the data everywhere and letting each system decide if it is
going to listen or ignore that data.
Smarter switches, routers, and firewalls can watch for receivers that
say they are tuning into a station and only send that data to the
receivers that are interested in hearing that multicast channel, so the
traffic flows only where someone is listening. There are special network
packets interested parties send out to tell the rest of the network they
are tuning into one of those stations: IGMP Join (Membership Report)
messages come into play here, and the opposite is an IGMP Leave message
when you aren't interested in the stream anymore. This is scratching the
surface.
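For the curious, here is roughly what "tuning in" looks like from a program's point of view. Treat it as a sketch under assumptions: `tune_in`, `tune_out`, and `build_membership_request` are names I made up, and the group address and port you pass in are yours to choose. The socket options themselves (`IP_ADD_MEMBERSHIP` and `IP_DROP_MEMBERSHIP`) are standard; setting them is what normally prompts the operating system to send the IGMP messages on your behalf.

```python
import socket
import struct


def build_membership_request(group, interface="0.0.0.0"):
    """Pack the 8-byte request: multicast group + local interface address.

    "0.0.0.0" leaves the choice of interface to the operating system.
    """
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def tune_in(group, port):
    """Open a UDP socket and join `group` (the OS sends the IGMP join)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_membership_request(group))
    return sock


def tune_out(sock, group):
    """Drop the membership (the OS handles the IGMP leave) and close up."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP,
                    build_membership_request(group))
    sock.close()
```

A receiver would call something like `sock = tune_in("239.1.2.3", 5000)` (both values hypothetical), read with `sock.recvfrom(...)`, and call `tune_out(sock, "239.1.2.3")` when done. A snooping switch watches for exactly that join and leave traffic.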
Multicasting 100.1
Skip this if you aren't interested in behind the scenes network
stuff.
"So what is multicasting used for?"
Let's say I have a network connected camera pointed at the front door.
All cameras are continuously recorded on a DVR (feed #1) so if there is
any issue they can review the video feed. That data is also sent to a
second location that also records the video feed (#2) as a safeguard in
case bad people break in and steal the DVR to hide their faces. Mr.
Finch at the security station (#3) is interested in who is there, so he
tunes into the camera and data flows between camera and security
station. A man in a suit
buzzes the door, and says "I'm Reese, and I'm here to see the CEO."
Mr. Finch looks and doesn't recognize Mr. Reese, so he buzzes the CEO
and says "Look at the front door - do you know him?" and the CEO
connects up and sees video on feed #4.
Now this poor camera is sending the same video to 4 different places
- the exact same video - and is taxing its own capabilities and the
network infrastructure it is talking over.
With multicast, it puts out the data once on a multicast channel.
Anyone interested "tunes in" to that address and hears / sees the
network data. You can have a hundred viewers of the video and the camera
doesn't send more data than when one person is watching.
Multicasting isn't just for video, but it is a great example. It can
be used anytime multiple receivers might need to receive the exact same
data at the same time.
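To put rough numbers on the camera example: with unicast the camera sends one copy per viewer, with multicast it sends one copy, period. The sketch below just does that arithmetic. The 4 Mb/s bitrate is an assumption picked for illustration, not a figure from this site's cameras, and `camera_send_rate_mbps` is a made-up name.

```python
def camera_send_rate_mbps(stream_mbps, viewers, multicast):
    """Traffic the camera itself must transmit, in Mb/s."""
    # Unicast: one copy per viewer. Multicast: one copy, however many tune in.
    copies = 1 if multicast else viewers
    return stream_mbps * copies


STREAM_MBPS = 4.0  # assumed bitrate, for illustration only

for viewers in (1, 4, 100):
    unicast = camera_send_rate_mbps(STREAM_MBPS, viewers, multicast=False)
    shared = camera_send_rate_mbps(STREAM_MBPS, viewers, multicast=True)
    print(f"{viewers:>3} viewers: unicast {unicast:6.1f} Mb/s, "
          f"multicast {shared:4.1f} Mb/s")
```

The unicast column grows with every viewer; the multicast column never moves.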
Back to the problem
All this multicast data from the factory floor was hammering the door
system. Being a very low traffic door system, it was actually built with
10 Mb/s hardware - not even fast Ethernet. So all this multicast traffic
was flooding the port and it couldn't talk to the rest of its system to
say "Yes, open the pod bay doors please Hal".
The solution is to get the multicast traffic to flow only where it
needs to flow and not to every single device. In other words, stop
treating multicast like it was a broadcast packet.
The factory devices were hooked together into a NetGear GS724T
switch, and looking inside I found a place to configure it to
intelligently handle multicast:
IGMP Snooping means "Listen to clients and only send multicast
packets out when someone tunes into that multicast channel"
At first, all three were set to Disabled. When disabled, multicast
packets are treated like broadcasts and sent everywhere all the time. I
enabled two of the three
options and we set about looking for the next bit of elephant to bite
on.
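If it helps to see the difference that one setting makes, here is a toy model of a switch deciding where a multicast frame goes. It is emphatically not how the NetGear (or any real switch) is implemented; real snooping also has to deal with router ports, timeouts, and groups nobody has joined yet. `ToySwitch`, the port names, and the group address are all invented for illustration.

```python
class ToySwitch:
    """A toy model of multicast forwarding, not any vendor's implementation."""

    def __init__(self, ports, igmp_snooping):
        self.ports = set(ports)
        self.igmp_snooping = igmp_snooping
        self.members = {}  # multicast group -> set of ports that joined

    def join(self, port, group):
        """Record that a device on `port` tuned into `group`."""
        self.members.setdefault(group, set()).add(port)

    def leave(self, port, group):
        """Record that a device on `port` tuned out of `group`."""
        self.members.get(group, set()).discard(port)

    def egress_ports(self, ingress_port, group):
        """Ports a multicast frame for `group` is sent out of."""
        if self.igmp_snooping:
            targets = self.members.get(group, set())
        else:
            targets = self.ports  # flood, exactly like a broadcast
        return sorted(targets - {ingress_port})


# Hypothetical ports and group, for illustration only.
ports = ["machine-1", "machine-2", "hmi", "door-controller"]
group = "239.1.2.3"
for snooping in (False, True):
    switch = ToySwitch(ports, igmp_snooping=snooping)
    switch.join("hmi", group)
    print("snooping on: " if snooping else "snooping off:",
          switch.egress_ports("machine-1", group))
```

With snooping off, the door controller's port is on the delivery list for traffic it never asked for. With snooping on, only the port that joined gets it.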
(Side note: We weren't sure if their network equipment on the factory
floor could properly handle multicast. Plan B was to wire the factory
floor into an unused port of the firewall and do some filtering there,
but we didn't need to as this was a smarter switch. I point this out
as there are often multiple ways to solve these kinds of problems.)
More traffic!
Another thing the first network sniff revealed can be seen in these
two screen shots:
This shows what looks like the device 10.10.5.25 continuously sending
broadcast pings everywhere.
Broadcast pings are bad. I've only seen a very few legitimate uses for
a broadcast ping - asking every single device on the network to answer
back - and then only in limited instances. On the far right of
each packet, the TTL (Time To Live) is decrementing, so these look like
they are bouncing off some network device that is decrementing the hop
count and repeating the packet. The first packet has a MAC address that
claims it is from a Cisco Linksys device, whereas subsequent packets
have a MAC address that says it is from the Sonicwall firewall. This
one had lots of question marks dancing over my head. I wasn't sure if this
was a funky reaction a device was having to the multicast traffic, if
something was wrong with the device, if someone on the other end of the
wireless was trying to view the multicast traffic and it was choking the
wireless, or if the Sonicwall had a bug where it was reflecting traffic
in a strange way.
So we took the access point down and looked at its traffic in
isolation and nothing looked out of the ordinary. So we put it back up
into its home and took another short sniff from the door system's port.
This was way less noisy, so I ran a Statistics->Conversations report
which looked like this:
The top speaker was 10.10.200.126 (on the factory floor!) - and it is
sending broadcast packets everywhere.
What is it? Looking at one of the packets:
So there is some kind of USB Plug and Play sound device manufactured
by a company called C-Media Electronics Inc., and it might be called
Net2VGA or DisplayLink. The amount of traffic this was generating was
well below the rest of the traffic, so it might not matter for this
particular problem. Not that this can't cause other issues - there is no
good reason to be broadcasting data at this rate all over the network -
it is either poor design or sloppy coding on someone's part.
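The Statistics->Conversations report used above is, at heart, a tally of bytes per talker. If you ever need the same answer from an exported packet list, something like this sketch would do. The rows in `packets` are invented for illustration (they are not from this capture), and `top_talkers` is a made-up name.

```python
from collections import Counter


def top_talkers(packets, n=3):
    """Sum bytes per source address and return the n heaviest senders."""
    bytes_by_source = Counter()
    for source, _destination, length in packets:
        bytes_by_source[source] += length
    return bytes_by_source.most_common(n)


# Hypothetical (source, destination, frame length) rows, for illustration only.
packets = [
    ("10.10.200.11", "255.255.255.255", 342),
    ("10.10.200.11", "255.255.255.255", 342),
    ("10.10.200.42", "10.10.200.1", 98),
    ("10.10.200.11", "255.255.255.255", 342),
]
for source, total in top_talkers(packets):
    print(f"{source:<16} {total:>6} bytes")
```

Whoever lands at the top of that list is the first device worth walking over to.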
We looked around for a rogue DHCP server and couldn't find any, so
that problem might have to wait until it recurs.
Years ago, I'd instrumented some of their network equipment to report
statistics to a central agent. Time and time again, when faced with
unknowns, this tool and its data have been key in seeing at
least where to go looking for a particular problem. In this case, it
showed not only the global problem but also that the global problem was
indeed gone:
These show the network traffic on various ports of the factory's
switch before 10:45 AM when the multicast traffic was quieted down and
the Cisco WAP was brought down to see what it was doing, then put back
up.
I left the site around Noon and monitored the network traffic graphs
remotely for the next two weeks.
This concludes part 1. But that implies a part 2. Yes, there is more!
The half eaten elephant gets reanimated - Click Here!