Symptoms of the problem
I was on site at a new client, analyzing their network to
figure out why there were intermittent network issues.
My laptop seemed to have it worse than most: connections out to
the internet kept dropping while local connections stayed up.
Here is a snip of me pinging an address two hops out from the
client's network:
Sometimes it works, sometimes it doesn't. When the network
looks broken, it stays that way for 15-45 seconds, then comes
back. If you were hitting a web site, it would appear like the
site was offline, then online, then offline, etc. When sending
an email, sometimes it would send, sometimes it would take a
minute before it would send. In essence, "The internet is flakey
and often goes down."
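To put numbers on "sometimes it works, sometimes it doesn't", the results of a continuous ping can be collapsed into outage windows. The following is a minimal sketch, not the tooling used for this article: the function name `outage_windows` and the sample data (one ping per second with a made-up 20 second gap) are illustrative assumptions.

```python
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse (timestamp_seconds, reply_received) ping samples into
    (start, end) windows during which no reply came back.

    A window ends at the timestamp of the first reply after the gap, or at
    the last sample if the capture stops while the network is still down.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # first lost ping opens a window
        elif ok and start is not None:
            windows.append((start, ts))     # first reply closes it
            start = None
        last_ts = ts
    if start is not None and last_ts is not None:
        windows.append((start, last_ts))    # capture ended mid-outage
    return windows


# Hypothetical data: one ping per second, replies lost from t=10 up to t=30.
samples = [(t, not (10 <= t < 30)) for t in range(40)]
for begin, end in outage_windows(samples):
    print(f"outage from t={begin}s to t={end}s ({end - begin}s)")
```

Fed with real timestamps, a list of windows like this makes it easy to check whether the outages really cluster in the 15-45 second range.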
When Comcast tech support checks, they see no problems at all on the
circuit - "...everything was perfect, sorry, the problem must be on your
end. Have you tried turning it off and turning it on again?"
During the times the internet appears to not work, the local network
is fine - you can still see the drive shares, printers, wireless
devices, etc. so symptomatically it walks, talks, and acts like an
internet flake-out.
Sniffing the network
If you've read any of the other articles in the Cool Stuff
section, probably 30% of them show the data as it appears on the
network - known in the industry as a "network sniff". The tool I
used in this case is Wireshark, so this is about to get very
technical about what data is flowing on the wire and how the
devices are reacting to those packets. This is going to get deep
- if you own a propeller hat, now is the time to put it on.
The Comcast / XFinity router in question is a Cisco DPC3941B, though
it is branded "technicolor" on the back. The exact model number is
"DPC3941 - BMCV214 - K9". From what web searching I did, I found other
Cisco routers exhibiting the same symptoms. That router is at
192.168.19.254. My laptop is at 192.168.19.122 - and the first clue
shows up right away in the network sniff:
If you expand that you'll see Wireshark is complaining "duplicate use
of 192.168.19.254 detected!"
Expert diagnostic tip: When you encounter a duplicate IP address, the
easiest way to locate the device is to open up a continuous ping window,
disconnect the device you know is sitting at that IP address, clear your
ARP table, and see the IP address re-appear echoing pings... then check
the ARP table in various switches to track down which port sees that IP
/ MAC address.
Another way to do it when you have stupid switches or can't get
administrative access to the smarter switches is to open up a continuous
ping window and start unplugging / plugging cables back in watching for
which cable stops the pings. This is invasive - make sure it is OK to be
network disruptive first!
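The duplicate-address check itself is simple to express: collect the (IP, MAC) pairs seen in ARP traffic and flag any IP claimed by more than one MAC. This is a minimal sketch of that idea only, not how Wireshark implements its warning. The function name `find_duplicate_ips` and the MAC addresses in the example are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_ips(observations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return {ip: [macs...]} for every IP address seen with more than one MAC.

    MACs are lower-cased and converted to colon notation first, so the same
    address written with dashes is not counted twice.
    """
    seen = defaultdict(set)
    for ip, mac in observations:
        seen[ip].add(mac.lower().replace("-", ":"))
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}


# Hypothetical observations; these MAC addresses are placeholders.
observed = [
    ("192.168.19.254", "AA:AA:AA:AA:AA:01"),
    ("192.168.19.254", "bb-bb-bb-bb-bb-02"),
    ("192.168.19.254", "aa-aa-aa-aa-aa-01"),  # same MAC as the first, other notation
    ("192.168.19.122", "cc:cc:cc:cc:cc:03"),
]
print(find_duplicate_ips(observed))
```

An IP that shows up in the result is the one to chase with the ping-and-unplug methods described above.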
In this case, since it was a small office I disconnected the main
router which I knew was at 192.168.19.254, cleared the ARP cache with
the "arp -d" command, and ... my ping window never echoed any pings.
Which didn't make ANY sense - where was the other device that was
sitting on the same IP address?
So instead of plugging into the main switch, I plugged into JUST the
Comcast router and had nothing else connected to it except for Comcast's
cable. Sniffing, I saw that same error message.
A HUGE CLUE!
I really didn't believe what I was seeing, so, as the only
device plugged into that Cisco / Comcast / Xfinity / Technicolor
router and with the ping window running, I dumped the ARP table
both while the network was working and while it was not:
Something is changing the MAC address associated with the router!
When it worked, the MAC address was 3a:17:e1:e6:ee:db. When
it didn't work, the MAC address was "00:05:04:03:02:01" (or
"00-05-04-03-02-01" when written with dashes). When Wireshark resolves
the vendor prefix of that MAC address, it looks like "NarayInf_03:02:01".
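Comparing two ARP table dumps by eye works, but the comparison can also be scripted. The sketch below assumes the Windows `arp -a` layout (IP address, dash-separated MAC, entry type). The two sample dumps are reconstructed from the addresses quoted in this article, not copied from the original screenshots, and the interface index in them is a placeholder. The helper names are illustrative.

```python
import re
from typing import Dict

# One ARP entry: IPv4 address, MAC with dashes or colons, then the entry type.
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:[-:][0-9A-Fa-f]{2}){5})\s+(\w+)"
)


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC and use colons, so both notations compare equal."""
    return mac.strip().lower().replace("-", ":")


def parse_arp_table(text: str) -> Dict[str, str]:
    """Map IP address -> normalized MAC for every entry line in the dump."""
    table = {}
    for line in text.splitlines():
        match = ARP_LINE.match(line)
        if match:
            table[match.group(1)] = normalize_mac(match.group(2))
    return table


def gateway_mac_changed(before: str, after: str, gateway_ip: str) -> bool:
    """True if both dumps list the gateway but with different MAC addresses."""
    old = parse_arp_table(before).get(gateway_ip)
    new = parse_arp_table(after).get(gateway_ip)
    return old is not None and new is not None and old != new


# Reconstructed sample dumps (assumed layout, placeholder interface index).
WORKING = """
Interface: 192.168.19.122 --- 0xb
  Internet Address      Physical Address      Type
  192.168.19.254        3a-17-e1-e6-ee-db     dynamic
"""
BROKEN = """
Interface: 192.168.19.122 --- 0xb
  Internet Address      Physical Address      Type
  192.168.19.254        00-05-04-03-02-01     dynamic
"""
print(gateway_mac_changed(WORKING, BROKEN, "192.168.19.254"))
```

Normalizing the MAC first matters because the same address appears with colons in Wireshark and with dashes in the Windows ARP table.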
Some web searching on that MAC address gave me only 4 pages of hits,
one of which says it is the MAC address of the Cisco ASA firewall:
https://forums.xfinity.com/t5/Your-Home-Network/unknown-node-on-my-LAN/td-p/2407079
but doesn't say much more about it.
Here is a Wireshark snip: the pink section at the top shows three
different pings all getting responses, whereas the pink rows at the
bottom are all echo requests which are never answered... and something
in between is the culprit.
Let us look at the ping that does echo and the one that
doesn't echo - here is packet 847:
and packet 913:
So packets sent to MAC address 3a:17:e1:e6:ee:db (the router's
correct address) are properly forwarded to their destination and the
ping echoes. Packets sent to MAC address 00:05:04:03:02:01 are swallowed
by the router and don't go anywhere!
What triggers this?
The router, using MAC address 00:05:04:03:02:01, asks for the MAC
address of my computer:
This shows 192.168.19.254 at the funky MAC address asking where
192.168.19.122 is. The reply is what screws everything up:
My laptop answered the request with "192.168.19.122 is here! Find me
at MAC address dc:4a:3e:17:08:94!"
My laptop saw that the request it was answering came from 192.168.19.254
at MAC address 00:05:04:03:02:01 and decided to change the MAC address it
had on file for the router to match.
At this point, as far as anyone trying to access the internet from my
laptop was concerned, "the internet is broken."
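The cache update described here matches the receive algorithm in RFC 826: when an ARP packet arrives from a sender IP that is already in the table, the stored hardware address is overwritten, whether the packet is a request or a reply. The following is a simplified model of that rule only. It is not a claim about exactly what the Windows network stack does, and the function and dictionary field names are assumptions for illustration.

```python
from typing import Dict, Optional


def receive_arp(cache: Dict[str, str], my_ip: str, my_mac: str,
                packet: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Apply a simplified RFC 826 receive step to one ARP packet.

    packet keys: op ("request" or "reply"), sender_ip, sender_mac, target_ip.
    Mutates `cache` (ip -> mac) and returns the reply to send, if any.
    """
    merged = False
    if packet["sender_ip"] in cache:
        # Known sender: overwrite whatever MAC was cached for it.
        cache[packet["sender_ip"]] = packet["sender_mac"]
        merged = True
    if packet["target_ip"] != my_ip:
        return None                       # not addressed to this host
    if not merged:
        cache[packet["sender_ip"]] = packet["sender_mac"]
    if packet["op"] == "request":
        return {
            "op": "reply",
            "sender_ip": my_ip,
            "sender_mac": my_mac,
            "target_ip": packet["sender_ip"],
            "target_mac": packet["sender_mac"],
        }
    return None


# Addresses taken from this article; the laptop starts with the good entry.
cache = {"192.168.19.254": "3a:17:e1:e6:ee:db"}
request = {
    "op": "request",
    "sender_ip": "192.168.19.254",
    "sender_mac": "00:05:04:03:02:01",
    "target_ip": "192.168.19.122",
}
reply = receive_arp(cache, "192.168.19.122", "dc:4a:3e:17:08:94", request)
print(cache)   # the router entry now points at the funky MAC
print(reply)
```

In this model a single request from the second MAC address is enough to repoint the gateway entry, which lines up with the working and non-working ARP dumps above.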
(side question on the off chance someone from Cisco or Comcast
happens to read this and cares to share an answer: Why is one device
advertising itself with two MAC addresses? And if it is going to
advertise itself with a second MAC address, why is it acting differently
depending on which MAC address the packets are sent to? I think this is
a bug, but I don't know enough about the internals to support that
statement.)
Let me throw in another data point: it doesn't matter whether you are
on wireless or wired, the router acts the same way, as does the
computer exhibiting this symptom.
As for the trigger, I consistently saw this right after the ARP
request / reply, just after the internet went out:
I'd heard of SSDP but never really looked into it. Lots has been
written on Universal Plug-and-Play (UPnP) and the problems it can be
the root of, so I really don't want to repeat all that here. Feel free
to research it elsewhere.
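For orientation only: SSDP messages are plain text in an HTTP-like format carried over UDP, so they are easy to pick apart when they show up in a sniff. The sketch below parses one such payload. The sample is a typical discovery request written out by hand, not the packet from this capture, and the function name `parse_ssdp` is an assumption.

```python
from typing import Dict, Tuple


def parse_ssdp(payload: bytes) -> Tuple[str, Dict[str, str]]:
    """Split an SSDP datagram into its start line and a header dictionary.

    Header names are upper-cased because they are case-insensitive.
    """
    text = payload.decode("utf-8", errors="replace")
    lines = text.split("\r\n")
    start_line = lines[0]
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        if not line:
            break                          # blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().upper()] = value.strip()
    return start_line, headers


# Hand-written example of a typical discovery request (illustrative only).
sample = (
    b"M-SEARCH * HTTP/1.1\r\n"
    b"HOST: 239.255.255.250:1900\r\n"
    b'MAN: "ssdp:discover"\r\n'
    b"MX: 2\r\n"
    b"ST: ssdp:all\r\n"
    b"\r\n"
)
start, headers = parse_ssdp(sample)
print(start)
print(headers["HOST"], headers["ST"])
```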
Tracing who is consuming what in the computer
Drilling into the SSDP world a bit, the traffic happens on UDP
port 1900. To see if this was indeed what the computer was
listening and reacting badly to, I had to see where in the system
those packets landed and test whether shutting that down made the
internet less flakey.
The netstat command is one way to see which processes are listening
to which ports:
(I took this snip while writing the article, not while connected to the
client's network, so that first line that starts with UDP doesn't show my
address at their site but my address when hooked to the Microsoft
corporate network)
I could also use the Sysinternals tool TCPView, which gives more of a
live view than netstat does.
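Pulling the owning process ID out of netstat output can also be scripted. This minimal sketch assumes the column layout of Windows `netstat -ano`, where UDP rows have no State column. The sample text is made up apart from the port number and PID 2912, and the function name `udp_listeners` is illustrative.

```python
from typing import List


def udp_listeners(netstat_text: str, port: int) -> List[int]:
    """Return the sorted PIDs bound to the given UDP port.

    Expects `netstat -ano` style rows: Proto, Local Address,
    Foreign Address, PID (UDP rows carry no State column).
    """
    pids = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[0].upper() == "UDP":
            local_port = fields[1].rsplit(":", 1)[-1]   # also handles [::1]:1900
            if local_port == str(port) and fields[3].isdigit():
                pids.add(int(fields[3]))
    return sorted(pids)


# Illustrative sample in the assumed layout (not the original screenshot).
SAMPLE = """
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       996
  UDP    192.168.19.122:1900    *:*                                    2912
  UDP    127.0.0.1:1900         *:*                                    2912
  UDP    [::1]:1900             *:*                                    2912
  UDP    0.0.0.0:5353           *:*                                    1744
"""
print(udp_listeners(SAMPLE, 1900))
```

The PID that comes back is the one to look up in Process Explorer or TCPView.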
The right column says it is process ID 2912, so I turned to the
Sysinternals tool Process Explorer and confirmed which svchost.exe
process was hosting SSDP and its dependents:
As a test, I turned off SSDP while connected to the client network,
pinging and watching the replies and sniffing to see the data on the
wire. The problem got better but recurred, because SSDP restarted on its
own after being stopped - possibly due to stimulation from the network, or
maybe it just wants to run. Not sure yet. I stopped it again, it started
again, so I ran services.msc and disabled it and its dependents:
Once the SSDP Discovery service was down hard (aka: disabled), the
problem was completely gone.
DO NOT JUST HARD DISABLE THIS SERVICE!!
Or, to put it differently, "Kids, don't try this at home!"
What I did was a diagnostic step to trace down to the root of the
problem ... which might end up being the magic fix, it might not be.
As of the moment where I wrote this line of the article, I don't have
the root problem & fix yet.
I do know there are systems that are on this network that are not
reacting badly to these packets, so this is not a global problem.
I also know I have tons of things on my system that aren't typically
installed on most computers.
More to come (hopefully)
This isn't a complete article yet - it is a work in progress.
I have the client problem solved for their needs and there are
other things I need to work on for them, so this not being an
active problem anymore lowers its priority and thus how much
time I can dedicate to solving it to a 100% conclusion.
But it is still a problem and thus won't let me put it down so I'll
likely poke at this in my own / spare time until I can explain it
completely.
If you've seen something like this, I'd love to hear about it - there
is a pattern and more data will help bring that pattern to light.
You can reach me at the link on the contact page. Please reach out! I
respond to everyone that isn't a spammer.
Thanks!
David Soussan
Copyright (C) 2018 DAS Computer Consultants, LTD. All rights
reserved.