Unique solutions to uncommon system problems

Solutions

How Soussan’s Clients Describe His Problem Solving Abilities:

“I’ve convinced the Supplier Management Team and the S/W team that you can cure world hunger, the Bird Flu and prevent hurricanes from hitting the U.S. with a few simple lines of code and some testing.”

— Aerospace Industry Client, Oct 2005

“I was thinking of you the other day when I heard on the radio that the Mars rover, Spirit, was having problems – symptoms + power up, load the program, shut down, over and over. I thought to myself, ‘I bet David could fix that if they let him take a look at the system for a few minutes…’”

–– Government Transportation Project Manager, Jan 2004

Sampling of Soussan’s “Toolbox”

Throughout Soussan’s engineering career, he’s “touched so many tools that now it’s at such a critical mass, I can move pretty seamlessly from one program to the next.” Nonetheless, here is a small sampling of what is in his toolbox:

Borland C++ Builder & many other C compilers
Borland Delphi
Microsoft Visual Basic (6, 4, & Office VBA, Visual Studio 2005, Visual Studio 2010, Visual Studio 2013)
Windows NT (3.5 through XP & Vista, NT server 4.0, Server 2000, Server 2003, Server 2008, SBS Server 2003, Server 2008, Server 2012)
Windows CE.NET
Solaris 7 & 8, administration & C, C++, Informix ESQL/C, Shell programming
Active Server Pages (ASP), ASP.net
OS-9, OS-9000, 68xxx and ARM processors
68HC11 & other 68xxx Assembly
Motorola DSP56002
Intel assembly from 8080 through Pentium architecture
Local & Wide area networking (physical connections, routing, DNS, mail pointers, etc.)
Wireless (Point to point, point to multipoint, microwave)
Microsoft Small Business Server 2003
Microsoft Exchange 2003, Exchange 2007, Exchange 2010
Microsoft SQL Server 2000, 2005, 2008
Crystal Reports (more than 1,000 different reports written)
Network protocol analysis
Security scanning & network cleaning
Malware analysis & remediation - but you shouldn't clean anymore! See article in Cool Stuff link.

There are some very detailed 'how-to' articles I threw together in the 'Cool Stuff' link. They guide you through some very complex problems and cut to the solution for things I've seen others struggle with. They don't look as nice as these pages, but they are all meat with no potatoes. Plus they generate a lot of traffic, so I'll probably focus my spare time on writing more of those and fewer case studies.

What follows are all high-level problems and solutions that might be of interest.

Case Study: VI. One of the many problems with the World's Largest Intelligent Transportation System [ITS] (1996-2004)

Background

An Intelligent Transportation System, or ITS as it is referred to in the industry, is a system that brings travel and traffic data to a central location, analyzes it, and disseminates that information back out to the motoring public. Usually this is done with a combination of technologies, such as road sensors, video cameras, electronic message signs, radio broadcasts (HAR: Highway Advisory Radio), a telephone automated response system (HAT: Highway Advisory Telephone), etc. The company contracted to do this for the state was having a lot of difficulty getting the project working. What started off as a 6-month contract turned into an 8-year career spanning redesign through maintenance.

There are many cases revolving around this project.

Challenge (2006)

During an expansion attempt, the new maintenance company was trying to get video and camera control signals to work over radio-based Ethernet equipment. After many weeks and many false starts, they got the video coming through but couldn't get the camera control to work right.

It started off working, then over time a slowly creeping delay would make camera control worse and worse. Eventually, you would hit a command like panning the camera right and a good 30 seconds later the camera would start panning right. You would let go of the joystick and the camera would continue panning right until 30 seconds had elapsed, then the camera would stop moving. I was no longer working regularly on the project, but having just a little bit of experience with the system I got the call for help.

Problem details

The camera control data stream is a continuous 4800 baud RS422 stream of characters. When nothing is happening, sync sequences are sent continuously, and the camera visually displays whether it sees those synchronizing sequences.

Unfortunately, Ethernet and other packet-based data transmission methods aren't designed to send a continuous, uninterrupted stream of data. They build up characters (bytes) into larger units called packets and send those a big chunk at a time. This design minimizes the per-packet overhead, which is good. It also breaks up the timing of continuous data into bursts, which is bad if you need continuous streams of data.
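To put rough numbers on the 30-second lag described earlier, here is a small back-of-the-envelope sketch. It assumes 10 bits on the wire per character (start bit, 8 data bits, stop bit); the actual framing of the camera link is not stated above, so treat that figure as an assumption.

```python
BAUD = 4800           # line rate of the camera control stream (from the text)
BITS_PER_CHAR = 10    # assumption: 8N1 framing (start + 8 data + stop bits)


def chars_per_second(baud: int = BAUD, bits_per_char: int = BITS_PER_CHAR) -> float:
    """Characters the serial stream produces every second."""
    return baud / bits_per_char


def backlog_chars(delay_seconds: float) -> float:
    """Characters that must be queued ahead of a command to delay it this long."""
    return delay_seconds * chars_per_second()


if __name__ == "__main__":
    print(chars_per_second())   # 480.0
    print(backlog_chars(30))    # 14400.0
```

Under that assumption, a 30-second delay means roughly 14,400 characters (mostly sync sequences) are sitting in buffers ahead of every joystick command.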

There are many ways around this, some of which can be done within the firmware of various data conversion devices. When I arrived on site, they were experimenting with protocol converters and serial-to-Ethernet converters, which added ~$850 to the cost of each fielded camera, not to mention the added points of failure and maintenance required.

Soussan’s Solution

I was able to confirm the creeping data delay pretty quickly, then managed to make things run a little better by changing some of the firmware and configuration settings inside the serial-to-Ethernet converters. But I still wasn't happy with the results: the control delays, while significantly better, were still slowly creeping higher. Plus, I hate throwing hardware at a problem that really isn't a hardware problem.

Having reverse engineered the camera control protocol years earlier, I already had a PC with serial ports that interpreted those camera commands. It was already on the network and had a spare serial output that the video CODECs could use for camera control. The CODECs' serial inputs had been tried before the external Ethernet-to-serial boxes and had performed even worse, so the client company had abandoned that method.

I resurrected it.

With about 25 lines of Visual Basic code, I was able to send both the synchronization sequences and camera control commands out to the CODECs via the serial port for just the cameras that were using the radio-based Ethernet. Because the sync sequences were no longer sent all the time, there was no continuous data left to buffer up and cause control delays.
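The original Visual Basic code is not reproduced here, so the following is only a sketch of the general idea in Python: send each command as soon as it arrives, and send a sync burst only occasionally while the line is idle. The `SYNC` bytes, the `SYNC_INTERVAL` value, and the `plan_writes` function are all hypothetical placeholders, not the real protocol or the real program.

```python
from typing import Iterable, List, Tuple

SYNC = b"\x16\x16"     # hypothetical placeholder; the real sync bytes are not given
SYNC_INTERVAL = 1.0    # hypothetical: seconds of idle time between sync bursts


def plan_writes(commands: Iterable[Tuple[float, bytes]],
                end_time: float,
                sync_interval: float = SYNC_INTERVAL) -> List[Tuple[float, bytes]]:
    """Return (time, payload) serial writes.

    Every command goes out at its own timestamp; a sync burst goes out only
    when the line has been idle for a full sync_interval.
    """
    writes: List[Tuple[float, bytes]] = []
    last_write = 0.0
    for when, payload in sorted(commands):
        # Fill the idle gap before this command with sparse sync bursts.
        t = last_write + sync_interval
        while t < when:
            writes.append((t, SYNC))
            t += sync_interval
        writes.append((when, payload))
        last_write = when
    # Keep the link synchronized after the last command, still sparsely.
    t = last_write + sync_interval
    while t <= end_time:
        writes.append((t, SYNC))
        t += sync_interval
    return writes


if __name__ == "__main__":
    for when, payload in plan_writes([(2.5, b"PAN_RIGHT")], end_time=4.0):
        print(when, payload)
```

Because the line is mostly quiet between writes, there is nothing queued in front of a command when the operator moves the joystick.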

And just to put some icing on the cake, each camera link now cost $850 less in hardware and 4 fewer power supplies for those boxes, fewer cables & connections, etc. Multiply that by the 20 or so new cameras, and add in the labor to configure it all, test, and maintain for the next 10-20 years... I'm no accountant, but that is some pretty good ROI for a few days of work.

Case Study: V. One of the many problems with the World's Largest Intelligent Transportation System [ITS] (1996-2004)

Challenge (1998)

The design called for video, data, and camera control to be available from another location, which would be implemented by an "intertie" between the two sites. The original engineers designed the intertie as a 19.2 Kb/s serial data link over copper, fiber and lastly microwave. Creating the software that would send data over the link was estimated at 6 months and would involve a lot of custom protocol design work to translate the database access into a serial stream to send over the link. Video was on its own microwave subcarrier, with the serial data stream for camera control riding alongside in yet another 19.2 Kb/s link.

Problem details

The flaw in this design was all the custom work needed to get the traffic data flowing through a serial link. When calculated out, the bandwidth of the serial link would allow updates only every 5 minutes. Worse still was the effort of designing and coding the data transfer over a serial link.

Soussan’s Solution

Looking over specifications for the existing equipment, another engineer and I allocated some existing bandwidth from the video encoder / decoder to provide T1 bandwidth on one of the links. The microwave had more than enough bandwidth to support that same signal, so it was modified as well. Lastly, two Ascend Pipeline routers were acquired to accomplish the Ethernet to T1 and back routing. All the small serial data streams were coalesced into one big T1.

The entire setup was prototyped during a week of holiday shutdown, with all the communications equipment between the two sites simulated by a T1 crossover cable.

Now instead of 19.2 Kb/s of bandwidth between the two sites, there was 1.5 Mb/s. Plus, there was very standard TCP/IP connectivity, as the two sites were different networks connected by a private link. This allowed standard Visual Basic and ODBC connectivity between two databases instead of custom serial port communications, which cut the development effort down to 1-2 months once the link was proven. Camera control was sent over a TCP socket to a different VB application.

"Design for Manufacturing" applies often in the world of physical products. It is sadly absent from many software and systems designs. Intimate understanding of multiple technologies allowed the redesign to simultaneously cut months out of the schedule and increase bandwidth on the link by nearly two orders of magnitude (roughly 80 times).

The installation was relatively seamless. The blue cable representing all the T1 stuff between the two sites was replaced with the equipment at each site, one connection problem was resolved, and the entire link came up functional the same day.
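The bandwidth figures in this case study can be checked with a few lines of Python. This is only a rough sketch: it uses the 1.5 Mb/s figure quoted above (the nominal T1 line rate is 1.544 Mb/s), and it assumes the 5-minute update interval was limited purely by the raw link rate, ignoring protocol overhead.

```python
SERIAL_BPS = 19_200       # original intertie design (from the text)
T1_BPS = 1_500_000        # "1.5 Mb/s" as quoted in the text
UPDATE_SECONDS = 5 * 60   # update interval the serial link would have allowed


def speedup(new_bps: float, old_bps: float) -> float:
    """How many times faster the new link is than the old one."""
    return new_bps / old_bps


def transfer_seconds(payload_bits: float, link_bps: float) -> float:
    """Time to move a payload over a link at its raw bit rate."""
    return payload_bits / link_bps


if __name__ == "__main__":
    # Assumption: one update is whatever fills the serial link for 5 minutes.
    update_bits = SERIAL_BPS * UPDATE_SECONDS
    print(speedup(T1_BPS, SERIAL_BPS))                   # 78.125
    print(round(transfer_seconds(update_bits, T1_BPS), 2))  # 3.84
```

Under those assumptions, a data set that needed five minutes on the serial link would cross the T1 in about four seconds.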

Case Study: IV. One of the many problems with the World's Largest Intelligent Transportation System [ITS] (1996-2004)

Challenge (1999)

The communications for expressway segments are broken down into regions. When first started, the system works fine. Then about 24 hours later, communications with the message signs in one region become intermittent, then stop working completely. About 8 hours after that, most of the sites generating traffic data are also reporting errors.

Problem details

Detailed analysis showed that the initial round-trip travel time for a message to the field devices and back was 0.25 seconds. 24 hours later, that time had increased to 0.4 seconds. It would max out at just over 0.6 seconds, which exceeded the system's design limits for communications round-trip times. Increasing the time the system waits for responses was not an option, as there were too many devices to communicate with every minute if the timeouts were set too high.

To diagnose, the system was allowed to degrade to its longest delay possible, then round-trip travel times were measured at various points in the system. The root cause ended up being the Cylink (now PCom) 900 MHz Spread Spectrum radio, which was the device used to get data from the tower to the individual field sites.

We contacted Cylink, who knew of the bug in the product. It was a 'creeping buffer delay' that got worse over time, and the product, which was still being manufactured but no longer enhanced, wasn't going to be fixed. Ever.

"Buy our new product, it doesn't have that problem" was their solution.

There were 180 or so radios in the field. At $2000 each, replacing them would cost $360,000 in parts alone, not counting the labor or the effort of discovering what other strange anomalies the new product had and how to fix or work around them.

Soussan’s Solution

With more analysis, the problem was isolated to the "master" radio, which is the radio at the tower that communicates with all the "slave" radios at the field sites. With the system fully operational but in the high delay failing state, various tests were performed to see if the delay could be eliminated from the communications path.

We found that completely resetting the master radio by removing and reapplying power eliminated the delay. It would still creep up over the next 24-36 hours, but the initial delay was reset back to its 'best case' operating conditions.

A trip to Radio Shack yielded a digital light timer which was set to "turn off" the lights at 4:00 AM, then turn them back on at 4:01 AM every day. That timer was attached to the power circuit of the master Cylink radio. The timer is made by Intermatic and OEMmed to Radio Shack. The same solution has solved other problems where a quick reset on a regular schedule is required.

Now, every 24 hours the added delay is reset back to zero, starts creeping back up, and is reset again before it gets high enough to cause problems.
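The effect of the daily power cycle can be pictured as a sawtooth. The sketch below uses the round-trip figures quoted earlier (0.25 seconds at best, 0.4 seconds after 24 hours, a plateau of just over 0.6 seconds) and assumes the creep is linear in between, which was not measured; the real radio may well have behaved differently.

```python
from typing import Optional

BEST_CASE_S = 0.25    # round trip right after a power cycle (from the text)
AFTER_24H_S = 0.40    # round trip 24 hours later (from the text)
PLATEAU_S = 0.60      # the text says "just over 0.6"; 0.6 is used as a floor

# Assumption: the delay grows linearly between the two measured points.
CREEP_PER_HOUR = (AFTER_24H_S - BEST_CASE_S) / 24


def round_trip(hours_since_start: float,
               reset_every_hours: Optional[float] = None) -> float:
    """Modelled round-trip time, optionally with a periodic power-cycle reset."""
    hours = hours_since_start
    if reset_every_hours:
        hours = hours_since_start % reset_every_hours   # timer restarts the creep
    return min(BEST_CASE_S + CREEP_PER_HOUR * hours, PLATEAU_S)


if __name__ == "__main__":
    print(round(round_trip(100), 3))                        # 0.6
    print(round(round_trip(23, reset_every_hours=24), 3))   # 0.394
    print(round(round_trip(24, reset_every_hours=24), 3))   # 0.25
```

In this model the delay with a 24-hour reset never reaches 0.4 seconds, while without the reset it climbs to the plateau and stays there.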

Total cost of this solution was $19.95. Plus sales tax.

Which is a whole lot cheaper than replacing all the radios for $360K plus labor!

Case Study: III. Building the World's Largest Intelligent Transportation System [ITS] (1996-2004)

Background

Same as Case Study VI, as this is the same project.

Challenge (1997)

Company1 couldn't figure out why one data path across ~9 miles of a fiber ring and 16 cabinets along the expressway wasn't working reliably. Other engineers assigned to the problem came up with no explanation. A non-technical CEO of company2, interested in buying that division of company1, needed to understand the level of problems present in the project implementation schedule. This was more a people and communications challenge than a technical challenge.

Soussan’s Solution

A little diagnosis and research showed that the equipment chosen was never designed to be connected in a ring configuration. Instead, it was designed for a central point and two spokes outward, with the tips of the spokes never touching each other. The only reason the ring configuration was working at all was that there were some intermittent breaks in the ring, which made it look like two spokes of a wheel. When the ring was good, the equipment didn't work right; when the ring was broken, the equipment worked fine.

By changing out the cabinet equipment for devices designed to be connected in a ring manner and fixing the intermittent fiber problems (cleaning, connectors, and some fiber breaks), that particular communications system came up functional. The original designers didn't specify the correct equipment for the design they had created. Nobody else working on the project could figure this out.

Company1 was selling the division charged with creating the ITS. A potential buyer interviewed everyone involved in the project; among their concerns were meeting the scheduled delivery dates, the costs associated with the project, and future warranty liabilities once the system was accepted. With the CEO conducting the interviews personally, the challenge became explaining, in terms upper management understands, the depth and scope of highly technical issues and how they would affect the liabilities about to be purchased.

It was the mid-spring and I'd been hired as a consultant--actually working on the project, not just advising--for 6 months. The potential new CEO asked "So, how is the project coming along? Are you going to make the scheduled deliverable in December?"

"I've got a paycheck of mine against a paycheck of yours that says this team is going to miss that date and going to miss it badly. In fact, if you doubled the staff and everything else goes perfectly, you'll be lucky to come in at just a year over schedule."

The CEO looked like I'd just hit him over the head with a two-by-four. Over the next half hour, we explored my statement in detail. I went over the problem just discovered and fixed, why it was a design-level problem, and why design problems cost 100 times as much to fix once they are fielded, all de-technobabbled. Then I projected out the other data paths that weren't yet working, the fact that nobody yet knew why they weren't working, and the kinds of skills it takes to debug those kinds of problems.

Worst of all, the entire project was currently prototyped in the field, whereas it should have been prototyped in the lab.

Seven years later, over dinner with the CEO who did buy the company but avoided buying the significant liabilities of this project, I asked, "When you interviewed everyone, I know I shocked you with my view of the schedule. What did the rest of the engineers have to say about that same question?"

"They hemmed and hawed, said things were mostly OK, and they'd probably not make December but would be a couple of months late. Nobody said what you said or as strongly, but you were right."

Case Study: II. The Best Design Never Implemented (1990)

Challenge

Company needed to update the software in thousands of emission analyzers in the field. The first updates were handled by field technicians visiting every unit, opening it up, plugging in a case with a secured hard drive inside, and waiting 20 minutes for the software update to complete. This was cost prohibitive as a long-term solution.

Soussan’s Solution

These testers all had internal 9600 bits-per-second modems installed for sending data to the various state agencies interested in the smog test results. During an interview with Motorola, David was asked "What was the best design you ever had that wasn't implemented?" and this was his answer:

Back in the 1980s there was a commercial for a hair care product that went something like "... you'll love it so much you'll tell two friends. And they'll tell two friends. And they'll tell two friends. And so on, and so on, and so on." each time showing all those people in smaller square boxes on the TV.

The idea was to divide up all the phone numbers into local calling zone groups, make lists, then "We update two testers ... and they update two testers... and so on, and so on, ..." Given that the dial and transfer time to update one tester was about two hours and there was an 8 hour window when the testers would answer an incoming phone call, I calculated that in 4 nights, best case, 32,768 testers could be updated. After each update, they would "phone home" to an 800 number to check themselves off the list. After a week, any failures could be re-seeded, retried, and visited only if necessary, thus saving thousands of field tech visits.
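The assumptions behind the 32,768 figure are not spelled out above, so here is one simple doubling model that happens to reproduce it. It assumes the first two-hour slot seeds a single tester and that, in every later slot, each updated tester updates exactly one more. The function name and the seeding rule are illustrative assumptions only.

```python
HOURS_PER_UPDATE = 2   # dial and transfer time per tester (from the text)
WINDOW_HOURS = 8       # nightly window when testers answer calls (from the text)
NIGHTS = 4             # number of nights in the best-case estimate (from the text)


def testers_updated(nights: int = NIGHTS) -> int:
    """Best-case count of updated testers under a simple doubling model."""
    slots = nights * (WINDOW_HOURS // HOURS_PER_UPDATE)   # 16 slots in 4 nights
    updated = 0
    for slot in range(slots):
        if slot == 0:
            updated = 1     # assumption: the first slot seeds a single tester
        else:
            updated *= 2    # every updated tester updates one more
    return updated


if __name__ == "__main__":
    print(testers_updated())   # 32768
```

Four nights of four slots each gives one seeding slot plus 15 doublings, which is 2 to the 15th power, or 32,768.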

The company could have sent $5 to each service station to cover any phone charges and still come out way ahead.

In the end, the company wanted to maintain all the dialing and control locally, so the second solution discussed in the case below was implemented. But still, it is one of the best ideas that never got implemented. Keep in mind, this was all pre-public internet.

Case Study: I. Automated Auto Emissions Testers (1989)

Challenge

Company was committed to providing state governments with aggregated data from all fielded auto emission testers in four different states so the states could prove compliance with federal air quality regulations and receive federal highway funds. The company had stopped providing the data because the existing PDP-11 based system for reading data tapes did not work properly. States were about to take the company to court for major damages per contractual commitments. The internal company IT department said they could do the software and hardware for one state with $800K in hardware, four software engineers and six months. Each additional state would cost an additional, unspecified amount.

Soussan’s Solution

Company VP of engineering called Soussan and asked, “What can you do?” After a couple of days of analysis and design, Soussan offered, “I can do that project in three months with one other software engineer and $50 K in hardware.” Soussan got the project. End results were on time, under budget and the output complied 100% with each state’s data requirements, averting the lawsuits.

In the old system, the cycle time for loading tapes was three weeks per state per month. The new system that Soussan set up was 80386 PC based, written in C using a custom, time slice, multi-processor that used multiple serial port hardware to operate eight tape drives simultaneously and independently. After the system was designed and tested, the usage was a bit awkward, as the user was constantly going to a tape drive and then to the keyboard when loading or unloading a tape. Custom hardware modifications to the tape drive allowed the PC to sense the tape’s motor, and software intelligence turned the tape loading process into a race between the operator and eight tape drives. The new system allowed one person to completely read all tapes for one state in six hours.
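The original system was written in C and its design is not shown here. Purely as an illustration of the general idea of time slicing between several tape drives, the Python sketch below gives each simulated drive one small slice of work in turn. The names `read_tape` and `round_robin` and the simulated block reads are hypothetical.

```python
from typing import Iterator, List


def read_tape(drive: int, blocks: int) -> Iterator[str]:
    """Simulated tape drive: hands back one block per time slice."""
    for block in range(blocks):
        yield f"drive{drive}:block{block}"


def round_robin(tasks: List[Iterator[str]]) -> List[str]:
    """Give each active task one slice in turn until every task has finished."""
    out: List[str] = []
    active = list(tasks)
    while active:
        still_running = []
        for task in active:
            try:
                out.append(next(task))       # run one slice of this drive's work
                still_running.append(task)
            except StopIteration:
                pass                         # this drive's tape is finished
        active = still_running
    return out


if __name__ == "__main__":
    # Eight drives with tapes of different lengths, serviced in interleaved slices.
    drives = [read_tape(n, blocks=2 + n % 3) for n in range(8)]
    for line in round_robin(drives):
        print(line)
```

No drive waits for another to finish its whole tape, so a slow or long tape does not hold up the other seven.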

On a different project, fielded testers required site visits by technicians in order to update software. With 10,000+ testers fielded, this was a considerable expense.

To reduce these expenses, David led the team in the creation of a distributed processing client-server automatic dial and update system. Each PC in the engineering department would take a phone number from a central server, attempt to connect to the remote tester, upload new software, verify that the upload worked, and report the status back to the server. The testers automatically woke up every night and accepted incoming calls during a pre-set time window; the engineering department’s machines woke up and automatically updated them. After a set number of failed attempts, field service was dispatched to the non-updated sites to update them manually and find out why they could not connect. Ninety-five percent of software updates no longer required a site visit. Of the ~4000 testers sold to check stations in one state, 3800 were updated in one week’s time for the cost of some long-distance charges.
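The take-a-number, try, report, retry loop described above can be sketched as a simple work queue. This is a single-process Python illustration, not the original system, which ran on many PCs pulling numbers from a central server over the network. `MAX_ATTEMPTS`, `run_updates`, and the injected `try_update` callable are hypothetical names, and the retry limit of 3 is a guess, since the text says only "a set number of failed attempts."

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple

MAX_ATTEMPTS = 3   # hypothetical; the actual retry limit is not stated


def run_updates(phone_numbers: Iterable[str],
                try_update: Callable[[str], bool],
                max_attempts: int = MAX_ATTEMPTS) -> Tuple[List[str], List[str]]:
    """Work through the call list and return (updated, needs_site_visit)."""
    queue: Deque[Tuple[str, int]] = deque((number, 0) for number in phone_numbers)
    updated: List[str] = []
    needs_site_visit: List[str] = []
    while queue:
        number, failures = queue.popleft()
        if try_update(number):                      # dial, upload, verify (stubbed)
            updated.append(number)
        elif failures + 1 >= max_attempts:
            needs_site_visit.append(number)         # hand off to field service
        else:
            queue.append((number, failures + 1))    # retry after the others
    return updated, needs_site_visit


if __name__ == "__main__":
    def always_answers(number: str) -> bool:
        """Stand-in for a real modem session; every call succeeds."""
        return True

    print(run_updates(["555-0001", "555-0002"], always_answers))
```

Passing the dialing logic in as `try_update` keeps the queue handling testable without a modem. For scale, 3800 of roughly 4000 testers is 95 percent, which matches the share of updates quoted above that no longer needed a site visit.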

This was pre-public internet, when the best modem speed was 9600 baud!