2015-03-26

Troubleshooting Serial Communications

After two months of troubleshooting, countless gallons of coffee, endless hours of stress, several years off my life, and many square inches of cranial real estate lost to newfound baldness, I finally solved a longstanding, serious blocking bug with a single line of code. I don't know how useful the particulars might be to anyone who stumbles across this posting, but the thought process and methods used to narrow in on a solution might help other fledgling developers by giving them new angles from which to view problems. In any case, I want to document the process for myself, because it has been a very painful ride, and one I'd very much like to think I've learned from.

By way of context, I'm working on enabling two-way communication over serial COM ports using the Windows API. One application, written in C++, needs to communicate with another, written in C#. Both applications existed before this new use case was devised, and both applications were previously proven to successfully communicate over COM ports. However, when the two applications were set up to communicate with each other, the communications failed 100% of the time. I had to figure out why.

It's worth mentioning here that prior to finding myself attached to this project, I'd never been exposed to many of the concepts and technologies involved. I have some limited experience with C++, for instance, but not in a large-scale production environment. I've never worked with serial communications before, and never used the Windows API for C++, which is not as friendly to newcomers as other libraries can sometimes be. Some complexities were also encountered in the realm of threading, which I have some experience with, but not in the Windows environment. A big part of the reason this solution took two months is that I had to learn all of these concepts in parallel in order to start focusing in on the location and nature of the problem.

The first thing I did was to have someone walk me through the existing functionality and where the breakdown was occurring. It was explained and demonstrated to me that the client application, written in C++, was able to send a message to the server application, written in C#. The server reported successful receipt and parsing of the message, and reported a consistent attempt to respond. After that, there would be silence from both applications. The client reported that it was listening for a response from the server, but no response would arrive - or at least, the client did not report receiving it.

So began my adventure. The first thing I did was to look at the client side application and implement some kind of logging functionality. The existing code, having been whipped up in an apparent panic on a tight deadline, provided no logging - in fact, no output of any kind was visible to the user, and no comments or documentation were given in the code to help those who would later have to maintain or extend the application.

It took me several days just to implement logging, and there are a couple of reasons for this. First, as mentioned above I had never worked with C++ before, and learning the ropes on that front is not trivial, especially where file access and streams are concerned. But the bigger stumbling block in this particular case was the unorthodox use of the standard output buffer.

The original author of the client side software had been intercepting all standard output and copying it to a secondary buffer, where it was stored and later passed over a named pipe for use as input to an integrated service. Since there was no documentation of this method of message passing, it took many long, confusing hours and lots of help from more experienced developers for me to come to the understanding that every time I tried to use console output for debugging purposes, it was breaking integrated services because they were not expecting my debug messages to come in over the pipe! So console logging was out, and I had to do any and all logging strictly to disk (or to debug output in Visual Studio, where execution in debug mode was permissible), where the log file could later be read and investigated. Whew...

In any case, I did manage to get a logging feature set up and then it was off to the debugging phase. I set up logging output to report on program status throughout the execution of the client side code, as well as setting breakpoints to investigate the state of suspicious variables. This yielded some curious information.

The first thing I tried was to check the ReadFile(...) command on the client for errors. It turned out that the call to ReadFile(...) was returning error code 0, with a timeout. In other words, no errors were being thrown. The ReadFile(...) call was successful, and all arguments passed to it were valid, but nothing was being received before the given timeout parameter was met. I tried setting the timeout to infinite, and this resulted in the ReadFile(...) command looping, as you might expect, infinitely.

Since no error was being thrown, I assumed that the port was open, and that the client had successfully connected to it. This suspicion was reinforced by the fact that expected data had been verified arriving at the server, as explained above. However, just as a sanity check, I set breakpoints to observe the definition and state of the file handle used by the Windows API to represent and access the COM port itself. I verified, in this way, that the serial port handle carried the same definition and state when listening for a server response, as it did when it was sending a message to the server. As far as the client application was concerned, it was a simple matter of no data coming in over the seemingly valid port.

More sanity checks came over the following weeks. I started building simplified sandbox applications with the dual-purpose of learning, so I would feel more comfortable with the Windows API and COM port communications in general, and also to verify that the code worked as expected, both on the C++ and the C# sides. I built a simple C++ program whose only mission in life was to contact the server with a hardcoded, known-good message, and listen for a (and report the receipt of) a known-good response. It worked! This was my first sigh of relief, but didn't yield any immediate solution.

Keeping my momentum up, I built a simulated server with the same strategy in mind, just to ensure that there wasn't some idiosyncrasy in the existing server code that made it behave oddly. As expected, the simplified, standalone sandbox server I whipped up worked. My sandbox C++ client was able to contact my sandbox C# server over the expected COM port; the C# server responded over the same port; and the C++ client received and reported the server's response. Everything worked! Unfortunately, also as expected, the simple sandbox server also behaved exactly the same as the real server when used in conjunction with the real C++ client.

I felt I was back at square one. I had tried lots of things, and verified little more than the simple fact that everything should have been working! All the C++ code I wrote worked, and all the C# code worked. The ReadFile(...) and WriteFile(...) calls and even the CreateFile(...) calls, and all parameters passed to these functions, were identical - as far as I could tell. I even went so far as to duplicate the code I was using in the sandbox to build the COM port in the production application. This still did not avail.

Then (this morning) something interesting happened. I had been going back and forth between testing my sandbox apps and their production counterparts, and I realized that after running my sandbox app successfully any number of times, any failed execution of the production client permanently broke the COM port, even for future attempts at running the supposedly known-good sandbox app! This didn't make much sense to me, but I also stumbled across the fact that running a C# version of the sandbox client seemed to repair the COM port. Something was so dramatically different between the COM port setup in the C++ and C# applications that not only did the C# application run more reliably, it actually repaired the damage done by the production client, that the sandbox C++ client wasn't able to address on its own.

I did a side-by-side comparison of the code in the C++ and C# applications to see how they set up the COM port (yes, even though I had written both). I saw that the examples I had followed in building the C++ application had only set a few of the DCB parameters during initialization of the COM port. (DCB is a struct in the Windows API that contains all the data needed to manage a COM port.) It only set up the parameters needed to establish a COM port connection under known good conditions. Since this had been working for me under most test conditions, and it hadn't even occurred to me that there were parameters that weren't being set one way or another, I didn't think to look there. And it turned out that yes, there were DCB parameters that I hadn't even known were going uninitialized, because most of the time I didn't need them.

The C# application, on the other hand, had been developed based on a more thorough set of tutorials and examples, and included full initialization of all DCB parameters; most importantly, the DTR (Data Terminal Ready) and the RTS (Request to Send) flags. Setting both of these explicitly to enabled or handshake mode, rather than the default disabled mode, suddenly fixed everything.

Now I felt a broad spectrum of mixed emotions. On the one hand, I was elated that a two-month troubleshooting session had finally, finally, finally come to an end. The problem was solved, and all would rejoice. On the other hand, it took me two months to solve the problem, and there was no rejoicing in that, especially given the simplicity of the fix, which was not even a one-liner. I only had to add a couple of arguments to an existing function call, it was that simple.

No one else on the project seems terribly upset with me, though. After all, each time I asked for help, more experienced developers did not see the solution either, or have a particular idea of where to focus the search. All of us went on many wild goose chases trying to pin this problem down, and it's a relief to have it behind us. Also on the bright side, though the wild goose chases did not, for the most part, yield anything beyond validation of pre-existing suppositions, they did teach me a lot about new ways to troubleshoot, debug, and test assumptions to ensure that you do know what you think you know.

Thanks for reading!
- Steven Kitzes

No comments: