When something seemingly designed well still has a problem.

Entry posted by Mira Yurizaki May 5, 2018


With yet another security bug found in processors, one has to wonder how anyone could have let it through for so long. People like to think there's incompetent engineering out there, and sure, it exists. What people don't see are the designs that look sound even to someone with all the knowledge and experience in the world, right up until they're run in the real world. So I have an example of one. I love to share it, partly out of pride (I was a junior developer who found a bug in code designed by senior developers, showing that even people with 5-10 years of experience can make mistakes), and partly because it illustrates the point well.

A description of the system

I was working on a system that consisted of a main controller unit and several wireless sensors. We had a rule with wireless communication: we had to assume it's unreliable, even if it appears reliable 99.9999% of the time. This required that if a device transmits something, the recipient has to acknowledge it by sending an ACK. If the transmitter doesn't receive an ACK within some time, it retries sending the message. After three retries, the device gives up on sending the message.
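As a rough sketch of that rule (not the original code), the send-and-retry loop could look like the following. The function and parameter names are made up for illustration, the timeout borrows the 100 ms retry time listed further down, and "three retries" is assumed to mean one initial attempt plus three more.

```python
from typing import Callable

MAX_RETRIES = 3       # from the rule above: give up after three retries
ACK_TIMEOUT_S = 0.1   # assumption: same as the 100 ms retry time


def send_with_retries(transmit: Callable[[bytes], None],
                      wait_for_ack: Callable[[float], bool],
                      message: bytes) -> bool:
    """Send `message` until it is ACKed or the retries run out.

    `transmit` and `wait_for_ack` are hypothetical stand-ins for the radio
    driver. Returns True if an ACK arrived, False if the sender gave up.
    """
    # One initial attempt plus up to MAX_RETRIES retries (an assumption).
    for _attempt in range(1 + MAX_RETRIES):
        transmit(message)
        if wait_for_ack(ACK_TIMEOUT_S):
            return True
    return False


# Example: a link that loses the first two attempts and ACKs the third.
sent = []
acks = iter([False, False, True])
ok = send_with_retries(sent.append, lambda timeout: next(acks), b"reading")
```

Here `ok` ends up true after three transmissions. On a link that never ACKs, the function returns false after four transmissions.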

To handle this in software, we used a state machine. I forget the exact details, but this is what it looks like more or less on the transmitter side.

This particular style of state machine is called a hierarchical state machine. The lines with arrows represent state transition with the event that triggers it.

The default state is "Tx Idle"
When it gets a "send message" trigger, it transitions to the "Tx Busy" state.
After the message is sent, it goes to the "Waiting for ACK" state.
- This is a sub-state of the "Tx Busy" state because until the last request has been ACK'd, the transmitter won't transmit another message.
If another message request comes in while in the "Tx Busy" state or its substates, it gets queued.
If an ACK wasn't received in time, it moves up to the "Tx Busy" state again as the message is sent again.
If an ACK was received or the message was retried enough times, it goes back to the "Tx Idle" state.
If the system needs to send an ACK for any reason, the system immediately moves to the "Tx Busy" state.
- I forget the exact detail of this mechanism, but sending an ACK basically had priority over anything else this thing was doing.

A buffer was included to queue up any messages (except ACKs) that needed to be sent if the hardware transmitter was busy sending something.
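To make the list above concrete, here is a minimal sketch of the transmitter state machine as described. It is a reconstruction from the description, not the original code: the class, the event method names, and the queue capacity are all hypothetical. It also assumes the next queued message starts as soon as the previous one finishes, and that a finished ACK with no message in flight leads back to idle.

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    TX_IDLE = auto()
    TX_BUSY = auto()          # the hardware transmitter is sending something
    WAITING_FOR_ACK = auto()  # drawn as a sub-state of "Tx Busy"


class Transmitter:
    MAX_RETRIES = 3
    QUEUE_CAPACITY = 4  # hypothetical size; the real one isn't given

    def __init__(self):
        self.state = State.TX_IDLE
        self.queue = deque()
        self.current = None   # message currently in flight
        self.retries = 0
        self.wire = []        # stand-in for the hardware transmitter

    def send_message(self, msg):
        if self.state is State.TX_IDLE:
            self._start(msg)
        else:
            # "Tx Busy" or its sub-state: queue it. A full queue trips an assert.
            assert len(self.queue) < self.QUEUE_CAPACITY, "transmit queue full"
            self.queue.append(msg)

    def send_ack(self):
        # ACKs skip the queue and grab the transmitter immediately.
        self.wire.append("ACK")
        self.state = State.TX_BUSY

    def tx_done(self):
        if self.state is State.TX_BUSY:
            if self.current is None:
                self._next()  # only an ACK was sent (assumed behavior)
            else:
                self.state = State.WAITING_FOR_ACK

    def ack_received(self):
        if self.state is State.WAITING_FOR_ACK:
            self._next()

    def ack_timeout(self):
        if self.state is not State.WAITING_FOR_ACK:
            return
        if self.retries < self.MAX_RETRIES:
            self.retries += 1
            self.wire.append(self.current)  # resend the same message
            self.state = State.TX_BUSY
        else:
            self._next()                    # give up on this message

    def _start(self, msg):
        self.current, self.retries = msg, 0
        self.wire.append(msg)
        self.state = State.TX_BUSY

    def _next(self):
        self.current = None
        if self.queue:
            self._start(self.queue.popleft())
        else:
            self.state = State.TX_IDLE
```

Note that `send_ack` is the only event that changes state without checking what the machine is currently doing.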

The problem: The message queue gets too full and breaks

The problem surfaced when a project manager working on the system with us was doing a random test on his own. The system had 8 nodes that needed to transmit and receive data back and forth with a main unit. He invoked all of the nodes, causing them to flood the main unit with messages that needed to be handled. If he did this long enough, the system would basically stop and "hang." There was a queue for requests in the state machine, and if another request came in while the queue was full, it would trigger this behavior. It wasn't a bug in the sense of hitting some overflow case; the system was failing an assertion check.

My investigation led to the cause being that the number of requests coming into the transmission queue was outpacing how fast this state machine could go through it.

While I'll go over what happened, I want you to think about what the solution would be. You don't have to make a comment but stew on it. Just so you're not going blind, here are the parameters you'll be working with:

The hardware this ran on at the time was an OMAP 3430. For those that don't know their SoCs, this was the same one that powered the Motorola Droid
The devices connect through a ZigBee wireless network system. Unlike say Wi-Fi, ZigBee uses a mesh topology. This allows a device to only send data to the closest one, which will then send it to the next closest one until the ZigBee coordinator (the equivalent of a router in Wi-Fi) is within range.
The ZigBee coordinator is within the main controller unit and communicates to the main board over a serial line at 115200 baud (or about 115.2 kbps)
The messages were at most 300 or so bytes in length.
Retry time is 100 milliseconds.
At the time this problem happens, the system appears more or less fine (i.e., retries aren't piling up)
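For a sense of scale, here is a quick back-of-the-envelope calculation from those numbers. It assumes standard 8N1 serial framing (10 bits on the wire per byte), which the list doesn't actually state, and it ignores any protocol overhead.

```python
BAUD = 115_200               # serial line speed from the list above
BITS_PER_BYTE_ON_WIRE = 10   # assumption: 8N1 framing (start + 8 data + stop)
MESSAGE_BYTES = 300          # worst-case message size from the list above
RETRY_TIME_S = 0.100         # retry time from the list above

bytes_per_second = BAUD / BITS_PER_BYTE_ON_WIRE         # 11520 bytes/s
message_time_s = MESSAGE_BYTES / bytes_per_second        # about 0.026 s
messages_per_second = 1 / message_time_s                 # about 38.4
messages_per_retry_window = RETRY_TIME_S / message_time_s  # about 3.8

print(f"{message_time_s * 1000:.1f} ms per max-size message")
print(f"{messages_per_second:.1f} max-size messages per second")
print(f"{messages_per_retry_window:.1f} max-size messages per retry window")
```

So under these assumptions a full-size message occupies the serial line for roughly 26 ms, a bit over a quarter of the retry time.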

Spoiler

If you thought...

"Increase the size of the message queue," then that won't work. Why? As my lead said: if you're overflowing your transmission queue, increasing its size only delays the problem from happening.
"The hardware isn't capable enough," then that doesn't make sense either. The application isn't that complicated and, frankly, the hardware was overkill for the app it was running. The hardware had to do other things as well, so that's what they went with.
"The transmission rate is too slow," then no: the ZigBee network isn't exactly fast, but if the transmission speed were the bottleneck, the network would be choking. It wasn't; we weren't detecting anything odd with the network, and the ZigBee coordinator was fine.

Note: I understand you don't have the complete picture, so don't take the responses here too personally.

The root cause: There's an issue with the ACKing system

The problem lies with the priority given to transmitting ACKs. The reason for having a "Tx Busy" state in the first place is not really a courtesy, but that the serial lines are asynchronous. That is, once we feed the serial line some data and tell it how much there is, it takes care of the rest and the application is free to do other things. The state machine waits for the serial line to say "Okay, I'm done" before moving to the next state. However, whenever a "send ACK" request comes in, it gets sent regardless of what's going on.

Because of the way ACKs shortcut the process, they constantly keep the serial line busy. This can unintentionally introduce a stall in the state machine where it never gets to the "Waiting for ACK" state. Or rather, it gets there, but it's constantly pulled away from it. To put it another way, let's say you need to talk to someone, but there are other people with higher priority than you who are allowed to interrupt you whenever they like. So whenever one of these higher-priority people comes in, they butt in, speak to the person, and leave. But there are a ton of these people, and eventually your request never gets served (and you'll feel like punching one of these higher-priority people).

(Note: I don't recall exactly how the serial line behaved on the main unit, so there are some holes in the explanation here that I can't fill.)

The solution is to defer all transmission requests until the transmission in progress is completed. So now the state machine looks something like this:
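In code terms, the difference between the two designs can be illustrated with a toy model. This is not the real system, since the exact serial-line behavior isn't recalled. It assumes a message always waits in the queue, that each data message needs a few uninterrupted time slots (send plus wait for its ACK), and that an outgoing ACK request takes one slot. In the original design the ACK request takes over immediately and the message in flight loses its progress. In the fixed design the ACK request waits until the message is done. All names and numbers are hypothetical.

```python
def completed_messages(defer_acks: bool, slots: int = 60,
                       ack_every: int = 3, msg_slots: int = 3) -> int:
    """Count data messages that finish within `slots` time slots.

    defer_acks=False models the original design (ACKs interrupt at once);
    defer_acks=True models the fix (ACKs wait for the message in flight).
    """
    pending_acks = 0
    progress = 0   # slots already spent on the current data message
    done = 0
    for slot in range(slots):
        if slot % ack_every == 0:
            pending_acks += 1            # a "send ACK" request arrives
        mid_message = progress > 0
        if pending_acks and not (defer_acks and mid_message):
            # The ACK takes the line now. If a message was in flight,
            # its progress is thrown away and it has to start over.
            pending_acks -= 1
            progress = 0
        else:
            progress += 1
            if progress == msg_slots:    # message sent and acknowledged
                done += 1
                progress = 0
    return done


print("ACKs interrupt:", completed_messages(defer_acks=False))
print("ACKs deferred: ", completed_messages(defer_acks=True))
```

With these made-up numbers the interrupting version never completes a single message, while the deferring version keeps making progress under exactly the same load, which matches the "constantly pulled away" behavior described above.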

The fun part was the original state machine was also used in a few other places where some sort of communication with another device was happening. As you can imagine, this fix had to be propagated to various other parts of the system. And not only that, but we already had documentation with these state machine diagrams and such, so those had to be updated.

So remember: just because something looks sound, doesn't mean it's bulletproof. If you want to critique a huge issue cropping up, you're free to do so as long as you understand most of the time, these things go overlooked because they're not readily visible.


2 Comments

TopHatProductions115 12,305

Posted July 30, 2018

Can ACK be handled similarly to an interrupt on x86/PIC, or would that be bad? I'm just a novice here...

Mira Yurizaki 13,204

It was being handled that way on the application side, in that outgoing ACK requests would immediately cause the state machine to go back to "Tx Busy" and never really get to "Waiting for ACK."

Basically, I believe the solution was more or less to put sending ACKs on the same "priority" as retrying the message. It's important to send ACKs out, but it's equally important to retry the message.

Yurizaki's Tech Ramblings
