Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLE APIs are not thread safe #42

Closed
filipnavara opened this issue Dec 28, 2017 · 6 comments
Closed

BLE APIs are not thread safe #42

filipnavara opened this issue Dec 28, 2017 · 6 comments

Comments

@filipnavara
Copy link

The BTstack APIs used to implement the BLE wiring/HAL API are supposed to be single-threaded. The HAL code, however, spawns a thread that handles the BTstack execution loop. This can be done, but only when certain rules are followed, which is not the case.

Currently the APIs such as "ble.startAdvertising()" are documented to be safe to be executed in main thread in a pattern like

ble.init();
// do some setup
ble.startAdvertising();

This, in turn, triggers the calls to hal_btstack_init (for ble.init) for BTstack to be initialized, which spawns a thread at the end. By the time hal_btstack_startAdvertising (for ble.startAdvertising) is called the thread is already running and continuing the Bluetooth initialization (often the key exchange as part of BTstack's sm_run function).

Since the APIs are not thread safe one of the following things can occur:

  • hal_btstack_startAdvertising calls gap_advertisements_enable(1). It sets the necessary flags and proceeds to call hci_run. hci_run immediately quits on the if (!hci_can_send_command_packet_now()) return; condition because the worker thread is busy sending key exchange packets. In turn, the advertisements are never enabled and there's no error reported.
  • hal_btstack_startAdvertising calls gap_advertisements_enable(1). It sets the necessary flags and proceeds to call hci_run. hci_run goes past the if (!hci_can_send_command_packet_now()) return; check. Now depending on the thread interleaving one of these things can occur:
    • The sm_run key exchange on the worker fails because it can't send a packet now.
    • Both the worker thread and application thread pass through the hci_can_send_command_packet_now checks and end up messing up the command buffers and sending gibberish to the HCI USART communication port.

In some cases, the thread interleaving happens to be lucky and everything succeeds in the right order.

There are some ways to solve the problem, but each of them has its own drawbacks. These are the ones that I have come up with:

  • Do not spawn the BLE worker thread and introduce a ble.waitForEvents() API that will be run in the application thread.
    • Pros: Easy to implement. Easy to reason about thread safety, since BTstack would be run in a single thread as intended.
    • Cons: Breaking API change. Harder to manage for more complex applications, especially when power management gets involved. The BLE provisioning code would need to be changed.
  • Make the BLE worker thread spawning explicit by introducing a ble.startWorker() API. Allow all BLE API calls to be made on the application thread until the startWorker API is called. After that the APIs could only be called in the BLE timers and callbacks, with the exception of ble.deInit() (or possibly ble.stopWorker()). It can be combined with the ble.waitForEvents() API above to offer an alternative and introduce more flexibility for the application writers.
    • Pros: Easy to implement. Very deterministic and in line with how BTstack's official examples are written.
    • Cons: Breaking API change. Requires more understanding by the application developers (possibly a good thing).
  • Create an event queue on the worker thread and process the APIs on the queue along with the BTstack main look.
    • Pros: The APIs remain the same. It doesn't block any application code by delays.
    • Cons: Harder to implement since almost every BLE HAL API would have to be rewritten to be queued (possibly something could be achieved by C macros). Introduces memory management for the queue, which has to handle more states.
  • Implement a synchronization mechanism between the worker thread and the application thread in a way that the pauses the worker thread when an API is being executed and resumes it after. This could be achieved by a mutex that is held on the worker thread and released/reentered at a specified point. The APIs would have to enter the mutex, call the BTstack API and release the mutex, thus allowing the worker thread to proceed.
    • Pros: Easy to implement. No additional memory management.
    • Cons: Feels like a hack. May block the application thread when the BTstack thread is busy processing packets.

Note that this is very much not a theoretical problem. It keeps happening on my initialization sequence in various different ways. I spend days analyzing what is going wrong and I will happily provide more information and relevant logs. The race conditions often manifest as "invalid packet type" or "packet timeout" errors in the HCI code in hci_transport_h4_wiced.c.

@Danappelxx
Copy link

This is a pretty serious issue. Since it's doesn't seem to be fixed in the firmware, what would be a "good enough" workaround on the application side? Wrap all ble calls in a 0-delay timer? Add a simple delay(<some reasonable amount>) after ble.init?

@filipnavara
Copy link
Author

Unfortunately due to the way the callback registration works there's no reliable workaround from application code. You cannot register callbacks before ble.init because it resets them. At the same time it also starts the Bluetooth module and the background thread, so by registering them later you may miss some of the events. There are ways to call the ble functions on the right worker thread, but none of them are reliable during the initialization.

I managed to fry one of my boards by accident so I don't have a way to test this anymore.

@Danappelxx
Copy link

Appreciate the quick response. So you're saying there's not really a way to prevent race conditions? That's pretty unfortunate...

@filipnavara
Copy link
Author

I recompiled my firmware to solve this, but obviously that's not a solution for everyone. There were other problems which I submitted pull requests for (#40, #41) and the latest version of the firmware solves at least that bits. Then RedBear was acquired by Particle and this whole repository was scheduled to get merged into the official Particle firmware (particle-iot#1505). So far that didn't happen. My hope was that the BLE API gets more attention as Particle is going to release their own Bluetooth products (https://www.particle.io/mesh/).

@Danappelxx
Copy link

Danappelxx commented Jul 28, 2018

Are your modifications by any chance open source? I’m already recompiling the firmware to get around other limitations (#45) so it wouldn’t be a huge leap for me.

Edit: Also, I mirror your wishes for the future of particle Bluetooth, hopefully it gets more attention!

@filipnavara
Copy link
Author

I am afraid I never pushed that branch to my GitHub, so it's probably lost for good :-/ I tried all the three approaches above, but in the end I settled on implementing ble.waitForEvents and getting rid of the worker thread. It was by far the easiest one to implement and to reason about its correctness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants