BLE busy vs. throughput


NiLe_4796031

I'm having trouble getting a sustained 1.2 Mbps from BLE on the PSoC 63 dev kit. I'm setting up a connection with the 2M LE PHY and running this hot loop to push out notification packets as quickly as possible:

    static uint32 buffer[NOTIFY_MAX_LEN];  // NOTIFY_MAX_LEN is in bytes, but we allocate uint32 here for alignment
    memset(buffer, 0, NOTIFY_MAX_LEN);
    notify_packet.handleValPair.value.len = NOTIFY_MAX_LEN;
    notify_packet.handleValPair.value.val = (uint8_t *)buffer;

    uint32_t *packetno = &buffer[0];  // packet counter, sent at the start of the payload
    uint32_t *busyno = &buffer[1];  // 4 bytes into buffer; busy polls seen before this packet

    while (1) {
        Cy_BLE_ProcessEvents();
        if (sampling) {
            if (Cy_BLE_GATT_GetBusyStatus(notify_packet.connHandle.attId) == CY_BLE_STACK_STATE_FREE) {
                // Stack is free: queue the next notification.
                cy_en_ble_api_result_t api_result = Cy_BLE_GATTS_Notification(&notify_packet);
                if (api_result != CY_BLE_SUCCESS) {
                    CY_ASSERT(false);
                }
                ++*packetno;
                *busyno = 0;
            } else {
                // Stack is busy: count the poll and try again.
                ++*busyno;
            }
        }
    }

I'm getting timing problems correlated with 'busy' responses from the BLE stack. Each line of the following data shows a timestamp (seconds within the current wall-clock minute), then p: packetno and b: busyno.

29.5292574 p: 0  b: 0

29.5322507 p: 1  b: 0

29.5352564 p: 2  b: 0

29.5372934 p: 3  b: 0

29.5412882 p: 4  b: 59128

29.5432429 p: 5  b: 0

29.5462694 p: 6  b: 3230

29.5492359 p: 7  b: 0

29.5522354 p: 8  b: 5328

29.5552358 p: 9  b: 0

29.5582419 p: 10  b: 3230

29.5602476 p: 11  b: 0

29.5642403 p: 12  b: 5278

29.5662404 p: 13  b: 0

29.5692399 p: 14  b: 3229

29.5722412 p: 15  b: 0

29.5752360 p: 16  b: 5321

29.5782358 p: 17  b: 0

29.5812354 p: 18  b: 3230

29.5872353 p: 19  b: 0

29.5912468 p: 20  b: 5072

29.6480122 p: 21  b: 0

29.6508323 p: 22  b: 6238

29.6528226 p: 23  b: 0

29.6568276 p: 24  b: 54304

29.6590555 p: 25  b: 0

29.7081333 p: 26  b: 3229

29.7109300 p: 27  b: 0

29.7138875 p: 28  b: 46741

29.7158918 p: 29  b: 0

29.7189114 p: 30  b: 3217

29.7219352 p: 31  b: 0

29.7253067 p: 32  b: 5291

29.7282809 p: 33  b: 0

29.7301239 p: 34  b: 3229

29.7331253 p: 35  b: 0

29.7361498 p: 36  b: 5339

29.7391269 p: 37  b: 0

29.7425234 p: 38  b: 3230

29.7445339 p: 39  b: 0

29.7485335 p: 40  b: 5285

You can see at packetno 24 there's a large busyno count of 54304, followed shortly afterwards by a gap of about 0.05 seconds (between packets 25 and 26). Notice how 21 packets were sent during [29.52s .. 29.60s) vs. 5 packets during [29.60s .. 29.70s). The throughput would be fine if not for these occasional large gaps, which reduce it by about 300 kbps overall (average throughput measured over a minute).
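One way to locate these gaps without reading the log by eye is to difference consecutive receive timestamps and flag anything above a threshold. The following is a minimal sketch, not part of the original firmware or client: `find_stalls`, `gap_seconds`, and the threshold value are hypothetical names and choices, and the only assumption taken from the log is that timestamps are seconds within the current minute (so they wrap at 60).

```c
#include <assert.h>
#include <stddef.h>

/* Timestamps are seconds within the current minute, so a negative
 * difference means the clock rolled over into the next minute. */
static double gap_seconds(double earlier, double later)
{
    double gap = later - earlier;
    return (gap < 0.0) ? gap + 60.0 : gap;
}

/* Stores the index of every packet that arrived more than threshold_s
 * after its predecessor. Returns the number of stalls found; only the
 * first max_out indices are written to stall_index. */
size_t find_stalls(const double *timestamps, size_t count, double threshold_s,
                   size_t *stall_index, size_t max_out)
{
    assert(timestamps != NULL && stall_index != NULL);
    size_t found = 0;
    for (size_t i = 1; i < count; ++i) {
        if (gap_seconds(timestamps[i - 1], timestamps[i]) > threshold_s) {
            if (found < max_out) {
                stall_index[found] = i;
            }
            ++found;
        }
    }
    return found;
}
```

Fed the timestamps for p: 19 through p: 27 from the log above with a 20 ms threshold, this flags two stalls, at the entries for p: 21 and p: 26.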

There's a related question, "BLE stack busy prevents notification sending", which suggests changing the queue depth by right-clicking on something, but I've looked in PSoC Creator and there's no such option. cy_ble_stack.h has a comment with the instructions:

     *  To increase the BLE Stack's default queue depth(CY_BLE_L2CAP_STACK_Q_DEPTH_PER_CONN) and achieve better throughput for the attribute MTU greater than 32,

     *  use the AddQdepthPerConn parameter in the 'Expression View' of the Advanced tab in the BLE component GUI. To Access the 'Expression View', right click on

     *  the 'Advanced' tab in th BLE Component GUI and select the 'Show Expression View' option.

but I don't see any Expression View or Show Expression View. There is an advanced tab but no right-click option.

What can I do to improve the throughput of my system? It uses a single connection on the 2M PHY, sends notifications on one characteristic, and runs the BLE stack single-core on the CM4.

1 Solution

Hello,

Please check the points below for your application:

1. Make sure that Cy_BLE_ProcessEvents is called at regular intervals in the firmware. See the API description in the BLE Component documentation for the interval at which Cy_BLE_ProcessEvents must be called. If any custom function takes a long time to execute, call Cy_BLE_ProcessEvents inside it.

2. Ensure that the BLE subsystem (BLESS) interrupt has the highest priority.

3. Check for continuous flash writes while BLE is connected. These can leave BLE events pending. Try performing the flash write only when the BLESS state is CYBLE_BLESS_STATE_EVENT_CLOSE, as reported by the Cy_BLE_StackGetBleSsState() function.
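Point 1 can be sketched as a long-running job that is split into chunks, with the event pump called after each chunk. This is only an illustration under stated assumptions: `checksum_with_event_pump`, `counting_pump`, and the chunk size are hypothetical, and on the PSoC 6 the callback would be a thin wrapper around Cy_BLE_ProcessEvents() rather than the host-side stub shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callback type for the event pump. On the target this would wrap
 * Cy_BLE_ProcessEvents(); here it is kept abstract so the sketch
 * builds on a host machine. */
typedef void (*process_events_fn)(void);

/* Host-side stand-in that only counts how often the pump ran. */
unsigned pump_calls;
void counting_pump(void)
{
    ++pump_calls;
}

/* Hypothetical long-running job: sums len bytes, but yields to the
 * event pump after every `chunk` bytes so pending stack events are
 * never delayed by more than one chunk of work. */
uint32_t checksum_with_event_pump(const uint8_t *data, size_t len,
                                  size_t chunk, process_events_fn pump)
{
    assert(data != NULL && chunk > 0 && pump != NULL);
    uint32_t sum = 0;
    size_t done = 0;
    while (done < len) {
        size_t n = (len - done < chunk) ? (len - done) : chunk;
        for (size_t i = 0; i < n; ++i) {
            sum += data[done + i];
        }
        done += n;
        pump();  /* let the BLE stack process events between chunks */
    }
    return sum;
}
```

The chunk size sets the trade-off: smaller chunks bound the delay between pump calls more tightly at the cost of more call overhead.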

Please let me know if this improves the bandwidth.

Thanks,

P Yugandhar.

7 Replies
VenkataD_41
Moderator

Hi,

1. "but I don't see any Expression View or Show Expression View. There is an advanced tab but no right-click option."

To enable the expression view, go to Tools --> Options --> Design Entry --> Component Catalog and enable Param Edit Views, as shown in the image below:

[Image: expn_view.PNG]

2. What throughput do you get with the out-of-the-box PSoC 6 BLE throughput code example?

https://www.cypress.com/documentation/code-examples/ce222046-psoc-6-mcu-bluetooth-low-energy-ble-con...

3. Are you using our development kits for both Server and Client? Also, what is the distance between the Server and the Client?

Please try increasing the TX output power level and see if there is any improvement. If possible, please attach your Client and Server projects so that we can review the firmware and settings in detail.

Thanks

Ganesh

1. Thank you! I've set the queue depth max to 100 and the behaviour has changed. I'm now seeing long stretches with zero busy counts, but there are still delays without the stack reporting busy:

21.8175873 p: 29  b: 0

21.8205914 p: 30  b: 0

21.8235466 p: 31  b: 0

21.8266331 p: 32  b: 0

21.8520539 p: 33  b: 0

21.8540651 p: 34  b: 0

21.8570655 p: 35  b: 0

21.8600660 p: 36  b: 0

21.8630542 p: 37  b: 0

21.9120541 p: 38  b: 0

21.9150539 p: 39  b: 0

21.9172115 p: 40  b: 0

21.9202087 p: 41  b: 0

21.9230543 p: 42  b: 0

21.9260539 p: 43  b: 0

21.9280571 p: 44  b: 0

21.9716949 p: 45  b: 0

21.9746477 p: 46  b: 0

21.9768011 p: 47  b: 0

21.9798032 p: 48  b: 0

and when the stack does go busy, it goes busy for a long time (subset of packets marked with busy != 0):

22.4665515 p: 104  b: 677682

23.0663724 p: 208  b: 507997

23.6720110 p: 312  b: 511308

24.3261274 p: 416  b: 555815

25.0486755 p: 520  b: 618584

25.4658381 p: 624  b: 339045

26.0515141 p: 728  b: 444960

26.6061265 p: 833  b: 511021

27.1727199 p: 937  b: 476721

27.7750679 p: 1041  b: 508591

Ultimately I'm getting the same throughput for a minute long test run.

2/3. I think we need two kits to run the benchmark test? I've looked at its code closely and copied everything that looked even plausibly relevant. I've ordered a second kit to run the test with; it should arrive later this week.

The client is a Windows 10 laptop (Dell XPS 15 7590 -- Killer AX1650[1]) which is about 25cm away from the PSoC 6 dev kit. I tried turning up the connection TX power from default 0 dBm to 4 dBm and didn't notice any difference.

Note that the messages are only notifications so there shouldn't be any resending, and the packet numbers ("p:") I'm quoting are in the notification data itself, so any dropped messages would show a skipped counter. I would expect TX power problems to cause dropped packets, not delayed packets? Or is there some other bidirectional communication between the two parties that is renegotiating the channel as long as new data is being transmitted?

[1] Dell support calls it a "Killer 1650x", and I can't find any evidence that Killer supports BLE as opposed to just BT, but I can't find any other communication module on this laptop, so I assume that's it?


"2/3. I think we need two kits to run the benchmark test? I've looked at its code closely and copied everything that looked even plausibly relevant. I've ordered a second kit to run the test with; it should arrive later this week."

Here's the output with the original code:

*****************CE222046: PSoC 6 MCU BLE Throughput Measurement *****************
Role : Client (GATT IN)
**********************************************************************************

Scanning for GAP Peripheral with address: 00:A0:50:AA:BB:FF

Found target device with address: 00:A0:50:AA:BB:FF

Scan stopped as device was found. Initiating Connection...

Connected to Device

Throughput is: 1062 kbps.
Throughput is: 1177 kbps.
Throughput is: 1162 kbps.
Throughput is: 1131 kbps.
Throughput is: 1203 kbps.
Throughput is: 1187 kbps.
Throughput is: 1185 kbps.
Throughput is: 1218 kbps.
Throughput is: 1190 kbps.
Throughput is: 1179 kbps.
Throughput is: 1185 kbps.
Throughput is: 1157 kbps.
Throughput is: 1203 kbps.
Throughput is: 1197 kbps.
Throughput is: 1137 kbps.
Throughput is: 1191 kbps.
Throughput is: 1208 kbps.
Throughput is: 1128 kbps.
Throughput is: 1228 kbps.
Throughput is: 1186 kbps.
Throughput is: 1161 kbps.
Throughput is: 1212 kbps.
Throughput is: 1116 kbps.
Throughput is: 1181 kbps.
Throughput is: 1199 kbps.
Throughput is: 1152 kbps.
Throughput is: 1155 kbps.
Throughput is: 1194 kbps.
Throughput is: 1215 kbps.
The first obvious difference is that I'm running it single-core on CM4. I took CE222046_GATT_Out and set the BLE block to CM4 and replaced main_cm0p.c's main with:

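int main(void)
{
    cy_en_ble_api_result_t apiResult;  /* left over from the example; unused here */

    __enable_irq(); /* Enable global interrupts. */

    /* Start the CM4, which now runs the BLE stack and the application. */
    Cy_SysEnableCM4(CY_CORTEX_M4_APPL_ADDR);

    /* The CM0+ has nothing else to do, so it sleeps until an interrupt arrives. */
    while (1) {
        Cy_SysPm_CpuEnterSleep(CY_SYSPM_WAIT_FOR_INTERRUPT);
    }
}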

and that causes a slightly lower bandwidth:

Scanning for GAP Peripheral with address: 00:A0:50:AA:BB:FF

Found target device with address: 00:A0:50:AA:BB:FF

Scan stopped as device was found. Initiating Connection...

Connected to Device

Throughput is: 819 kbps.

Throughput is: 1135 kbps.

Throughput is: 1177 kbps.

Throughput is: 1141 kbps.

Throughput is: 928 kbps.

Throughput is: 977 kbps.

Throughput is: 991 kbps.

Throughput is: 994 kbps.

Throughput is: 938 kbps.

Throughput is: 966 kbps.

Throughput is: 971 kbps.

Throughput is: 968 kbps.

Throughput is: 1001 kbps.

Throughput is: 926 kbps.

Throughput is: 1025 kbps.

Throughput is: 1012 kbps.

Throughput is: 976 kbps.

Throughput is: 1010 kbps.

Throughput is: 1038 kbps.

Throughput is: 1034 kbps.

Throughput is: 1050 kbps.

Throughput is: 1054 kbps.

Throughput is: 1079 kbps.

Throughput is: 1075 kbps.

Throughput is: 1081 kbps.

Throughput is: 1116 kbps.

Throughput is: 1071 kbps.

Throughput is: 1089 kbps.

Throughput is: 1111 kbps.

Throughput is: 1099 kbps.

but still better than I'm measuring through to my laptop.


I tried switching this around a bit and shrank the notification payload down to 1 byte of user data. To my surprise, the blocks of time where nothing is being transmitted still remain despite the lower total bandwidth!

At this stage, I'd like to know what the CPU is doing that corresponds to those gaps. Even just a lot of samples of PC at random times would suffice. I've been unable to figure out how to do that. I'm looking into two approaches and would appreciate any input on either:

1. Software: use a timer-driven interrupt and attempt to read out the thread-mode PC from the interrupt handler. Is the previous PC in the LR register in the interrupt handler? How do I read it?

2. Hardware, such as ETM or ITM. I don't have any ARM/Cortex debug cables. I have a Saleae logic analyzer (reads voltages only, cannot transmit) and a "JTagulator". Do I need to buy a ULINKpro, or is there something simpler I can use?


"If possible please attach your Client and Server projects for us to review the firmware and settings in detail."

If you send me your GitHub username, I can give you read access to the client project.


I found where the ARM CPU tick counter is (Cy_SysTick, right there in the documentation) and I've started trying to use it to see why I'm getting the stuttering of my BLE notifications.

Here's a sample where p is the packet number (assigned on the sender, not the recipient!) and c is the number of CLK_CPU ticks that elapsed across ProcessEvents():

20.1294059 p: 122  b: 0 c: 131

20.1323932 p: 123  b: 0 c: 131

20.1386672 p: 124  b: 0 c: 131

20.1416650 p: 125  b: 0 c: 131

20.1436711 p: 126  b: 0 c: 10538

20.1473897 p: 127  b: 0 c: 131

20.1500962 p: 128  b: 0 c: 131

20.1520245 p: 129  b: 0 c: 131

20.1970696 p: 130  b: 0 c: 131

20.2000686 p: 131  b: 0 c: 131

20.2031167 p: 132  b: 0 c: 131

20.2054123 p: 133  b: 0 c: 131

20.2084192 p: 134  b: 0 c: 131

20.2110695 p: 135  b: 0 c: 131

20.2144200 p: 136  b: 0 c: 131

As you can see, there's a large # of cycles at p=126 but no delay before it. Conversely, there's a long delay between p=129 and p=130 with no CPU cycles spent, at least not in ProcessEvents.
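For reference, the c values come from differencing two reads of the tick counter. The Cortex-M SysTick is a 24-bit down-counter that restarts from its reload value after reaching zero, so the subtraction has to account for both the direction and a possible wrap. A minimal sketch, assuming the two readings are taken less than one full counter period apart; `systick_elapsed` is a hypothetical helper, and on the PSoC 6 the readings would come from something like Cy_SysTick_GetValue():

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two reads of a down-counter that restarts from
 * `reload` after reaching zero. Only correct if the counter wrapped at
 * most once between the two reads. */
uint32_t systick_elapsed(uint32_t start, uint32_t end, uint32_t reload)
{
    assert(start <= reload && end <= reload);
    if (start >= end) {
        return start - end;              /* no wrap: counter went down */
    }
    return start + (reload + 1u) - end;  /* wrapped through zero once */
}
```

With the maximum 24-bit reload value of 0x00FFFFFF, a single wrap can be handled for intervals of up to 2^24 ticks. Longer intervals would need a wrap count maintained in the SysTick interrupt.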

I did it again but this time measuring CPU ticks from just-before calling SendNotification on one packet to just before SendNotification on the next packet, so that we measure all actions taken on the CPU.

14.8083425 p: 213  b: 0 c: 5466

14.8103384 p: 214  b: 0 c: 5477

14.8133296 p: 215  b: 0 c: 5488

14.8164630 p: 216  b: 0 c: 5499

14.8194692 p: 217  b: 0 c: 5510

14.8217761 p: 218  b: 0 c: 5521

14.8247687 p: 219  b: 0 c: 5532

14.8275906 p: 220  b: 0 c: 5543

14.8483701 p: 221  b: 0 c: 5554

14.8506753 p: 222  b: 0 c: 5565

14.9085797 p: 223  b: 0 c: 5576

14.9107031 p: 224  b: 0 c: 5587

14.9138277 p: 225  b: 0 c: 5598

14.9169681 p: 226  b: 0 c: 5609

14.9196596 p: 227  b: 0 c: 5620

14.9216107 p: 228  b: 0 c: 5631

14.9249577 p: 229  b: 0 c: 5642

14.9270470 p: 230  b: 0 c: 19842

14.9303887 p: 231  b: 0 c: 5889

14.9333992 p: 232  b: 0 c: 5675

14.9363373 p: 233  b: 0 c: 5675

14.9391034 p: 234  b: 0 c: 5686

14.9413608 p: 235  b: 0 c: 5697

There's a stutter between p=222 and p=223 (and earlier, p=220/221) but no CPU cycles spent, and a lot of CPU cycles spent at p=230 with no corresponding stuttering of packets. The other phenomenon is that 'c' is continuously growing, which happens until we get a large number of 'busy' responses (and correspondingly very high CPU cycles spent) from the BLE stack and then it goes back to being low again. But that happens independently of this stutter I'm seeing here.

If it weren't for this stutter, it seems we would be able to get the same throughput on single-core CM4 as we do on dual-core BLE, which is what I'm aiming for.

