3.7.0-3 wifi scans failing


dast_1961951

Since moving to 3.7.0-3, wifi scans fail the majority of the time.

As the snip.scan app does, we check if ( malloced_scan_result->status == WICED_SCAN_INCOMPLETE ) and treat the else branch as the scan-complete message, but that message never comes.
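For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the kind of callback being described. The types and names (scan_status_t, scan_result_t, scan_state_t) are stand-ins invented for illustration, not the WICED definitions; only the shape of the incomplete/else check follows the post above.

```c
#include <stddef.h>

/* Stand-in types for illustration only; the real ones live in the SDK headers. */
typedef enum { SCAN_INCOMPLETE, SCAN_COMPLETED } scan_status_t;

typedef struct
{
    scan_status_t status; /* SCAN_INCOMPLETE for each AP record, SCAN_COMPLETED for the final event */
    const char*   ssid;   /* only meaningful while status == SCAN_INCOMPLETE */
} scan_result_t;

typedef struct
{
    int ap_count;      /* AP records seen so far */
    int scan_complete; /* set once the final event arrives */
} scan_state_t;

/* Called once per scan event delivered to the application. */
static void scan_result_handler( scan_state_t* state, const scan_result_t* result )
{
    if ( result == NULL )
    {
        return;
    }

    if ( result->status == SCAN_INCOMPLETE )
    {
        state->ap_count++;        /* one more AP reported */
    }
    else
    {
        state->scan_complete = 1; /* the branch that is never reached in the failing case */
    }
}
```

If the final event is never delivered, scan_complete stays 0 and anything waiting on it waits forever, which matches the symptom reported here.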

mwf_mmfae

0 Likes
dast_1961951

I do see that most times, the success message makes it to WWD, but not to the scan result callback.

Scan result: channel=0 signal=-66 ssid=1871MEMBER2 bssid=58:97:1e:56:22:24

3241: Event (interface, type, status, reason): WWD_STA_INTERFACE WLC_E_ESCAN_RESULT WLC_E_STATUS_PARTIAL WLC_E_REASON_INITIAL_ASSOC

Scan result: channel=0 signal=-69 ssid=ATT208 bssid=30:60:23:65:22:90

3241: Event (interface, type, status, reason): WWD_STA_INTERFACE WLC_E_ESCAN_RESULT WLC_E_STATUS_SUCCESS WLC_E_REASON_INITIAL_ASSOC

0 Likes
Anonymous

Hi,

I tried the Wi-Fi scan with WICED 3.7.0-3 on a BCM4343W Avnet kit and did not see any issue. All the APs are found, the same as with previous versions. I also tried WICED 3.7.0 and see the same APs.

Can you tell us which platform you are using?

Using a custom platform.  (most similar to the WWCD2 devkit)

0 Likes
AxLi_1746341

dstudejio wrote:

Since moving to 3.7.0-3, wifi scans fail the majority of the time.

Do you mean it was working with an older SDK version?

If so, please share the debug logs of both the old SDK and the latest SDK for comparison.

0 Likes

It was 100% working with 3.7.0.

The debug logs are the same. Each reports a set of WLC_E_ESCAN_RESULT packets that ends with WLC_E_STATUS_SUCCESS.

The difference between the two cases is that with 3.7.0-3 the callback isn't informed that the scan is complete.

0 Likes

The current 3.7.x series SDK is not usable for us due to the BESL memory leak and other regressions.

So I'm not going to test it right now.

I will carefully verify this once I get an SDK update.

0 Likes
dast_1961951

mwf_mmfae​ any thoughts?   I notice the issue seems to occur more often when a large # of APs are present.

0 Likes

Please describe the custom platform most similar to the WWCD2 devkit.

Within this forum, we support our developer kits and to some extent those of our partners.

Are you developing with a partner module?

0 Likes

I wasn't aware custom designs were not supported here. Can you direct me to the right channel for support on issues like this?

I realized that the networking worker thread has a very small (16-slot) queue. The scan, returning well over 16 APs, overflowed this queue and the "success" message was rejected. I've increased the size of the queue and this now works 100%.

0 Likes

I would need to understand what you mean by custom.  All designs are custom for the most part from an application perspective.

However, the expectation here on the community forum is that you are using either our development kit and/or one from a module partner, along with a production module from that module partner as well.

So if you were somehow developing with an SoC, then I would have to direct you back to the local team at Cypress that signed off on the engagement so that they could line up factory support.

0 Likes

Thank you for the response. At this point the platform isn't important: the networking worker thread's default message queue size is 16 elements, which causes issues when scanning more than 16 wifi APs.

0 Likes

dstudejio wrote:

Thank you for the response. At this point the platform isn't important: the networking worker thread's default message queue size is 16 elements, which causes issues when scanning more than 16 wifi APs.

The message queue size does not limit the number of APs you can scan.

Those are two totally different things.

Well, each scan result is sent directly from WWD to the networking worker thread, which then does the callback. When more than 16 APs are scanned, the queue fills and the remaining messages don't get queued. For me, increasing the message queue size helped, but I think a better solution would handle any number of AP responses.

0 Likes

If you want to check whether it's really a queue size issue, add debug code where the message is enqueued so you will know whether the enqueue fails or not.

BTW, I'm sure I can scan more than 16 APs without modifying the queue size.

dast_1961951

I'm running into an issue that seems new in 3.7.0-3.  I haven't analyzed what actually is causing the new issue, but it doesn't occur when my code base is reverted to 3.7.0.

I noticed that when scans were completing successfully, there were always fewer than 16 APs reported. I looked into the callback mechanism and found that it was using the networking thread and that the queue size was 16 items. I hypothesized that this was the problem and increased the queue size to 40. After this change, the scan callback chain was completing for much larger numbers of APs reported. I think in the area I was in, it was around 30 APs. I changed the queue size back and again the scans failed.

I admit the dynamics of my application may be different than the snip.scan application and perhaps the other platform, but I believe the mechanics of the issue constitute a race condition.  The networking thread needs to process scans faster than the WWD can queue it up in order to handle larger numbers of APs.  This issue will be troublesome for other applications than my own.
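The mechanics described above can be modelled with a small, self-contained sketch. This is a toy model under stated assumptions, not SDK code: simulate_scan, scan_outcome_t and the drain_period parameter are hypothetical names, posting is assumed to be non-blocking (a full queue rejects the event), and the completion event is assumed to travel through the same queue as the AP results.

```c
#include <stddef.h>

typedef struct
{
    size_t dropped_events;    /* events rejected because the queue was full */
    int    completion_queued; /* 1 if the final "scan complete" event was accepted */
} scan_outcome_t;

/*
 * Post ap_count result events followed by one completion event into a
 * bounded queue of the given depth. The consumer removes one event after
 * every drain_period posts; drain_period == 0 models a consumer that is
 * blocked for the whole scan.
 */
static scan_outcome_t simulate_scan( size_t depth, size_t ap_count, size_t drain_period )
{
    scan_outcome_t outcome = { 0, 0 };
    size_t queued = 0;
    size_t i;

    for ( i = 0; i <= ap_count; i++ ) /* index ap_count is the completion event */
    {
        int is_completion = ( i == ap_count );

        if ( queued < depth )
        {
            queued++;
            if ( is_completion )
            {
                outcome.completion_queued = 1;
            }
        }
        else
        {
            outcome.dropped_events++; /* non-blocking post failed */
        }

        /* Consumer side: pop one event every drain_period posts. */
        if ( drain_period != 0 && ( i + 1 ) % drain_period == 0 && queued > 0 )
        {
            queued--;
        }
    }

    return outcome;
}
```

In this model, simulate_scan( 16, 30, 0 ) loses the completion event, while simulate_scan( 40, 30, 0 ) and simulate_scan( 16, 30, 1 ) both deliver it. That is consistent with the two observations in this thread: a deeper queue hides the problem, and so does a consumer that keeps up.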

0 Likes

dstudejio wrote:

I admit the dynamics of my application may be different than the snip.scan application and perhaps the other platform, but I believe the mechanics of the issue constitute a race condition.  The networking thread needs to process scans faster than the WWD can queue it up in order to handle larger numbers of APs.  This issue will be troublesome for other applications than my own.

Does snip.scan work without modifying queue size?

I don't want to jump to a conclusion about the fix for the problem.

(At least, changing the queue size does not currently make sense to me.)

I remember you said it worked 100% before the 3.7.0-3 SDK.

In older SDKs, the queue size is the same.

I'd rather figure out the root cause than quickly fix it with a workaround.

0 Likes

I agree with you that the queue resizing is an undesirable workaround. Unfortunately, for a couple of days I won't be in a location with that large a number of APs available, so I can't do more testing on this issue for a bit.

However, I don't think any more testing is necessary. If you review the mechanism and design of the WWD scan process, it's clear the issue is not dependent on my application. The WWD is feeding scan items to the network queue faster than it can handle, and the queue is too small to handle the number of APs that were in my location. IMO this is the root cause. The issue was present, but not exposed, in 3.7.0. My best guess, based on reviewing the differences for 3.7.0-3, is that the introduction of semaphores to the wifi scan process altered timing enough that this issue was exposed.

0 Likes

BTW, which network stack are you using?

0 Likes

for my application, NetX Duo.  The issue appears without any IP network up.

0 Likes

dstudejio wrote:

for my application, NetX Duo.  The issue appears without any IP network up.

I never use NetX Duo, I use LwIP.

Anyway,

Add code to check whether the enqueue is failing (in WICED/internal/wifi.c):

if ( wiced_rtos_send_asynchronous_event( WICED_NETWORKING_WORKER_THREAD, (event_handler_t) scan_handler->results_handler, (void*) ( result_iter ) ) != WICED_SUCCESS )
{
    // add your debug code here..
}
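One way the debug code suggested above could look is a drop counter plus a log line. The sketch below is standalone and does not call the SDK: result_t, send_event_fn and the two stub senders are hypothetical stand-ins for the real send call and its return codes, used only so the snippet compiles on its own.

```c
#include <stdio.h>

/* Stand-ins for the SDK result type and codes (assumptions for illustration). */
typedef int result_t;
#define RESULT_SUCCESS    0
#define RESULT_QUEUE_FULL 1

/* Stand-in for "post this event to the worker thread". */
typedef result_t (*send_event_fn)( void* arg );

/* Running total of scan events the worker queue refused. */
static unsigned int scan_events_dropped = 0;

/* Wrap the send so every rejected event is counted and reported. */
static result_t send_scan_event_checked( send_event_fn send, void* arg )
{
    result_t result = send( arg );

    if ( result != RESULT_SUCCESS )
    {
        scan_events_dropped++;
        printf( "scan event dropped, result=%d, total dropped=%u\n", (int) result, scan_events_dropped );
    }

    return result;
}

/* Stub senders used only to exercise the wrapper. */
static result_t stub_send_ok( void* arg )         { (void) arg; return RESULT_SUCCESS; }
static result_t stub_send_queue_full( void* arg ) { (void) arg; return RESULT_QUEUE_FULL; }
```

A non-zero counter at the end of a scan would confirm that events, possibly including the final one, were rejected at enqueue time rather than lost later in the callback chain.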

0 Likes

Sorry - I'm not in the test env where I can accomplish this right now. 

I have to close on this investigation for now.  I am 100% certain at this point based on my work that you will find that the debug code you mentioned will run. 

I'm happy to discuss any solutions to remove the race condition.

0 Likes

Realized that one of the things that exacerbated this issue is that I had a higher priority task pretty much blocking the networking thread.  With the networking thread unblocked, it seems to service this queue just fine.  I'm still a little concerned that this design doesn't take into account queue fill up (for other real time loads).

0 Likes

I am encountering the same issue.  We have over 40 access points showing up and we never receive a scan complete status.  With the introduction of the semaphore, snip.scan never proceeds past the first scan.  We see the same problem in our application.  Changing the semaphore timeout to 10 seconds provides a temporary workaround for us, but I think a proper fix is warranted.

0 Likes

Indeed. I just got word that a major release (WICED 4.0???) is due any day now, which might fix this. I fixed it on my side by making sure my higher priority threads weren't dominating, which allows the networking thread to process the queue.

0 Likes

dstudejio

Please check your inbox for the invitation I sent.

djjw

0 Likes
Anonymous

I fixed this by making sure my higher priority threads weren't dominating & allowing the networking thread to process.

> If you have a high priority thread dominating, then I think fixing it is the right approach. You may have a lot more problems than just the scan results.

agreed - the task priority change definitely helped with overall stability.

0 Likes

This issue has been fixed; the queue handling the scan results in wifi.c was getting overfilled. The next SDK will have this fix.

vik86 wrote:

This issue has been fixed; the queue handling the scan results in wifi.c was getting overfilled. The next SDK will have this fix.

Can you post the fix so people can verify it right away?

0 Likes

vik86 wrote:

This issue has been fixed; the queue handling the scan results in wifi.c was getting overfilled. The next SDK will have this fix.

Can you just post the fix?

I think the best way to resolve an issue reported on the forum is to post the fix rather than ask users to test the next (not yet released) SDK.

That way the reporter can review and verify it before the new SDK release.

Otherwise, it's possible the issue will still be present in the next SDK, in which case people have to wait for yet another SDK.

0 Likes

axel.lin

Attaching a temporary early access fix. Replace this in <WICED_SDK>/WICED/internal

Somehow, I feel uncomfortable about the changes. I think the fix is *wrong*.

Leaving aside a pointless rename (scan_handler->in_scan_handler), which makes the diff bigger, and an unnecessary change to add the g_scan_semaphore_and_timer_inited flag.

wiced_wifi_init() is supposed to be done before the scan, so I have no idea why you need the g_scan_semaphore_and_timer_inited flag.

The main change is:

Now you have 2 paths that call scan_handler.results_handler(scan_result_ptr);

One is in WICED_NETWORKING_WORKER_THREAD and the other is called directly by wiced_wifi_scan_networks_ex().

It's possible that your direct call to scan_handler.results_handler() handles scan_complete before WICED_NETWORKING_WORKER_THREAD handles the rest of the scan results.

The original event queue guarantees the order in which scan results are processed, so scan_complete will always be the last one.

I think the new code is worse than the original behavior.
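The ordering concern raised above can be illustrated with a small standalone model. It is not the attached fix, which is not reproduced in this thread; the names (handler_log_t, deliver_all_via_queue, deliver_completion_directly) are hypothetical, and the second function deliberately shows the worst-case interleaving, where the direct call runs before the worker has drained anything.

```c
#define MAX_EVENTS 32
#define SCAN_DONE  (-1) /* marker for the final "scan complete" event */

/* Records the order in which the handler saw events. */
typedef struct
{
    int seen[MAX_EVENTS];
    int count;
} handler_log_t;

static void results_handler( handler_log_t* log, int event )
{
    if ( log->count < MAX_EVENTS )
    {
        log->seen[log->count++] = event;
    }
}

/* Position at which SCAN_DONE was handled, or -1 if it never was. */
static int completion_position( const handler_log_t* log )
{
    int i;
    for ( i = 0; i < log->count; i++ )
    {
        if ( log->seen[i] == SCAN_DONE )
        {
            return i;
        }
    }
    return -1;
}

/* Single path: results and completion all pass through one FIFO. */
static handler_log_t deliver_all_via_queue( int ap_count )
{
    handler_log_t log = { { 0 }, 0 };
    int queue[MAX_EVENTS];
    int queued = 0;
    int i;

    for ( i = 0; i < ap_count && queued < MAX_EVENTS - 1; i++ )
    {
        queue[queued++] = i;
    }
    queue[queued++] = SCAN_DONE;

    for ( i = 0; i < queued; i++ ) /* worker drains in FIFO order */
    {
        results_handler( &log, queue[i] );
    }
    return log;
}

/* Two paths: results are queued, completion is delivered by a direct call. */
static handler_log_t deliver_completion_directly( int ap_count )
{
    handler_log_t log = { { 0 }, 0 };
    int queue[MAX_EVENTS];
    int queued = 0;
    int i;

    for ( i = 0; i < ap_count && queued < MAX_EVENTS - 1; i++ )
    {
        queue[queued++] = i;
    }

    results_handler( &log, SCAN_DONE ); /* direct call overtakes the queued results */

    for ( i = 0; i < queued; i++ )      /* worker drains afterwards */
    {
        results_handler( &log, queue[i] );
    }
    return log;
}
```

With five APs, the single-path version handles the completion last (position 5), while the two-path version handles it first (position 0), so an application that stops listening on completion would miss every queued result.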

I am just wondering if it is the same problem I see when too many APs are present. I didn't read the entire thread, but I saw something similar and just wanted to confirm. Also, are you on dual band or 2.4G only? I think the WCD2 board is dual band?

0 Likes

Thank you for your explanation - I'm sorry if I misinterpreted.

I'm working only on 2.4G. But the environment I am testing in has 30+ APs, even at 2.4 GHz. Many are at marginal range, which makes the issue appear intermittently (the scan succeeds when fewer than 16 APs are reported).

0 Likes