I have seen some network problems seemingly related to scenarios where the link goes down and then comes back up some time later.
1. data_abort exception in mbedtls_ssl_close_notify
2. prefetch abort exception in do_memp_free_pool (when the below assert has been removed)
3. lwip assert in do_memp_free_pool LWIP_ASSERT( "Freeing a buffer that is already freed", tmp_memp != memp );
4. lwip thread lockup / blocked in sys_arch_sem_wait(sem, 0), called from tcpip_send_msg_wait_sem
#1 and #2 have mostly gone away after adding a delay once the link-up callback is called, and verifying that the link is good and DHCP is valid prior to any network comms (connect).
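A minimal sketch of that kind of gate, assuming the application tracks link and DHCP state itself. The struct, the function names, and the 2000 ms settle delay are all hypothetical and are not WICED or lwIP APIs; the real values would come from the link callback and the DHCP client status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of network state, filled in by the application
 * from its own link callback and DHCP status (not a WICED structure). */
typedef struct {
    bool     link_up;          /* last state reported by the link callback */
    bool     dhcp_bound;       /* true once a valid lease is held */
    uint32_t link_up_time_ms;  /* tick count when link-up was reported */
} net_state_t;

#define LINK_SETTLE_MS 2000u   /* assumed settle delay; tune per product */

/* Record a link event. A link-down also invalidates the DHCP state so the
 * lease has to be re-validated before the next connect. */
static void net_on_link_event(net_state_t *s, bool up, uint32_t now_ms)
{
    assert(s != NULL);
    s->link_up = up;
    if (up) {
        s->link_up_time_ms = now_ms;
    } else {
        s->dhcp_bound = false;
    }
}

/* True only when the link is up, DHCP is bound, and the link has been up
 * for at least LINK_SETTLE_MS. */
static bool net_ready_to_connect(const net_state_t *s, uint32_t now_ms)
{
    assert(s != NULL);
    if (!s->link_up || !s->dhcp_bound) {
        return false;
    }
    /* Unsigned subtraction stays correct across tick counter wraparound. */
    return (uint32_t)(now_ms - s->link_up_time_ms) >= LINK_SETTLE_MS;
}
```

The connect path would call net_ready_to_connect() first and back off when it returns false, rather than attempting TLS on a link that has only just come up.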
#3 I have no idea of the root cause, but I guess there is a condition where a pointer is being freed more than once.
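To illustrate why a double free is so damaging in a free-list pool, here is a toy fixed-size pool. It is not lwIP's memp code and all names are hypothetical; it only shows the kind of check the "already freed" assert performs. Without the walk in pool_free(), freeing the head element twice would point it at itself, and later allocations would hand the same buffer to two owners.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy pool element: while free, the element itself is the list node. */
typedef struct pool_elem {
    struct pool_elem *next;
} pool_elem_t;

typedef struct {
    pool_elem_t *free_list;  /* singly linked list of free elements */
} pool_t;

/* Put every element of the caller-supplied storage on the free list. */
static void pool_init(pool_t *p, pool_elem_t *storage, size_t count)
{
    assert(p != NULL && storage != NULL);
    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {
        storage[i].next = p->free_list;
        p->free_list = &storage[i];
    }
}

/* Pop the head of the free list, or return NULL when the pool is empty. */
static pool_elem_t *pool_alloc(pool_t *p)
{
    assert(p != NULL);
    pool_elem_t *e = p->free_list;
    if (e != NULL) {
        p->free_list = e->next;
    }
    return e;
}

/* Return an element to the pool. Walks the free list first and refuses
 * (returns false) if the element is already on it, i.e. a double free. */
static bool pool_free(pool_t *p, pool_elem_t *e)
{
    assert(p != NULL && e != NULL);
    for (pool_elem_t *t = p->free_list; t != NULL; t = t->next) {
        if (t == e) {
            return false;
        }
    }
    e->next = p->free_list;
    p->free_list = e;
    return true;
}
```

Logging the caller whenever pool_free() returns false would be one way to find which code path frees the buffer the second time.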
#4 This is a hard one. I have seen this rarely in the past, but I just had two out of 120 units lock up. Previously I have seen this happen when a TCP connect is attempted. Even though a timeout value is passed through the WICED network layers, it is still possible to wait forever for the semaphore in lwIP. My application has no idea this is happening, and the system monitor thread is happy, so the watchdog never fires. When this happens, the rest of my app code is running but no lwIP code is executing, including DHCP renews, renewal timers, the coarse timer, etc.
I believe the last time I saw this with the debugger the call tree was
Basically, a message is posted to the tcpip thread but the semaphore is never released, indicating the tcpip thread is not running.
Is there any way to tell from the application code whether the TCP timers are running or have been disabled?
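One possible approach is a heartbeat: have the tcpip thread bump a counter, and have the monitor thread withhold the watchdog kick when the counter stops moving. The sketch below shows only the bookkeeping, with hypothetical names. It assumes heartbeat_tick() is scheduled to run in the tcpip thread, for example by posting it with lwIP's tcpip_callback(), whose callback type is void (*)(void *ctx).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical heartbeat state shared by the tcpip and monitor threads. */
typedef struct {
    volatile uint32_t beats;  /* incremented only from the tcpip thread */
    uint32_t last_seen;       /* value of beats at the last check */
    uint32_t missed;          /* consecutive checks with no progress */
    uint32_t max_missed;      /* checks tolerated before declaring a stall */
} tcpip_heartbeat_t;

/* Meant to run in the tcpip thread. The signature matches lwIP's
 * tcpip_callback_fn so it can be posted with tcpip_callback(). */
static void heartbeat_tick(void *ctx)
{
    tcpip_heartbeat_t *hb = (tcpip_heartbeat_t *)ctx;
    assert(hb != NULL);
    hb->beats++;
}

/* Called periodically from the monitor thread. Returns false once the
 * counter has not moved for max_missed consecutive checks. */
static bool heartbeat_check(tcpip_heartbeat_t *hb)
{
    assert(hb != NULL);
    uint32_t now = hb->beats;
    if (now != hb->last_seen) {
        hb->last_seen = now;
        hb->missed = 0;
        return true;
    }
    hb->missed++;
    return hb->missed < hb->max_missed;
}
```

When heartbeat_check() returns false, the monitor thread could stop kicking the watchdog so the unit resets instead of staying locked up. One caveat: if the tcpip thread is stalled, its mailbox can fill and a blocking post may block the poster too, so the post should come from a thread that is allowed to hang, or use a non-blocking variant if the lwIP version in use provides one.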
I periodically get an assert in do_memp_free_pool with a "mem properly aligned" error, at intervals that would suggest DHCP renewal is about to start. In our testing, our network has aggressive DHCP lease times of 20 minutes, which means renews happen every 10. I am seeing at least one device reset per hour. Any ideas?
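For reference, the 10-minute figure follows from the RFC 2131 defaults: renewal (T1) at 0.5 of the lease and rebinding (T2) at 0.875 of the lease, unless the server overrides them. A small sketch of that arithmetic, with hypothetical names, useful for lining up reset timestamps against expected renew times:

```c
#include <assert.h>
#include <stdint.h>

/* Renew (T1) and rebind (T2) offsets from lease start, in seconds. */
typedef struct {
    uint32_t t1_s;
    uint32_t t2_s;
} dhcp_timers_t;

/* RFC 2131 defaults: T1 = 0.5 * lease, T2 = 0.875 * lease.
 * A server may override both, in which case these values do not apply. */
static dhcp_timers_t dhcp_default_timers(uint32_t lease_s)
{
    dhcp_timers_t t;
    assert(lease_s > 0);
    t.t1_s = lease_s / 2u;
    /* 64-bit intermediate avoids overflow for very long leases. */
    t.t2_s = (uint32_t)(((uint64_t)lease_s * 7u) / 8u);
    return t;
}
```

For a 20-minute lease (1200 s) this gives T1 = 600 s and T2 = 1050 s, so a renew attempt is expected roughly every 10 minutes while the lease keeps being extended.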
#4 and #5 are a major issue from a stability standpoint. Our product has a slim margin in which to perform a reset and connect back to the network.
Mainly using TCP + TLS, but some UDP as well.
My link up/down test triggers more errors when run against multiple Aruba APs, deleting the SSID and recreating it, than when testing with a single Netgear. When a network is deleted on Aruba APs with a virtual controller, the deletion seems to propagate from AP to AP, so a client device may drop one AP and associate to the next before the network has been deleted there.