TLS failure, ERROR_INVALID_RECORD (5007), only occurs on certain WiFi networks


user_108962310

I am having an issue with an old version of the SDK, 3.1.2, on devices that are presently in the field: `wiced_tcp_stream_read` returns 5007, indicating an invalid TLS record, and I cannot explain why.


The TCP+TLS stream is read as part of a file download: the code manually opens a socket, sets up a TLS context, creates a stream, constructs an HTTP request and sends it to the stream, parses the HTTP response header, and then reads the binary body 1024 bytes at a time until the end.
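For anyone trying to reproduce this, the body-read loop described above can be sketched as follows, with the failure offset recorded so it can be compared between runs and against a packet capture. This is only a minimal sketch: `read_fn_t`, `download_body`, and `fake_read` are hypothetical names, and the callback merely stands in for the real stream read call, whose exact signature is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 1024u

/* Hypothetical callback standing in for the real stream read call.
 * Returns 0 on success or a non-zero error code (for example 5007). */
typedef int (*read_fn_t)(void *ctx, uint8_t *buf, size_t len);

typedef struct {
    size_t bytes_done; /* body bytes read successfully before stopping */
    int    error;      /* 0 if the whole body was read, else the error code */
} download_result_t;

/* Read content_length body bytes in CHUNK_SIZE pieces and report how far
 * the download got, so the failing offset can be logged. */
download_result_t download_body(read_fn_t read_chunk, void *ctx,
                                size_t content_length)
{
    static uint8_t chunk[CHUNK_SIZE];
    download_result_t r = { 0, 0 };

    assert(read_chunk != NULL);

    while (r.bytes_done < content_length) {
        size_t remaining = content_length - r.bytes_done;
        size_t want = remaining < CHUNK_SIZE ? remaining : CHUNK_SIZE;
        int err = read_chunk(ctx, chunk, want);
        if (err != 0) {
            r.error = err; /* keep the offset reached in bytes_done */
            break;
        }
        /* On the real device, the chunk would be written to flash here. */
        r.bytes_done += want;
    }
    return r;
}

/* Test double (not part of any SDK): serves bytes until fail_at bytes
 * have been delivered, then returns the configured error code. */
typedef struct {
    size_t served;
    size_t fail_at;
    int    error;
} fake_stream_t;

int fake_read(void *ctx, uint8_t *buf, size_t len)
{
    fake_stream_t *s = ctx;
    if (s->served + len > s->fail_at) {
        return s->error;
    }
    memset(buf, 0xA5, len);
    s->served += len;
    return 0;
}
```

Passing `fake_read` with a `fake_stream_t` whose `fail_at` is 80 * 1024 mimics the behaviour reported here: the loop stops with the error code and `bytes_done` holds the offset at which the read failed.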

The particularly odd thing is that this code works fine on 95% of WiFi APs, but on a mobile hotspot (including an iPhone 6S) it hits this error at the same point in the download. The file is ~300KB, and the error happens at ~80KB, repeatably, always at the same offset into the stream.

I am running on a platform that is *somewhat* resource constrained, with 128KB of RAM. However, again: it works fine on some WiFi APs, and not on others.

I have tested this with an nginx server configured for both 16KB and 2KB TLS records, via the `ssl_buffer_size` parameter, and the problem occurs in both cases.

I am running it on hardware with a debug build, but no debug or malloc-related breakpoints or hardfaults are being hit.

As far as I can tell, the failure is in `wiced_tls_receive_packet`, line 750, on the call to:
`result = tls_get_next_record( context, &record, timeout, TLS_RECEIVE_PACKET_IF_NEEDED );`

Unfortunately, I can't go deeper than that, since BESL in 3.1.2 is a binary.

I don't know enough about TLS at the moment to debug the results returned within `wiced_tls_receive_packet`.

Any ideas as to why this is happening?

GauravS_31
Moderator

Essentially, in the TLS record protocol the data is divided into fragments; each fragment is optionally compressed, a MAC is appended, and the result is encrypted and transmitted as a TLS record. The receiver performs the inverse operations and uses the MAC to verify the data. In your case, it is possible that, because of resource constraints on your device, there was not enough memory to fetch the complete record once the download passed 80kB, so the device received only a "partial" record. MAC verification could then have failed, resulting in the invalid record error. Check whether freeing up some memory on your device reduces how often the issue occurs. You may need to deliver the changes via OTA, since the devices are already deployed in the field.
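To make the record framing concrete when looking at a capture: every TLS record starts with a 5-byte header (content type, two version bytes, 16-bit big-endian length), and in TLS 1.2 the ciphertext length may not exceed 2^14 + 2048 bytes. The sketch below is a standalone sanity check of such a header. It is not BESL's code, which is a binary and whose internal checks are unknown; `tls_parse_record_header` and its types are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_HEADER_LEN     5u
#define TLS_MAX_CIPHERTEXT (16384u + 2048u) /* RFC 5246 TLSCiphertext limit */

typedef struct {
    uint8_t  content_type;  /* 20 = change_cipher_spec, 21 = alert,
                               22 = handshake, 23 = application_data */
    uint8_t  version_major; /* 3 for SSL 3.0 through TLS 1.2 */
    uint8_t  version_minor;
    uint16_t length;        /* length of the encrypted fragment */
} tls_record_header_t;

/* Parse the 5-byte record header at p. Returns 1 if the header looks
 * plausible, 0 if there are too few bytes or a field is out of range. */
int tls_parse_record_header(const uint8_t *p, size_t avail,
                            tls_record_header_t *out)
{
    assert(p != NULL && out != NULL);

    if (avail < TLS_HEADER_LEN) {
        return 0; /* header itself is incomplete */
    }
    out->content_type  = p[0];
    out->version_major = p[1];
    out->version_minor = p[2];
    out->length        = (uint16_t)((p[3] << 8) | p[4]);

    if (out->content_type < 20 || out->content_type > 23) {
        return 0; /* not one of the four base content types */
    }
    if (out->version_major != 3) {
        return 0;
    }
    if (out->length > TLS_MAX_CIPHERTEXT) {
        return 0;
    }
    return 1;
}
```

If bytes were dropped or skipped in the stream, the receiver would read payload bytes where it expects the next header, and a check like this would reject them. That is one way a framing problem could surface as an invalid record rather than as a MAC failure, though whether it is what happens here is only a guess.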

If the issue is caused by a lack of memory, it should return ERROR_OUT_OF_MEMORY rather than ERROR_INVALID_RECORD, right?


Yes. Based on the above analysis, the error should have been ERROR_OUT_OF_MEMORY or ERROR_INVALID_MAC. Still, freeing up memory is a simple way to rule out memory allocation as the cause of the issue. Alternatively, he could run two other tests:

1. Capture TLS record packets when the error happens and analyse.

2. Check if the issue happens on non-WICED devices or PCs using those 5% APs. This would help us understand whether the issue lies in those APs or WICED boards.

Would a WireShark capture be sufficient to accomplish #1?


It would depend on the communicating device. On a Windows PC, WinPcap (as used by Wireshark) does not support monitor mode, but you can still capture TLS record packets going to and from the Wi-Fi interface of the PC itself. However, if the communication is between two different WICED devices, WinPcap will not help; you would need Npcap.

grsr

I wanted to revisit this issue, since it is still a problem for me.

I created a very minimal application that uses only 35360 B of the 128KB RAM. Getting there involved switching the application main thread to static allocation (12K in total), removing the internal DHCP server, disabling AP mode, allowing a maximum of 1 TLS session, and retaining only the STA IP stack while setting the others to null pointers.

Unfortunately, I am still hitting a 5007 result somewhere around 82-84KB into the stream. It always fails at exactly the same spot, and as mentioned before, running a 'debug' build does not produce any breakpoint traps.

Interestingly, if I remove the inline calls to wiced_framework_app_write_chunk (read from stream, write to flash, repeat), then the download succeeds almost all the time.
Is there a TTL/refresh rate component to TLS that needs to be respected by the system?
I have noticed that 3.1.2 uses a max_write_size of 1B for Macronix flashes, so it takes 100-150ms to write a 1024-byte buffer to the sflash.


I can only say that it is possible to hit the 5007 error in sdk-3.1.2.

I have seen this error multiple times.

I'm not sure whether Cypress still provides bug fixes for the old BESL library; this needs to be clarified first.


Andrew,

Unfortunately, we do not support the BESL library. If the workaround you described solves your problem, you can consider using it. Without TLS record captures, it is difficult to comment on the behaviour of your system.
