0 Replies Latest reply on May 10, 2019 1:45 AM by user_108962310

    CRC32 by hardware peripheral in STM32


      While looking for places to reduce code size, I noticed that the crc32_table takes up 1024 bytes in flash. I looked into the necessity of the table to consider a size/speed tradeoff, but I learned that the STM32 micros have a CRC32 hardware peripheral in them.
      There's no reason that we shouldn't be able to use that, right?
      I hope I am not missing something and duplicating work that already exists in the SDK:


      In libraries/utilities/crc, I changed the implementation to:


      // STM32 chips have an on-chip CRC32 generator
      #ifdef USE_STM32_CRC32_PERIPH
      #include "stm32f4xx_crc.h"

      uint32_t WICED_CRC_FUNCTION_NAME( crc32 )
      (
              uint8_t* pdata,      /* pointer to array of data to process */
              unsigned int nbytes, /* number of input data bytes to process */
              uint32_t crc         /* either CRC32_INIT_VALUE or previous return value */
      )
      {
          // reset CRC for INIT_VALUE
          if(crc == CRC32_INIT_VALUE){
              CRC_ResetDR();
          }
          //otherwise, we will have to assume that the previous result is still in the peripheral CRC buffer
          //note: nbytes/4 rounds down, so up to 3 trailing bytes are not processed
          return CRC_CalcBlockCRC((uint32_t*)pdata, nbytes/4);
      }
      #else
      static const uint32_t crc32_table[256] = {
      //...the rest of the existing implementation
      #endif //USE_STM32_CRC32_PERIPH

      And added USE_STM32_CRC32_PERIPH to GLOBAL_DEFINES at the platform level.

      I also had to add a clock-enable command in the platform_init_external_devices() function ... is there a better place to put this?


      /* CRC internal peripheral */

      RCC_AHB1PeriphClockCmd( RCC_AHB1Periph_CRC , ENABLE );


      The obvious shortfall is when nbytes is not a multiple of 4, because nbytes/4 rounds down and the trailing 1-3 bytes are never fed to the peripheral. In testing, that does not seem to have come up yet.
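
      One possible way to cover lengths that are not a multiple of 4 is to zero-pad the last 1-3 bytes into a full word before feeding it. The sketch below is only an illustration under assumptions: crc_reset(), crc_feed_word() and crc32_words_padded() are hypothetical names, and the peripheral is replaced by a small software stand-in (polynomial 0x04C11DB7, MSB first, one 32-bit word per step) so the code runs on a host. On target, the stand-in would be replaced by CRC_ResetDR() and writes to / reads from CRC->DR.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side stand-in for the peripheral's data register.
 * On target this state lives in CRC->DR. */
static uint32_t crc_state = 0xFFFFFFFFu;

/* Stand-in for CRC_ResetDR(): reload the initial value. */
static void crc_reset(void)
{
    crc_state = 0xFFFFFFFFu;
}

/* Stand-in for a write to CRC->DR: fold one 32-bit word into the state,
 * MSB first, polynomial 0x04C11DB7. */
static void crc_feed_word(uint32_t word)
{
    crc_state ^= word;
    for (int bit = 0; bit < 32; bit++) {
        crc_state = (crc_state & 0x80000000u)
                        ? (crc_state << 1) ^ 0x04C11DB7u
                        : (crc_state << 1);
    }
}

/* Feed nbytes of data; the last 1-3 bytes, if any, are zero-padded
 * into one extra word instead of being dropped. */
static uint32_t crc32_words_padded(const uint8_t *pdata, size_t nbytes)
{
    uint32_t word;
    size_t   full = nbytes / 4;   /* number of complete words */
    size_t   tail = nbytes % 4;   /* leftover bytes, 0..3 */

    for (size_t i = 0; i < full; i++) {
        memcpy(&word, pdata + 4 * i, 4);   /* no unaligned uint32_t* access */
        crc_feed_word(word);
    }
    if (tail != 0) {
        word = 0;
        memcpy(&word, pdata + 4 * full, tail);
        crc_feed_word(word);
    }
    return crc_state;
}
```

      With zero padding, a buffer and the same buffer followed by zero bytes up to the next word boundary give the same result, so this only works if the length is checked somewhere else. It also yields a value that is self-consistent rather than one that is guaranteed to match a byte-wise table implementation.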

      Testing the two implementations, I don't seem to hit any errors in a debug build, and the DCT survives through a wiced_dct_write() and a reboot.
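
      For an off-target cross-check of the two implementations, a host-side model of the peripheral can help. The sketch below assumes the F4's CRC unit behaves as described in the reference manual (fixed polynomial 0x04C11DB7, initial value 0xFFFFFFFF, one 32-bit word per write, MSB first, no bit reflection, no final XOR). It also assumes, without having verified it against the SDK's crc32_table, that the table variant is the common reflected CRC-32 (zlib/Ethernet style). All function names here are hypothetical. Under those assumptions, the peripheral's raw output differs from the reflected CRC-32 unless each input word and the final value are bit-reversed and the final XOR is applied.

```c
#include <stdint.h>
#include <stddef.h>

/* Reverse the bit order of a 32-bit value. */
static uint32_t rbit32(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Model of one write to the CRC data register: MSB first,
 * polynomial 0x04C11DB7, no reflection. */
static uint32_t stm32_crc_word(uint32_t crc, uint32_t word)
{
    crc ^= word;
    for (int bit = 0; bit < 32; bit++)
        crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : (crc << 1);
    return crc;
}

/* Model of CRC_ResetDR() followed by CRC_CalcBlockCRC(words, nwords). */
static uint32_t stm32_crc_block(const uint32_t *words, size_t nwords)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < nwords; i++)
        crc = stm32_crc_word(crc, words[i]);
    return crc;
}

/* Bitwise reference for the reflected CRC-32: polynomial 0xEDB88320,
 * initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF. */
static uint32_t crc32_reflected(const uint8_t *data, size_t nbytes)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < nbytes; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Reflected CRC-32 computed through the peripheral model: bit-reverse each
 * little-endian input word, bit-reverse the result, apply the final XOR. */
static uint32_t crc32_via_stm32_model(const uint32_t *le_words, size_t nwords)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < nwords; i++)
        crc = stm32_crc_word(crc, rbit32(le_words[i]));
    return rbit32(crc) ^ 0xFFFFFFFFu;
}
```

      On target, rbit32() would typically map to the CMSIS __RBIT intrinsic. Comparing stm32_crc_block() and crc32_via_stm32_model() against the table result for a few known buffers should show which convention the table actually follows.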


      Using the cycle counter on CRC_INNER_LOOP for the 32-bit case, it looks like the software implementation takes 38 cycles per byte, whereas the peripheral should take 4, and it could also save wait states when reading table entries from flash. (Maybe the Cortex pipeline takes care of that; I don't know enough details about it.)

      It appears that the peripheral version runs ~250K-500K cycles quicker than the software implementation, or around 2-5 ms on my F4 platform.

      The 1K crc32_table also no longer appears in the output image.