10 Replies Latest reply on Jan 7, 2015 5:24 AM by legic_1490776

    Strange problem - running firmware gets corrupted? (20732S)

    legic_1490776

      Recently we ran into an unusual failure at our customer's site that I am trying to debug.

      Unfortunately, we have limited data, since it happened off-site and the problem has since corrected itself, as I will explain.

       

      The failure mode was that the device would boot up and for the most part work properly, except for one particular feature failing to execute.  I was at a loss to explain this, as the presence of the feature was not subject to configuration, and the system would regularly boot up from deep sleep (hence the problem could not be any weird state in the RAM, as it gets cleared on coming out of sleep). I suspected a hardware problem with the device, since I could not find a software-based explanation.

       

      Then the customer inadvertently upgraded the device OTA to a newer revision of the firmware, and the problem went away.  The difference between the two revisions was minimal - mostly a change to the version number itself.

       

      This suggested the possibility that somehow, the booting firmware image had gotten changed in some way that caused the weird behavior (but did not crash the system).

       

      So, my question:

       

      Is this even possible?  Does the firmware image have a checksum that is checked prior to boot?

      Is it possible for an application to inadvertently modify the boot image?

       

      This theory doesn't really seem that likely - I would expect a corrupted firmware image to have a high probability of just crashing the device.  But, I also am having trouble coming up with another explanation.

        • 1. Re: Strange problem - running firmware gets corrupted? (20732S)
          BoonT_56

          Where are you located? You may want to contact the local Broadcom representative for assistance if the issue persists...

          • 2. Re: Strange problem - running firmware gets corrupted? (20732S)
            legic_1490776

            We are located in Cambridge, MA. Do you have a recommendation on who to contact?

             

            Also, I am still interested in understanding whether it's possible for an application to modify the active fw image and whether there is a checksum.

            • 3. Re: Strange problem - running firmware gets corrupted? (20732S)
              MichaelF_56

              Your local Broadcom Manufacturer's Rep is listed below:

               

              Synergy Associates

              85 Rangeway Road

              Bldg 3, Suite 260

              N. Billerica MA 01862
              Tel: 781-238-0870

              ShawnA_01

               

              • 4. Re: Strange problem - running firmware gets corrupted? (20732S)
                MichaelF_56

                I wanted to check in and let you know that we are still escalating this internally with the Business Unit and trying to get some cycles from the development team to investigate.  Thanks for your patience.

                 

                ShawnA_01 jota_1939431 tomc_91 ArvindS_76

                • 5. Re: Strange problem - running firmware gets corrupted? (20732S)
                  legic_1490776

                  Thanks for pursuing it further.

                  The tricky thing with this is that it seems to be a quite rare event... we have only observed it once.

                  Any insight as to what could possibly cause this could be helpful.

                  • 6. Re: Strange problem - running firmware gets corrupted? (20732S)

                    Can you expand on only observing this once?  Did you observe it only once while your customer saw it more frequently before the OTA update, or has this problem happened only one time, period?

                    • 7. Re: Strange problem - running firmware gets corrupted? (20732S)
                      legic_1490776

                      This situation has only occurred once to my knowledge, and it was on a customer's device.  At the moment, there is not a very large population of devices (~150 total) so it's hard to understand how often this will be likely to occur.  Our customer has started releasing these devices to their customers at a rate of 3K/mo so we may start seeing it more frequently.

                       

                      When I say it was observed once, what we observed was what appeared to be a single differently-behaving device, behaving in a way we could not explain.  This unexplained behavior continued to occur across boots of the device; since our app frequently enters deep sleep mode, whenever it woke up it booted up and had this strange behavior.  After a while of trying to understand what was going on, the customer performed an OTA upgrade, and after the upgrade the device resumed behaving normally.  It occurred to us that the strange behavior was related to a failed or partial OTA, but we track attempted OTAs as a counter on the device and the change in behavior does not seem to be related to a failed OTA.

                       

                      The fact that the behavior persisted across boots, but was corrected by an OTA, is particularly strange.  When booting from deep sleep, we restore two data structures containing our prior state from NVRAM, and of course no other state from RAM would persist.  But the same is true of an OTA upgrade: if some bad state was in those NVRAM data structures, it would still be there after the OTA. 

                       

                      Based on all this, it seemed like the most likely explanation was that somehow, the firmware image that is currently active was inadvertently altered in such a way that the system continued to run but exhibited this unexpected behavior.  This led me to ask whether this is possible, or whether it is extremely unlikely to have occurred. If it's extremely unlikely to happen, then we would need to look at other explanations, although currently I don't see what else could explain this.

                       

                      Unfortunately, after the OTA the problem is gone so it is very hard to debug now unless it happens again.

                       

                      So my questions were:

                       

                      (1) Is there any OS-level protection against overwriting some part of the active image?  Or could some randomly misbehaving code accidentally write to the image?  (This would be outside the context of our existing OTA code, since we do not believe that code was entered.)

                       

                      (2) Is there any sort of checksum on the active image that is checked on boot?  If so, that would seem to preclude this kind of corruption.  Unfortunately, the OTA occurred before we were able to capture any serial debug output, so if there was any serial error output, we missed it.

                       

                      (3) Assuming that the answer to (2) is no, is it possible for us to implement our own firmware checksum code?  We could run a checksum on the active image on boot and report it, so that our server side can tell whether the image is OK.
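To make question (3) concrete, here is a minimal sketch of the checksum half of that idea, in portable C. It assumes the bytes of the active image can be obtained somehow, which this thread does not establish for the 20732S. The names `crc32_update` and `image_matches` are illustrative and are not part of any Broadcom or WICED API. The CRC is the common CRC-32 (IEEE 802.3), so the expected value could be produced at build time with standard tools.

```c
#include <stddef.h>
#include <stdint.h>

/* Update a running CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
 * Start with crc = 0 and feed the image in one call or in several chunks;
 * the result of one call can be passed as the crc argument of the next. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial if the low bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* ASSUMPTION: 'image' points at a readable copy of the firmware image and
 * 'expected_crc' was recorded when the image was built.
 * Returns 1 if the computed CRC matches, 0 otherwise. */
int image_matches(const uint8_t *image, size_t len, uint32_t expected_crc)
{
    return crc32_update(0u, image, len) == expected_crc;
}
```

The computed value, rather than only a pass/fail flag, could be reported to the server on boot, so the server can tell a corrupted image apart from an unexpected firmware version.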

                      • 8. Re: Strange problem - running firmware gets corrupted? (20732S)
                        legic_1490776

                        It looks as if we may have observed this a second time.  It's on a fielded device, however, so it's hard for us to debug.

                        In this instance, it's possible that it could have happened as a result of a stack overflow which was in a firmware version we released but has since been fixed.

                         

                        Is there any way to verify the firmware loaded on the device?  That is, I would like to be able to checksum the firmware on every boot to make sure it's the right version and has not been corrupted.

                        • 9. Re: Strange problem - running firmware gets corrupted? (20732S)
                          MichaelF_56

                          1.

                          Yes, there is a mechanism in place to prevent the OS from overwriting an active part of the current image.

                           

                          2.

                          Not on boot as the image has already been loaded.  However, there is a CRC check done during each OTA transfer.

                           

                          3.

                          This is already taken care of as a part of the OTA process.

                           

                          A detailed description of the secure OTA process is available here: WICED Secure Over-the-Air Firmware Upgrade Application Note (SDK 2.x and TAG3 Board)

                          • 10. Re: Strange problem - running firmware gets corrupted? (20732S)
                            legic_1490776

                            mwf_mmfae wrote:

                             

                            1.

                            Yes, there is a mechanism in place to prevent the OS from overwriting an active part of the current image.

                             

                            2.

                            Not on boot as the image has already been loaded.  However, there is a CRC check done during each OTA transfer.

                             

                            3.

                            This is already taken care of as a part of the OTA process.

                             

                            A detailed description of the secure OTA process is available here: WICED Secure Over-the-Air Firmware Upgrade Application Note (SDK 2.1 and TAG3 Board)

                             

                            I am glad to hear the OS prevents writing to that section, so it should not be possible for a bug in my code to corrupt the active image.

                             

                            I understand that there is a CRC check implemented as part of the OTA code (I implemented this myself in my OTA implementation).  It would still be useful to be able to read what is in the EEPROM later to detect a problem (in case the image gets changed despite the OS protection, perhaps due to a bug, crash, or hardware failure).

                             

                            Is there a way that I can read (but not write) the active image from the EEPROM?  How would I find out the EEPROM locations to read?
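If a read-only path to the EEPROM does exist, one low-risk way to use it would be to dump the image region as hex over the debug output and compare it offline against the build. The sketch below assumes such a path: `nv_read_fn` is a hypothetical hook standing in for whatever read call the SDK provides, and the start offset and length are left to the caller because the thread does not say where the active image lives. `format_hex_line` and `dump_region` are illustrative names, not SDK functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* HYPOTHETICAL read hook: copy 'len' bytes starting at 'offset' in
 * non-volatile storage into 'buf'; return 0 on success. Not a WICED API. */
typedef int (*nv_read_fn)(uint32_t offset, uint8_t *buf, size_t len);

/* Format 'n' bytes as "OFFSET: XX XX ..." into 'out'.
 * Returns the number of characters written, or -1 if 'out' is too small. */
int format_hex_line(uint32_t offset, const uint8_t *data, size_t n,
                    char *out, size_t out_size)
{
    int pos = snprintf(out, out_size, "%08lX:", (unsigned long)offset);
    if (pos < 0 || (size_t)pos >= out_size) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        size_t room = out_size - (size_t)pos;
        int w = snprintf(out + pos, room, " %02X", (unsigned)data[i]);
        if (w < 0 || (size_t)w >= room) {
            return -1;
        }
        pos += w;
    }
    return pos;
}

/* Read [start, start + length) 16 bytes at a time and pass each formatted
 * line to 'sink' (for example puts, or a debug-UART print routine).
 * Only reads are issued; nothing is ever written to storage. */
int dump_region(nv_read_fn read_nv, int (*sink)(const char *),
                uint32_t start, uint32_t length)
{
    uint8_t chunk[16];
    char line[64];

    for (uint32_t done = 0; done < length; ) {
        uint32_t left = length - done;
        size_t n = left < sizeof chunk ? (size_t)left : sizeof chunk;

        if (read_nv(start + done, chunk, n) != 0) {
            return -1;   /* read failed; stop rather than print garbage */
        }
        if (format_hex_line(start + done, chunk, n, line, sizeof line) < 0) {
            return -1;
        }
        sink(line);
        done += (uint32_t)n;
    }
    return 0;
}
```

Reading in small fixed-size chunks keeps RAM use low, which matters on a device that has already been suspected of a stack overflow.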