Question: What are the different ways to mitigate soft errors in Asynchronous SRAMs?
The following methods are commonly used to mitigate soft errors:
- Changes in process technology and cell layout of the SRAM
- Chip design and architecture changes in the SRAM
- System-level design changes outside the SRAM
Changes in process technology and cell layout
A high-energy particle incident on an SRAM cell generates charge in the form of electron-hole pairs. The electric field in the depletion region causes this charge to be collected by the junction of the transistor, which disturbs the current in the affected MOS structure. The restoring transistor tries to counteract this disturbance. However, because the restoring transistor has finite current drive and channel conductance, a voltage disturbance appears at its drain, and this can result in an upset. QCRIT is defined as the minimum charge collected from a particle strike that can cause a soft error. A system with a high QCRIT is less vulnerable to soft errors.
Figure 1. Interaction of a High-Energy Particle on an SRAM Cell
A higher QCRIT can be achieved in one of two ways. You can increase the junction capacitance, which requires larger transistor geometries, or you can increase the saturation current (by lowering the PMOS VT), which in turn results in higher leakage. Process technology and cell layout mitigation techniques therefore come at a cost and are not always feasible.
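The two levers above can be illustrated with a rough calculation. The sketch below assumes a commonly used first-order approximation, QCRIT ≈ C_node × VDD + I_restore × t_flip, which is not stated in this article. The function name, parameter names, and all numeric values are hypothetical and chosen only to show the trend.

```python
def critical_charge(c_node_farads, vdd_volts, i_restore_amps, t_flip_seconds):
    """First-order QCRIT estimate (assumed model, not from this article).

    c_node_farads  : capacitance of the struck storage node
    vdd_volts      : supply voltage
    i_restore_amps : drive current of the restoring transistor
    t_flip_seconds : time the cell needs to flip state
    """
    return c_node_farads * vdd_volts + i_restore_amps * t_flip_seconds


# Illustrative values only; they do not describe any real process node.
baseline = critical_charge(1e-15, 1.0, 20e-6, 10e-12)
bigger_cap = critical_charge(2e-15, 1.0, 20e-6, 10e-12)       # larger junction capacitance
stronger_drive = critical_charge(1e-15, 1.0, 40e-6, 10e-12)   # higher saturation current

for label, q in [("baseline", baseline),
                 ("bigger_cap", bigger_cap),
                 ("stronger_drive", stronger_drive)]:
    print(f"{label}: {q * 1e15:.2f} fC")
```

Under this assumed model, both changes raise the estimated QCRIT, at the cost of area in the first case and leakage in the second.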
Changes in chip design and architecture
Architectural enhancements, such as embedded error correcting code (ECC) and bit-interleaving, can be used to limit the effects of soft errors on memory devices.
- Error correcting code (ECC): ECC schemes can be used to detect and correct soft errors. During a write operation, the error correction algorithm adds parity bits to each data word. During a read operation, the ECC scheme checks the data and parity bits to detect errors at the accessed memory location. The parity bits require additional memory cells for storage, and computing and checking them during writes and reads can increase the access time.
- Bit-Interleaving: The collision of a high-energy particle with semiconductor atoms may affect multiple cells. A multi-bit upset (MBU) occurs when a single energetic particle affects two or more bits in the same word. Bit-interleaving arranges bit lines such that physically adjacent bit lines are mapped to different word registers. The bit-interleave distance is the physical separation between two consecutive bits mapped to the same word register. If the bit-interleave distance is greater than the spread of a multi-cell hit, the hit results in single-bit upsets (SBUs) in multiple words instead of an MBU in a single word. In a bit-interleaved memory, a single-bit error correction algorithm can therefore detect and correct each of these errors. Figures 2 and 3 illustrate an MBU occurrence and the effect of interleaving.
The typical bit-interleave distance depends on the process technology. Accelerated neutron testing, followed by a physical MBU analysis, is performed to determine the safe interleaving distance for each process technology node.
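The effect of the bit-interleave distance can be sketched with a small model of one memory row. The column-to-word mapping below is an assumption made for illustration (consecutive bits of a word are placed `interleave_distance` columns apart); real devices may map columns differently, and the function names are hypothetical.

```python
def word_and_bit(column, interleave_distance):
    """Map a physical column to (word index, bit index) within one row.

    Assumed layout: neighbouring columns belong to different words, and
    consecutive bits of the same word sit interleave_distance columns apart.
    A distance of 1 means no interleaving (the whole row is one word).
    """
    return column % interleave_distance, column // interleave_distance


def upsets_per_word(hit_columns, interleave_distance):
    """Count how many upset bits land in each logical word."""
    counts = {}
    for column in hit_columns:
        word, _bit = word_and_bit(column, interleave_distance)
        counts[word] = counts.get(word, 0) + 1
    return counts


# One particle strike upsets three physically adjacent cells.
hit = [8, 9, 10]

no_interleave = upsets_per_word(hit, 1)  # all three upsets fall in one word (MBU)
interleaved = upsets_per_word(hit, 4)    # one upset per word (three SBUs)

print(no_interleave)
print(interleaved)
```

In this model, a distance of 4 exceeds the 3-cell spread of the hit, so no word receives more than one upset and a single-bit correction scheme is sufficient for each word.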
Figure 2. Non-Interleaved Memory–Physical Multiple-Cell Upset Resulting in an MBU in a Single Word
Figure 3. Interleaved Memory Array–Avoid MBU by Spreading the Data Word
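As a minimal sketch of how parity bits allow a single-bit error to be located and corrected, the example below uses a Hamming(7,4) code. This code is chosen only for illustration; the article does not specify which ECC algorithm an SRAM device uses, and the function names are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode a 4-bit value into a 7-bit codeword (list index 0 = position 1)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = d     # data bits
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(codeword):
    """Return (corrected 4-bit value, syndrome); syndrome 0 means no error seen."""
    code = [0] + list(codeword)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position               # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1                    # flip the bit the syndrome points to
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d)), syndrome


stored = hamming74_encode(0b1011)   # write: data plus parity bits
stored[4] ^= 1                      # soft error flips the bit at position 5
value, syndrome = hamming74_decode(stored)  # read: detect and correct
print(bin(value), syndrome)
```

A single-error-correcting code like this one cannot repair two flipped bits in the same word, which is why keeping each upset in a separate word matters.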
System-level design changes
At the system level, soft errors can be mitigated in the following ways:
- Implementing external ECC in hardware
- Implementing external ECC in software
- Employing a triple modular redundancy scheme to increase system reliability. In this technique, the data from three SRAM devices is read simultaneously and the outputs are fed into a majority voter, which returns the value read from at least two of the three SRAM devices.
While simple to implement, system-level mitigation using the above schemes requires more board area, costs more, and incurs performance penalties (delay from the processing overhead of software ECC or a triple modular redundancy scheme).
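The majority voting used in triple modular redundancy can be sketched as a bitwise 2-of-3 function. This is a software illustration only; in a real system the voter may be implemented in hardware, and the function name and data values are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit takes the value held by at
    least two of the three input words."""
    return (a & b) | (a & c) | (b & c)


good = 0xA5                 # value originally written to all three SRAMs
upset = good ^ 0x10         # one device returns the word with a flipped bit

voted = majority_vote(good, upset, good)
print(hex(voted))
```

Because the vote is taken bit by bit, the correct word is also recovered when two devices each have an upset, as long as the upsets are in different bit positions.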
For more information, refer to the following KBAs: