Your Functional Reliability Partner – IROC Technologies

SOCFIT™: Case Study Soft Error Analysis

THE PROBLEM

A large fabless company, specialized in designing chips for large routers and computer farms, is working on a new design in 28nm.

Its end customer issued a Soft Error performance specification: a list of the types of radiation-induced failures, each with its corresponding FIT (Failure In Time) rate. Each failure type can have a different impact on overall system reliability. Furthermore, each error can be mitigated at various levels: through software procedures, through system architecture optimization, or simply by improving the reliability performance of the chip itself.
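To make the FIT unit concrete: by the usual convention, 1 FIT is one failure per billion (10^9) device-hours. The short sketch below converts a FIT rate into a mean time between failures and into an expected failure count for a fleet. The function names and the example figures are illustrative assumptions and do not come from the customer specification.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_to_mtbf_hours(fit: float) -> float:
    """Mean time between failures (hours) for one device at a given FIT rate."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return HOURS_PER_FIT_UNIT / fit


def expected_failures(fit: float, devices: int, hours: float) -> float:
    """Expected number of failures for a fleet of devices over a period."""
    return fit * devices * hours / HOURS_PER_FIT_UNIT


if __name__ == "__main__":
    # Hypothetical fleet: 10,000 chips at 100 FIT each, running for one year.
    print(fit_to_mtbf_hours(100.0))
    print(expected_failures(100.0, devices=10_000, hours=8760))
```

A per-chip figure that looks small can still mean several failures per year across a large installation, which is why the specification assigns a FIT budget to each failure type.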

The fabless company's concern is to meet that reliability specification with its latest 28nm SoC design.

Traditionally, a post-silicon method (radiation testing) was used, but it is costly, difficult and slow to implement, and it happens too late to allow any design change. It is a solution for verification, but it cannot serve as a proactive process for optimizing the design.

Prediction by analysis and prevention are the most cost-effective approaches, and sometimes the only ones that designers can implement.

The fabless company can use the SOCFIT™ simulation platform to run such an analysis on a gate-level netlist or, even better, on the design's RTL description. Modifications are still possible at a manageable cost when the problem is dealt with before synthesis. But RTL is not the only input needed for an accurate analysis.

Soft Errors have three main sources: technology, design and application.

Optimizing the combination of the three is necessary to reach the most cost-effective solution. The application is sometimes the least flexible part, as it is the end customer's property and out of the fabless company's reach. The RTL is the design part of the input. The technology part is a database of the FIT (Failure In Time) rate of each individual cell (SRAM, FF, combinational logic) used in the design, for the particular environment in which the chip will operate. These are the intrinsic FIT values of the cells.

THE SOLUTION

SOCFIT results provide accurate and actionable data for the studied design and application. With this data set, architects and reliability engineers can explain and quantitatively document the design's performance to their customers far better than radiation testing allows. SOCFIT also provides mitigation recommendations and characterizations.

For the initial design and for every step of mitigation implementation, SOCFIT reports the total FIT rate (of the initial and of the protected design), the various derating factors and the list of major contributors to the overall FIT rate of the design.

Derating, also known as masking, reflects the fact that in a complex design, an upset or error in one cell might not cause a failure of the chip. It might just stay dormant, go unnoticed or simply be filtered out by the system.

Derating factors are separated into temporal, logic and functional derating. They apply to memories, to sequential logic such as FFs and registers, and to combinational logic.
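A common way to combine these quantities, used here only as an illustration and not as a description of SOCFIT's internal model, is to multiply the intrinsic FIT of a group of cells by each derating factor, where a factor is the fraction of upsets that is not masked (between 0 and 1). The class name, field names and numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CellGroup:
    """A group of cells with an intrinsic FIT and three derating factors.

    Each factor is the assumed fraction of upsets that survive that kind
    of masking (1.0 means no masking at all).
    """
    name: str
    raw_fit: float
    temporal: float = 1.0
    logic: float = 1.0
    functional: float = 1.0

    def __post_init__(self) -> None:
        for factor in (self.temporal, self.logic, self.functional):
            if not 0.0 <= factor <= 1.0:
                raise ValueError("derating factors must lie in [0, 1]")

    def effective_fit(self) -> float:
        """Intrinsic FIT scaled by all three derating factors."""
        return self.raw_fit * self.temporal * self.logic * self.functional


def total_fit(groups: list[CellGroup]) -> float:
    """Sum of the derated FIT of every group in the design."""
    return sum(g.effective_fit() for g in groups)


if __name__ == "__main__":
    design = [
        CellGroup("sram", raw_fit=800.0, functional=0.5),
        CellGroup("ff", raw_fit=100.0, temporal=0.5, logic=0.4, functional=0.5),
    ]
    print(total_fit(design))
```

The multiplicative form assumes the three masking effects are independent, which is a simplification.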

The first step of the analysis is to run a quick assessment of the FIT rate using a statistical approach. This fast but less accurate step provides an estimate of the FIT and a preliminary list of contributors and derating values. This first step is very important because it shows whether the design is above, close to or well below the specification. Furthermore, it prepares for the more detailed analysis, which involves a smart fault injection campaign and application derating on post-synthesis designs.
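The decision taken after this first step can be pictured as a simple three-way check of the estimate against the specification. The sketch below is an assumption about how such a check might look: the function name and the 20% margin that defines "close" are hypothetical and are not values given by SOCFIT.

```python
def triage(estimated_fit: float, spec_fit: float, margin: float = 0.2) -> str:
    """Classify a quick FIT estimate against the specified FIT budget.

    margin is the assumed relative band below the specification within
    which the estimate is too close to call without a detailed analysis.
    """
    if spec_fit <= 0:
        raise ValueError("specification must be positive")
    if not 0.0 <= margin < 1.0:
        raise ValueError("margin must lie in [0, 1)")
    if estimated_fit > spec_fit:
        return "above"
    if estimated_fit >= spec_fit * (1.0 - margin):
        return "close"
    return "well below"


if __name__ == "__main__":
    for estimate in (1200.0, 900.0, 500.0):
        print(estimate, triage(estimate, spec_fit=1000.0))
```

Only the "above" and "close" outcomes would call for the detailed fault injection analysis.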

If the first analysis shows that the FIT is above the acceptable threshold, or if the results need to be more accurate because they are too close to the specification, then fault injection simulating different workloads or applications is needed. Those familiar with fault injection techniques in design verification know this approach as time consuming, but SOCFIT uses smart algorithms to reduce the number of cases and the simulation time. The first step (the statistical approach) is instrumental in optimizing the more detailed analysis and drastically reducing its execution time.
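To show why plain fault injection is expensive, here is a brute-force sketch on a toy three-input logic cone. It flips every input bit in every input state and counts how often the output changes, which gives a logic derating estimate under the assumption that all input states are equally likely. The circuit is hypothetical, and this is only the naive baseline: SOCFIT's case-reduction algorithms are not described in this case study and are not reproduced here.

```python
from itertools import product


def circuit(a: int, b: int, c: int) -> int:
    """Toy logic cone (hypothetical): out = (a AND b) OR c."""
    return (a & b) | c


def logic_derating() -> float:
    """Fraction of single-bit input upsets that change the output."""
    injected = 0
    propagated = 0
    for state in product((0, 1), repeat=3):
        golden = circuit(*state)          # fault-free reference output
        for bit in range(3):
            faulty = list(state)
            faulty[bit] ^= 1              # inject a single bit flip
            injected += 1
            if circuit(*faulty) != golden:
                propagated += 1
    return propagated / injected


if __name__ == "__main__":
    print(logic_derating())
```

Even this toy needs 24 simulations for 3 bits. The count grows with the number of cells, states and workloads, which is why reducing the number of injected cases matters on a real design.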

Furthermore, once mitigation solutions are implemented, the next simulation step executes much faster, as only a marginal analysis is performed.

SOCFIT makes it easy for architects and designers at fabless companies to optimize mitigation. The exhaustive data reported point out the weaknesses of the design, making the choice of mitigation obvious: high-level software optimization, ECC on specific memory instances, or replacement of FF instances with more immune ones.

SOCFIT allows for a very quick re-analysis of the mitigated design to confirm the gain in Soft Error rate.
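The mitigation loop described above (rank the contributors, protect the worst one, re-evaluate the total) can be sketched as follows. Block names, FIT values and the residual factor left after mitigation are all hypothetical, and the code does not reflect SOCFIT's report format or API.

```python
def top_contributors(fit_by_block: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n blocks with the highest FIT, largest first."""
    ranked = sorted(fit_by_block.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def apply_mitigation(fit_by_block: dict[str, float], block: str, residual: float) -> dict[str, float]:
    """Return a copy where one block keeps only `residual` of its FIT.

    residual is the assumed fraction of failures that remain after the
    mitigation (for example after adding ECC to a memory instance).
    """
    if not 0.0 <= residual <= 1.0:
        raise ValueError("residual must lie in [0, 1]")
    mitigated = dict(fit_by_block)
    mitigated[block] = fit_by_block[block] * residual
    return mitigated


def improvement(before: dict[str, float], after: dict[str, float]) -> float:
    """Ratio of total FIT before mitigation to total FIT after it."""
    return sum(before.values()) / sum(after.values())


if __name__ == "__main__":
    baseline = {"sram_buffer": 600.0, "ff_control": 300.0, "comb_logic": 100.0}
    worst, _ = top_contributors(baseline, n=1)[0]
    protected = apply_mitigation(baseline, worst, residual=0.1)
    print(worst, sum(protected.values()), improvement(baseline, protected))
```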