IROC sponsors and contributes to IEEE IRIC.
Radiation Effects in Solid State Drives
Enrico Costenaro, Dan Alexandrescu, Adrian Evans, Maximilien Glorieux, Olivier Lauzeral
Abstract
Solid state drives (SSDs) are used to store highly critical data in cloud and HPC applications. Silent data corruption (SDC) is a serious concern, and in large installations the risk is not negligible. Although data is protected by strong error correcting codes (ECC) while it resides in the flash memory, it is often vulnerable as it transits through the controller chip and buffers. There, the data may be subject to upsets caused by natural radiation (soft errors), which may result in a temporary failure (hang) or a permanent failure (brick) of the drive, or in silent errors (SDC). This presentation provides an overview of the threat that soft errors pose to SSDs. The first part of the presentation describes a standard accelerated radiation (neutron) test procedure designed to detect device hangs, bricks and silent errors. Actual radiation test results for several SSDs will be presented.
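As a rough illustration of how counts from an accelerated neutron test are commonly scaled to a field failure rate, the sketch below converts an observed event count and a delivered beam fluence into a FIT rate (failures per 10^9 device-hours). The function name, the event counts, the fluence, and the reference flux of 13 n/cm²/h (a commonly quoted sea-level value for neutrons above 10 MeV) are assumptions chosen for illustration; none of them are results from the presentation.

```python
# Hypothetical helper: scale accelerated neutron test counts to a field FIT rate.
# Assumed reference flux for sea level, neutrons with E > 10 MeV (n/cm^2/h).
REFERENCE_FLUX_N_PER_CM2_H = 13.0


def fit_from_beam_test(events, fluence_n_per_cm2, flux=REFERENCE_FLUX_N_PER_CM2_H):
    """Return failures per 1e9 device-hours for one event category."""
    if fluence_n_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    # Device cross-section: observed events per unit fluence (cm^2).
    cross_section_cm2 = events / fluence_n_per_cm2
    # Events per hour in the field, scaled to 1e9 hours.
    return cross_section_cm2 * flux * 1e9


# Made-up counts for the three event categories named in the abstract.
observed = {"hang": 4, "brick": 1, "sdc": 2}
fluence = 1.0e10  # n/cm^2 delivered during the run (made-up value)

fit_by_category = {name: fit_from_beam_test(n, fluence) for name, n in observed.items()}

for name, fit in fit_by_category.items():
    print(f"{name}: {fit:.2f} FIT")
```

Reporting the three categories separately keeps the distinction the abstract draws between recoverable hangs, permanent bricks and silent errors, since each has a different impact on a large installation.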
In the second part of the presentation, a methodology to analyze the effect of soft errors in memory controller chips is presented. This includes estimating the risk due to soft errors that occur in embedded SRAM and flip-flops. Participants will learn about the risk of soft errors as well as low-cost techniques for error mitigation.
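One simple way to picture such a risk estimate is a per-block error budget: each group of SRAM bits or flip-flops contributes its raw upset rate, scaled by a derating factor for the fraction of upsets that actually disturb the application. The sketch below is only an assumed, simplified model and not the methodology of the presentation; the class, the field names and all numbers are hypothetical, and protected cells are treated as fully mitigated for simplicity.

```python
from dataclasses import dataclass


@dataclass
class CellGroup:
    name: str
    cells: int                # number of SRAM bits or flip-flops
    raw_fit_per_mcell: float  # assumed raw FIT per 1e6 cells
    derating: float           # assumed fraction of upsets that become failures
    protected: bool           # True if covered by ECC or parity


def effective_fit(group):
    """Estimated contribution of one cell group to the chip failure rate."""
    if group.protected:
        return 0.0  # simplification: protected cells are assumed fully mitigated
    return group.cells / 1e6 * group.raw_fit_per_mcell * group.derating


# Made-up controller inventory, for illustration only.
groups = [
    CellGroup("data_buffer_sram", 8_000_000, 500.0, 0.10, protected=False),
    CellGroup("mapping_table_sram", 4_000_000, 500.0, 0.50, protected=True),
    CellGroup("control_flip_flops", 200_000, 300.0, 0.05, protected=False),
]

contributions = {g.name: effective_fit(g) for g in groups}
total_fit = sum(contributions.values())

for name, fit in contributions.items():
    print(f"{name}: {fit:.1f} FIT")
print(f"total: {total_fit:.1f} FIT")
```

A breakdown of this kind shows which unprotected blocks dominate the budget, which is where low-cost mitigation such as parity or ECC on a single buffer is likely to pay off first.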
An Industrial, Arbitrary-Level Reliability Analysis and Management Framework
Dan Alexandrescu, Adrian Evans, Enrico Costenaro, Maximilien Glorieux
Abstract
The dependence of our society on technology is irreversible, yet technology is intrinsically unreliable. For many applications, reliability, availability and trustability are key factors, requiring careful design to meet the end users’ expectations. The continuing evolution of technology allows increasingly complex electronic devices to be built, integrating more and more functions. Electronic systems, now ubiquitous, use tens or hundreds of complex microelectronic components, embedding large quantities of standard logic and memory. Moreover, these designs integrate IPs from multiple teams or providers and are implemented in advanced process technologies, making it challenging to evaluate their reliability. Initiatives such as RIIF (Reliability Information Interchange Format) allow the formalization, specification and modeling of extra-functional reliability properties for technology, circuits and systems. However, analysis is only the first step of the reliability-aware design process. Reliability optimization and management are required in many application fields, and reliability experts and tools must accompany the designers in devising the best protection methodology for the given circuit or application. We will present an early draft of specifications and suggestions for a high-level Reliability Architect Framework and Toolset (RAFT), with the ultimate goal of supporting arbitrarily grained systems. The platform facilitates reliability data exchange throughout the specification, design and manufacturing flow and supports intra-company and inter-company collaboration on reliability subjects. The proposed approach includes reliability data and models, methodologies and tools that allow system reliability exploration and optimization using mathematical models and high-level tools. It can be combined with performance management methodologies, with the aim of reducing the engineering effort devoted to reliability analysis and improvement.
The current paper is also meant as a call for contributions. The specification, development, deployment and future versions of the RAFT platform will only be possible with the help of academic and industry partners.