|Abstract:||Soft errors are a growing concern for processor reliability. Recent work has motivated architecture level studies of soft errors, since the architecture level can mask many raw errors and architectural solutions can exploit workload knowledge. This dissertation focuses on the modeling and analysis of soft error issues at the architecture level.
We start with the widely used method for estimating the architecture level mean time to failure (MTTF) due to soft errors. The method first calculates the failure rate for an architecture level component as the product of its raw error rate and an architecture vulnerability factor (AVF). Next, the method calculates the system failure rate as the sum of the failure rates (SOFR) of all components, and the system MTTF as the reciprocal of this failure rate. Both steps make significant assumptions. We analyze the validity of the two steps using both mathematical analysis and experiments. We find that although the AVF+SOFR method is valid for most current systems under current raw error rates, for some cases it can lead to significant discrepancies. We explore scenarios in which such discrepancies could occur in practice.
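As a rough illustration of the two steps described above, the AVF+SOFR calculation can be sketched in a few lines of Python. The function names and the component values below are hypothetical and chosen only for the arithmetic; they are not taken from the dissertation or from any tool it describes.

```python
import math


def component_failure_rate(raw_error_rate, avf):
    """AVF step: derate a component's raw error rate by its AVF."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF must lie in [0, 1]")
    return raw_error_rate * avf


def system_mttf(components):
    """SOFR step: sum the component failure rates, then take the reciprocal.

    `components` is an iterable of (raw_error_rate, avf) pairs. The MTTF is
    returned in the time unit implied by the rates (e.g. hours if the rates
    are errors per hour).
    """
    total_rate = sum(component_failure_rate(r, a) for r, a in components)
    if total_rate == 0.0:
        return math.inf  # no unmasked errors, so no failure is expected
    return 1.0 / total_rate


# Hypothetical components: (raw errors per hour, AVF). Illustrative values only.
components = [(1e-4, 0.3), (2e-4, 0.1), (5e-5, 0.6)]
print(system_mttf(components))
```

The sketch simply encodes the method as stated, so it inherits the assumptions behind both steps that the dissertation goes on to examine.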
As an alternative that is not subject to these limitations, we propose SoftArch, a model and tool that does not make the AVF+SOFR assumptions. SoftArch is based on a probabilistic model of the error generation and propagation process in a processor. Our experiments show that SoftArch does not exhibit the discrepancies that the AVF+SOFR method suffers from. We apply SoftArch to an out-of-order processor running SPEC2000 benchmarks. Our results motivate selective and dynamic architecture level soft error protection schemes. Next, as another application, we quantify the impact of technology scaling on the processor soft error rate, taking architecture level masking and workload characteristics into consideration.
Using the SoftArch tool, we observe that there is much architecture level masking and that the degree of such masking can vary significantly across workloads, individual units, and workload phases. Thus, it is natural to consider architecture level solutions that take advantage of such variations. Doing so requires a reasonably accurate estimate of the amount of masking in real time. For most current systems, AVF is an accurate abstraction of the architecture level masking effect. Existing solutions for estimating AVF are often based on offline simulators and are usually hard to implement in real processors. In this dissertation, we propose a novel way of estimating AVF online, using simple modifications to the processor. Our method applies to both logic and storage structures on the processor and does not require complex offline calibration for different workloads. We test our method with a widely used simulator from industry for SPEC benchmarks. The results show that the method provides reasonably accurate run-time AVF estimates.
To sum up, this dissertation studies architecture level soft error modeling and analysis. It provides new techniques to examine and take advantage of architecture level soft error behavior. We apply our tool to investigate the impact of technology scaling on soft errors. We also propose an efficient online AVF estimation algorithm.