
Fault Injection Study on the AMD Evergreen family of GPUs

With their numerous processing cores and impressive parallel processing capabilities, Graphics Processing Units (GPUs) have become the accelerator of choice across multiple domains, from scientific computing, bioinformatics, and molecular biology to financial applications. Their presence in the top supercomputers has grown steadily over the last few years. With technology scaling, soft errors, or single-event upsets (changes of state in a device that may lead to wrong program outputs), are becoming a high priority for designers. Recently, a Department of Energy study identified soft errors as one of the top 10 challenges to exascale computing, so solutions to the soft-error problem need to be developed now. A key aspect of reliability studies is that some soft errors do not cause an error at the output of a program. Therefore, an important first step in tackling soft errors in GPUs is to assess their impact and the robustness of GPUs in their presence. In this work, we present an error injection study on the AMD Evergreen family of GPUs using a detailed architectural simulator. Our results show that the GPU can be a highly resilient system. We also present a study of observed trends in the vulnerability of GPU programs and the GPU memory hierarchy. These trends can be used by programmers as well as system designers when making decisions about GPU reliability.
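
The observation that some soft errors never reach the program output can be illustrated with a toy software-level injection campaign. The sketch below is not the simulator-based methodology used in this work; it is a minimal illustration that assumes a single-bit-flip fault model on 32-bit floating-point inputs and a hypothetical thresholding workload. The names `flip_bit`, `kernel`, and `run_campaign` are invented for this example.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a value stored as a 32-bit float."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def kernel(data):
    """Hypothetical workload: classify each element against a fixed threshold."""
    return [1 if x > 0.5 else 0 for x in data]


def run_campaign(data, trials, seed=0):
    """Inject one random bit flip per trial and compare against the fault-free run."""
    rng = random.Random(seed)
    golden = kernel(data)  # fault-free reference output
    counts = {"masked": 0, "corrupted": 0}
    for _ in range(trials):
        faulty = list(data)
        index = rng.randrange(len(faulty))
        bit = rng.randrange(32)
        faulty[index] = flip_bit(faulty[index], bit)
        # A fault is "masked" when the program output is unchanged.
        outcome = "masked" if kernel(faulty) == golden else "corrupted"
        counts[outcome] += 1
    return counts


if __name__ == "__main__":
    print(run_campaign([0.25, 0.75, 0.125, 0.875], trials=1000))
```

In this toy setting, a flip in a low-order mantissa bit leaves the thresholded output unchanged and is counted as masked, while a flip in the sign bit of a value above the threshold changes the output. A real study would inject faults into architectural state such as registers and memories rather than into program inputs.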
