Resilience versus Performance in Numerical Linear Algebra

01.03.2016 - 31.08.2020
Research funding project

Algorithmic fault tolerance is an important open problem for complex numerical algorithms. Substantial advances will have high impact in a wide spectrum of applications, from sensor networks and P2P networks to high-performance computing (HPC), since future HPC systems are expected to exhibit much higher fault rates than current systems do. It is an open question how much resilience can be achieved at the algorithmic level and how this influences sustained performance.
The REPEAL project investigates resilient/fault-tolerant parallel algorithms for a range of numerical linear algebra problems. The objectives are to design algorithms which provably produce accurate results (within the limitations of floating-point arithmetic) in the presence of faults, and to gain a better understanding of the resilience-performance trade-off. The focus will be on node failures and silent faults (bit flips), which are particularly difficult to handle at the algorithmic level.
We will address the following questions: How can existing approaches be improved to handle more general temporal and spatial fault distributions? What numerical accuracy and sustained performance do resilient algorithms achieve in real computations? What resilience improvements can be achieved by combining deterministic with randomized (gossip-based) approaches? What is the "price" of resilience, i.e., what slowdown has to be expected compared to non-resilient high-performance algorithms?
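As an illustration of what detecting a silent fault at the algorithmic level can look like, the following minimal sketch applies a classic checksum test to a matrix-vector product. It is not taken from the project; the helper names (flip_bit, matvec, column_sums, checksum_ok), the example matrix, and the tolerance are all assumptions chosen for illustration. The idea is that the sum of the entries of y = A x must equal the dot product of the column sums of A with x, so a bit flip that changes y noticeably breaks this identity.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def matvec(a, x):
    """Plain dense matrix-vector product y = A x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def column_sums(a):
    """Checksum row c = 1^T A, computed once before the product."""
    return [sum(col) for col in zip(*a)]


def checksum_ok(checksum_row, x, y, rel_tol=1e-9):
    """Check sum(y) against c . x up to a relative tolerance (assumed value)."""
    expected = sum(c * x_j for c, x_j in zip(checksum_row, x))
    actual = sum(y)
    scale = max(1.0, abs(expected), abs(actual))
    return abs(expected - actual) <= rel_tol * scale


# Hypothetical example data.
a = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]

checksum_row = column_sums(a)
y = matvec(a, x)
clean_ok = checksum_ok(checksum_row, x, y)

# Simulate a silent fault: flip the lowest exponent bit of one result entry.
y_faulty = list(y)
y_faulty[1] = flip_bit(y_faulty[1], 52)
faulty_ok = checksum_ok(checksum_row, x, y_faulty)

print("fault-free result passes check:", clean_ok)
print("corrupted result passes check:", faulty_ok)
```

The sketch also hints at the trade-off raised above: the check costs one extra dot product and one extra sum per product, and a flip in a low-order mantissa bit can fall below the tolerance and go undetected.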

People

Project leader

Institute

Grant funds

  • WWTF Wiener Wissenschafts-, Forschungs- und Technologiefonds (Vienna Science and Technology Fund), national; call identifier ICT15-113

Research focus

  • Distributed and Parallel Systems: 20%
  • Computer Science Foundations: 80%

External partner

  • Universität Wien

Publications