Wednesday, October 5 • 9:00am - 9:50am
Orchestrated Chaos: Applying Failure Testing Research at Scale.

Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. To build confidence that fault-tolerant systems are correctly implemented, an increasing number of large-scale sites practice Chaos Engineering, running regular failure drills in which faults are deliberately injected into their production systems. While fault injection infrastructures are becoming relatively mature, existing approaches either explore the space of potential failures randomly or exploit the “hunches” of domain experts to guide the search, because the combinatorial space of failure scenarios is too large to search exhaustively. Random strategies waste resources testing “uninteresting” faults, while programmer-guided approaches are only as good as the programmer's intuition and scale only with human effort.
In this talk, I will present intuition, experience and research directions related to lineage-driven fault injection (LDFI), a novel approach to automating failure testing. LDFI uses existing tracing or logging infrastructures to work backwards from good outcomes, identifying redundant computations that allow it to aggressively prune the space of faults that must be explored via fault injection. I will describe LDFI’s theoretical roots in the database research notion of provenance, share results from the lab as well as the field, and issue a call to arms for the reliability community to improve our understanding of when and how our fault-tolerant systems actually tolerate faults.
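
The pruning idea in the abstract can be illustrated with a small sketch. This is not the LDFI implementation; it assumes that tracing has already been reduced to a list of "supports", where each support is a set of components that together were sufficient to produce the good outcome. Under that assumption, the only fault combinations worth injecting are those that break every known support at once. The function name `minimal_fault_sets`, the `max_faults` bound, and the component names are all hypothetical, and the brute-force search stands in for whatever solver a real system would use.

```python
from itertools import combinations


def minimal_fault_sets(supports, max_faults):
    """Return minimal sets of components whose joint failure breaks every support.

    `supports` is a list of sets; each set names components that together
    sufficed for the good outcome (an assumed, simplified view of lineage).
    """
    components = sorted(set().union(*supports))
    found = []
    for size in range(1, max_faults + 1):
        for candidate in combinations(components, size):
            cand = set(candidate)
            # A superset of an already-found fault set is not minimal.
            if any(prev <= cand for prev in found):
                continue
            # Worth testing only if it removes a component from every support.
            if all(cand & support for support in supports):
                found.append(cand)
    return found


# Hypothetical lineage: the outcome was reached via either replica,
# and both paths went through the same broker.
supports = [{"replica_a", "broker"}, {"replica_b", "broker"}]
hypotheses = minimal_fault_sets(supports, max_faults=2)
print(hypotheses)
```

With three components and at most two simultaneous faults, exhaustive testing would try six combinations, while the sketch proposes only two: failing the broker alone, or failing both replicas together. Single-replica failures are skipped because the redundant path already explains why the outcome would survive them.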

Peter Alvaro

Assistant Professor, UC Santa Cruz
Peter Alvaro is an Assistant Professor of Computer Science at the University of California Santa Cruz. His research focuses on using data-centric languages and analysis techniques to build and reason about data-intensive distributed systems, in order to make them scalable, predictable and robust to the failures and nondeterminism endemic to large-scale distribution. Peter is the creator of the Dedalus language and co-creator of the Bloom...

Wednesday October 5, 2016 9:00am - 9:50am
Zilker Ballroom 3+4
