Jason Yee is Director of Advocacy at Gremlin, where he helps companies build more resilient systems by learning from how they fail. He also leads the internal Chaos Engineering practices to make Gremlin more reliable. Previously, he worked at Datadog, O’Reilly Media, and MongoDB.
Outside of work, he enjoys drinking whiskey, cooking everything in a waffle iron, and making craft chocolate.
As software systems become more distributed and complex, failure is inevitable. But catastrophic bugs and system outages can be excellent learning opportunities to build more resilient applications. This session will provide methods and techniques for gathering information and effectively using that information to avoid and mitigate failure in the future.
I’ll cover best practices for gathering systems-related data, including monitoring and logging. This session will also cover practices for gathering and recording people-related data, including methods we can adopt from police, accident investigators, and other safety management professions to learn the most from incidents.