Viz.ai uses AI to analyze CT scans of stroke patients so that strokes can be detected and treated faster. This is a performance-critical mission: any failed processing may harm patients (a 1-minute delay costs the patient about a week of healthy life). We deploy our deep learning algorithms on AWS GPU instances and use dynamic allocation to control costs as traffic changes throughout the day. Ensuring the health of every AWS GPU instance is therefore critical.
In this postmortem we tell the story of how we detected faulty GPU hardware on an AWS instance, and how we now automatically detect and mitigate such GPU hardware failures to ensure 100% uptime.