Symptoms
Users attempting to access the Practice Labs platform were unable to log in.
What went wrong
Our database cluster experienced a spike in memory usage, which caused the primary node to fail over to one of the secondary nodes. During this period the cluster became unresponsive.
Who was impacted
All users attempting to log in to the Practice Labs platform.
Why it went wrong
Memory exhaustion in our database cluster.
How did we fix it
We have upgraded the memory in all 4 nodes in our database cluster. One node is currently running with less RAM than we have allocated to it because of a minor hardware fault, which is being addressed by our maintenance provider.
With the additional RAM, our database cluster is now operating with reduced processing times, in some cases up to 60% faster. We have monitored the cluster closely for 7 days and are now comfortable that we can end the monitoring period.