Production Stability Updates

Sri Ramalingam
3 min read · May 14, 2021

I would like to update everyone on where we are with the stability improvements our teams have been working on since our last production incident. Delivering happiness to our customers is our #1 priority, and our teams continue to work around the clock to deliver new modules and enhancements, reduce technical debt, and improve the quality of our deliverables and our service availability.

Production infrastructure audit and changes

We have completed a full audit of our WAF policies and changed the enforcement rules so that we do not automatically shut down services or interfaces even when there is an attack on our infrastructure. This was one of the root causes of an incident that brought our service down in the recent past. All of these changes have been tested and are enabled in all of our clusters.
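As a rough illustration of the intent of this change (the actual policies live in our WAF vendor's rule engine; the names and thresholds below are made up), the enforcement logic now degrades to throttling or alerting rather than taking a service or interface offline:

```python
from dataclasses import dataclass

@dataclass
class WafVerdict:
    action: str   # "allow", "throttle", or "flag_for_review"
    reason: str

# Hypothetical thresholds; real values are tuned per cluster.
SUSPICIOUS_RPS = 500
ATTACK_RPS = 2000

def evaluate_request_rate(source_ip: str, requests_per_second: int) -> WafVerdict:
    """Decide how to handle a traffic spike from a single source.

    No branch disables the service or interface itself; the worst case is
    throttling the offending source and paging the on-call engineer.
    """
    if requests_per_second >= ATTACK_RPS:
        return WafVerdict("throttle", f"{source_ip} exceeds attack threshold")
    if requests_per_second >= SUSPICIOUS_RPS:
        return WafVerdict("flag_for_review", f"{source_ip} looks suspicious")
    return WafVerdict("allow", "normal traffic")
```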

Chaos testing

We have started the practice of performing chaos engineering on our production clusters during maintenance windows. This is an important exercise that any SaaS platform should perform periodically: you have to expect that every one of your subsystems will go down at random times (and in random order) in production, and have contingency plans to either fail over gracefully or, in rare cases, bring the service back within a few minutes. Our team has brought down the data plane (Mongo, Timescale, etc.) completely, our Redis cluster, and some services in the control plane, and verified that we either failed over or maintained partial service availability for a few minutes until we transitioned to the backup. We will continue this exercise for the rest of the services and cycle through them every 90 days, along the lines of the loop sketched below.
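A minimal sketch of what one of these experiments looks like. The helper names (stop_component, check_health, restore_component) stand in for our internal tooling, and the 5-minute recovery budget is illustrative:

```python
import random
import time

# Subsystems exercised so far; the list grows as we cycle through more services.
SUBSYSTEMS = ["mongo", "timescale", "redis", "control-plane-service-a"]

RECOVERY_BUDGET_SECONDS = 5 * 60  # illustrative budget for failover or restore


def run_chaos_experiment(stop_component, check_health, restore_component):
    """Bring down one randomly chosen subsystem and verify we stay within budget."""
    target = random.choice(SUBSYSTEMS)
    print(f"Bringing down {target} during the maintenance window")
    stop_component(target)

    deadline = time.time() + RECOVERY_BUDGET_SECONDS
    while time.time() < deadline:
        status = check_health()  # e.g. "healthy", "degraded", or "down"
        if status in ("healthy", "degraded"):
            print(f"Failed over with status '{status}'")
            break
        time.sleep(10)
    else:
        print(f"{target} did not recover within budget; filing an incident")

    restore_component(target)
```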

Structural changes to the status pages

As explained in our previous engineering blogs, we have 3 clusters in production, and in most cases an incident in one cluster does not impact the other clusters. It's important to provide that visibility to our customers. We now expose the cluster information on the status page so you can find out which cluster you are in.

More changes are coming to enable you to subscribe to status updates for a specific cluster.

Quality enforcement

We now require multiple reviews of index changes, and more specifically of anything that could trigger a table scan, as part of the production release sign-off. In addition, we are adding more data sets to our QA environments; automation and replicating production-like data sets in QA help catch issues, but they are imperfect on their own. We continue to explore ways to find issues much earlier in the sprint cycle; the right place to contain them is before a code change is approved to be merged, as sketched below. We already have a stringent quality enforcement process in place.
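As one example of the kind of pre-merge gate we mean (a sketch only; the collection name, query, and CI wiring are hypothetical, though the check relies on MongoDB's standard explain output), a CI job can reject a query that the planner resolves with a full collection scan:

```python
from pymongo import MongoClient


def query_triggers_collection_scan(collection, query: dict) -> bool:
    """Return True if the winning plan for `query` is a full collection scan."""
    plan = collection.find(query).explain()
    winning = plan["queryPlanner"]["winningPlan"]

    # Walk the plan tree looking for a COLLSCAN stage.
    stages = [winning]
    while stages:
        stage = stages.pop()
        if stage.get("stage") == "COLLSCAN":
            return True
        if "inputStage" in stage:
            stages.append(stage["inputStage"])
        stages.extend(stage.get("inputStages", []))
    return False


if __name__ == "__main__":
    client = MongoClient("mongodb://localhost:27017")  # hypothetical QA instance
    users = client["qa"]["users"]
    if query_triggers_collection_scan(users, {"email": "a@example.com"}):
        raise SystemExit("Query would trigger a collection scan; add an index first.")
```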

Architectural changes

We are working on completely isolating the clusters, and that requires some architectural realignment. As explained earlier by Surya Bhagvat, there are two services that are shared across all the clusters: the gateway, which performs authentication, and our UI microservice. We are now working on isolating the UI service so that it is contained in each cluster; this will let us run different versions in each cluster if needed and also isolate issues to specific clusters. Once the UI service isolation is done, we will look into moving the gateway as well and introduce a global lookup service to find the home cluster for a user and route traffic to it, along the lines of the sketch below. These changes will allow us to scale horizontally while providing the needed fault isolation.
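To make the routing idea concrete (a sketch only; the lookup store, cluster names, and function names are hypothetical and not our actual gateway code), the global lookup service amounts to a user-to-home-cluster mapping consulted before forwarding each request:

```python
# Hypothetical mapping; in practice this would be backed by a replicated store.
HOME_CLUSTER_BY_USER = {
    "alice@example.com": "cluster-1",
    "bob@example.com": "cluster-2",
}

CLUSTER_BASE_URLS = {
    "cluster-1": "https://cluster-1.example.com",
    "cluster-2": "https://cluster-2.example.com",
    "cluster-3": "https://cluster-3.example.com",
}

DEFAULT_CLUSTER = "cluster-1"  # fallback for users not yet mapped


def resolve_home_cluster(user_id: str) -> str:
    """Global lookup: find the cluster that owns this user's data."""
    return HOME_CLUSTER_BY_USER.get(user_id, DEFAULT_CLUSTER)


def route_request(user_id: str, path: str) -> str:
    """Build the URL the gateway would forward this request to."""
    cluster = resolve_home_cluster(user_id)
    return f"{CLUSTER_BASE_URLS[cluster]}{path}"


print(route_request("alice@example.com", "/api/v1/projects"))
# -> https://cluster-1.example.com/api/v1/projects
```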

I will provide an update on the architectural changes in a few weeks once we complete them.

Sri Ramalingam

Sri is currently SVP of Engineering at the fast-growing software delivery startup Harness.io. He writes about engineering leadership and strategy.