We experienced two outages this week with our Harness prod-1 cluster. The first outage occurred on 3/30 from 12:30 to 12:55 PST, lasting 25 minutes. The second occurred on 4/1 from 11:16 to 11:44 PST, lasting 28 minutes. We have identified the root causes, restored our services as quickly as possible, and implemented measures to ensure full service continuity if we encounter these issues again.
Harness production runs on three clusters (prod-1, prod-2 and prod-3), which are completely separated at the infrastructure level. All of our clusters have full failover and disaster recovery support at both the control and data planes. The first outage on 3/30 affected all three clusters; the second outage on 4/1 affected only our prod-1 cluster. The Harness production suite currently comprises 7 microservices and 2 database infrastructures. We use Redis for in-memory caching, and Redis runs in separate Kubernetes (K8s) pods within each cluster.
Issue #1: 3/30
We use the Deepfence application firewall to monitor ingress traffic coming into our K8s clusters. Each cluster has multiple pods running our microservices and connects to a set of Redis servers that provide caching. At 11:15 PST, Deepfence noticed an anomalous packet on the wire that matched the signature of a possible OpenSSL Heartbleed rule set. It flagged the packet, and our policy blocked the connection to Redis in Prod-Primary, shutting down ingress and outbound traffic to other services. We were immediately alerted through our monitoring tools, triggered our incident response protocol, and initiated failover to the secondary nodes. The failover took about 15 minutes; our team performed a quick verification and restored service.
Issue #2: 4/1
Our core services use TimescaleDB for some of our product features, such as custom dashboards, and TimescaleDB is hosted in a separate K8s pod within each cluster. Our application services run a health check protocol periodically to make sure the database is available for read and write operations. At 11:16 PST, our monitoring tools alerted that app.harness.io was not reachable. Our Site Reliability Engineers immediately investigated and isolated the issue to failing health checks against TimescaleDB. TimescaleDB itself was found to be healthy, so our team disabled the health checks and restarted our prod-1 pods, which restored service. As a remediation, we have provisioned secondary servers hosting TimescaleDB and enabled automatic failover to the backup if our applications experience health check failures. Our engineering teams have started investigating why the health check protocol suddenly failed.
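The remediated behavior can be sketched as follows: instead of a failing health check taking the service down, a threshold of consecutive failures against the primary triggers failover to the secondary. This is a minimal illustration, not our production code; the endpoint names, probe callable, and failure threshold are all hypothetical.

```python
FAILURE_THRESHOLD = 3  # consecutive failures before failing over (illustrative)


class HealthChecker:
    """Periodic DB health check with automatic failover to a backup endpoint.

    A hypothetical sketch of the remediation described above, not the
    actual Harness implementation.
    """

    def __init__(self, primary, secondary, probe):
        self.primary = primary
        self.secondary = secondary
        self.probe = probe          # callable(endpoint) -> bool
        self.active = primary
        self.failures = 0

    def check(self):
        """Run one health check cycle; fail over if the threshold is hit."""
        if self.probe(self.active):
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= FAILURE_THRESHOLD and self.active == self.primary:
            # Point traffic at the backup TimescaleDB pod instead of
            # treating the service as down.
            self.active = self.secondary
            self.failures = 0
        return False
```

For example, wiring this up with a probe that can only reach the secondary would leave `active` pointing at the secondary after three failed cycles, with service continuing against the backup.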
Next steps for resilience
- We have initiated a complete audit of all WAF policy rules across our clusters and introduced a time delay before policy enforcement. Our SRE team will manually analyze each alert, and policy enforcement will proceed only once we have ensured service continuity. This will be complete within the next 7 working days
- We will also identify all non-critical services and isolate them, so that infrastructure failures in those services do not interrupt core services. This requires some changes to our platform, which we will complete within the next 10 working days
- We are optimizing our recovery time to 15 minutes, within our committed SLAs, for any production incident that occurs despite our best efforts. I will be happy to share more as we identify and make the changes needed to reach this goal
We are also making changes to expose cluster information on the account pages and to publish SLAs per cluster and per service. These changes will be available in production by the end of April.