We haven’t lived up to our reliability expectations in recent weeks. We know that we haven’t provided the quality of service that you, our customers, have come to expect, and our European customers in particular have been disproportionately impacted.
Three separate and unrelated incidents resulting in customer downtime in the same week is a major concern. We wanted to give you some insight into what has happened and what we will be doing from this point forward. Firstly, here’s a summary of the incidents:
- On Monday, September 17, we had our first service disruption. It was caused by excessive query load after the optimizer began choosing an inefficient plan for a query that had previously performed well. The net impact was a 90-minute period in which the 15% of our customers hosted on this database saw serious performance degradation.
- On Tuesday, September 18, we had a second service disruption. This was another database-related disruption, but it stemmed from a completely different root cause: our ongoing background database reorganization. It impacted about 20% of our European customers for up to 5 hours.
- On Wednesday, September 19, we had a third service disruption. This again had a completely different cause: a VMware kernel failure that halted our shared file server. This in turn caused our front-end Web servers to fail and impacted 80% of our customers for 30 minutes.
Zendesk is experiencing continued rapid growth. This creates a constant need for us to expand our capacity. To address this we have been building out our platform and infrastructure to better serve our expanding global customer base. We’ve built out along three dimensions:
- Adding capacity: We’ve increased the number of active servers that host our site by more than 50% in the past 2 quarters, including an 80% increase in the number of databases with production load.
- Bringing on new datacenters: In the past 6 months we’ve brought a new datacenter online, which now hosts a growing share of our load. This is part of our long-term strategy of geographic diversity and multi-datacenter customer data storage.
- Improving infrastructure: We’ve re-architected bottlenecks and removed single points of failure.
Organizationally, we have been growing our Operations team and improving our core processes around incident response, escalation, security, and documentation.
Now our challenge is simple: We need to do even better.
There are many lessons we can draw here. For each of the individual failures, we’ve completed a separate postmortem and identified process improvements and hardware or software changes. However, it’s also helpful to take a step back and ask: what does all this mean in aggregate? What we’ve realized over the past 3 days is that we have many tasks going on simultaneously, and that has distracted us from our core mission.
So for us it’s back to basics. We are refocusing our efforts on our core projects: reliability, capacity, and redundancy. We will reallocate our resources to make sure that we complete these projects carefully but also in a timely fashion. We will defer those projects that are not core to our mission of delivering the highest possible quality of service every hour of every day.
Finally, we want to say thank you to you, our customers, for standing beside us as we perform this work. We take great pride in Zendesk’s reliability and feel personally responsible whenever there is a disruption. It has always been our goal to be industry leaders in quality and reliability, and we will work tirelessly both to earn that status and to regain your trust in us. We thank you for your support and understanding.
If you have any questions about the downtime, please feel free to get in touch with me at firstname.lastname@example.org