Zendesk is now 2-3 times faster thanks to our recent upgrade to Ruby 1.9.3!
As our application and traffic continue to grow, we here in Zendesk Engineering are continually identifying and removing performance bottlenecks.
Much to our dismay, we have reached a point where there are very few easy optimizations left within our codebase - the "win 90% from 10%" rule can be tricky to apply to Rails apps, which tend towards many, many small methods.
Much of our performance cost was due to Ruby interpreter overhead, and plenty more was garbage collection - our "main app" is over 80,000 LOC, and contains 170 libraries. Even on REE, Ruby 1.8 made us pay a pretty heavy penalty for having a codebase of this size.
A line in the sand
Every once in a while, one of us in engineering would get ambitious and take a stab at getting some part of our test suite running under 1.9. But we're quite busy, you know, actually building stuff in engineering, and the attempts were spaced out over a few weeks. Which is enough time to allow the rest of the team to add shiny, new, incompatible code. The Sisyphean nature of this would cause us to wander away and get coffee.
When we officially launched the 1.9 upgrade project, we needed a way to prevent our team from introducing new 1.9 incompatibilities while we worked on the old ones.
Here's what we came up with. We started with this simple patch in our test suite
and followed this process:
- Run the test-suite under 1.9, collect failures, and add mark_19_incompat at the head of any test file that doesn't pass under 1.9.
- Get a CI build passing under 1.9. Many parts of the code-base are still broken, but at least we know the entry points that cause failures, and can require the engineering team to keep both "legacy" (REE) and the 1.9 build green.
- The less glamorous but useful work begins: Start fixing and removing mark_19_incompat from tests.
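The original patch is not reproduced here. As a rough sketch of how a marker like mark_19_incompat could work, assume it raises a dedicated exception under 1.9 and that the test runner loads files through a small wrapper that rescues it. The names Ruby19Incompatible, ruby_19_or_later? and load_test_file are hypothetical and exist only for this illustration.

```ruby
require "rubygems" # for Gem::Version; already loaded by default on modern Rubies

# Hypothetical stand-in for the patch described above, not the original code.
class Ruby19Incompatible < StandardError; end

# True when the given interpreter version is 1.9 or newer.
def ruby_19_or_later?(version = RUBY_VERSION)
  Gem::Version.new(version) >= Gem::Version.new("1.9")
end

# Called at the head of a test file. Under 1.9+ it aborts loading the rest of
# the file by raising; under 1.8 it does nothing and the tests run as usual.
def mark_19_incompat(version = RUBY_VERSION)
  raise Ruby19Incompatible, "not yet 1.9 compatible" if ruby_19_or_later?(version)
end

# Assumed loader used by the test runner. Returns false for files that opted
# out, so the build stays green while the list of marked files shrinks.
def load_test_file(path)
  load path
  true
rescue Ruby19Incompatible
  warn "skipping #{path} (marked 1.9 incompatible)"
  false
end
```

With something like this in place, removing the marker from a file is all it takes to put that file back under the 1.9 build.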
Fixes and common problems
First off, we followed a bunch of the excellent advice from others who have been through the same ordeal, e.g. the Harvest 1.9.3 upgrade was very handy, and the mysql2 upgrade was absolutely crucial. We ended up sticking with syck as our YAML parser in order to remove YAML serialization from our critical path.
UTF-8 Encoding issues
Character encoding will definitely be the big line item when attempting a 1.9 upgrade. This class of errors ranged from simple one-liners, i.e. slapping # encoding: utf-8 at the top of the file, to more in-depth issues with Marshal.load and YAML-parsing - we'll get to those later. Luckily, being of Danish origin, we had a decent amount of UTF-8 in our test suite already - we can't recommend enough that you fill out your test fixtures with plenty of funky looking characters.
Array#to_s changed between 1.8 and 1.9, and this one bit us more times than we'd like to admit: 1.8 joins the elements, while 1.9 returns the inspect-style representation. There were quite a few places in our code-base where we called .to_s on a single-element array and expected the string back. These bugs often manifested themselves mysteriously. In retrospect, a better approach may have been to raise exceptions when calling incompatible methods.
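To make the difference concrete, here is a small illustration. The variable names are made up for the example; the portable fix is to say what you mean with join (or first) instead of relying on to_s.

```ruby
tags = ["urgent"]

# Ruby 1.8: Array#to_s behaved like join, so tags.to_s returned "urgent".
# Ruby 1.9 and later: Array#to_s is the same as inspect.
rendered = tags.to_s

# Explicit and identical on both versions.
label = tags.join
```

On 1.9 and later, rendered contains the brackets and quotes of the inspect output, which is how stray characters ended up in places that used to show a plain string.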
We ran some tests and found that Ruby 1.8 and 1.9 were bi-directionally compatible with regard to Marshal.dump/load, after adding this little monkey patch (utf-8, again!).
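The monkey patch itself is not shown here. One way to get a similar effect, sketched under the assumption that the problem is 1.8-written strings arriving in 1.9 without any encoding information, is to walk the loaded structure and re-tag binary strings as UTF-8 when their bytes are valid UTF-8. The helper name utf8_fixup is hypothetical.

```ruby
# Hypothetical helper, not the original monkey patch: re-tag binary strings
# as UTF-8 when their bytes form valid UTF-8, and leave everything else alone.
def utf8_fixup(obj)
  case obj
  when String
    return obj unless obj.encoding == Encoding::BINARY
    candidate = obj.dup.force_encoding(Encoding::UTF_8)
    candidate.valid_encoding? ? candidate : obj
  when Array
    obj.map { |item| utf8_fixup(item) }
  when Hash
    obj.each_with_object({}) { |(k, v), out| out[utf8_fixup(k)] = utf8_fixup(v) }
  else
    obj
  end
end

# Simulate a 1.8-era dump: the string is serialized without an encoding tag,
# so it comes back as ASCII-8BIT when loaded.
legacy_string = "bl\u00E5b\u00E6r".dup.force_encoding(Encoding::BINARY)
legacy_dump = Marshal.dump({ "subject" => legacy_string })

loaded = Marshal.load(legacy_dump)
fixed = utf8_fixup(loaded)
```

Strings that are not valid UTF-8 stay binary, so genuinely binary payloads are not mislabeled.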
Deep, deep YAML oddness
In 99% of cases, the YAML serialized to the database was easily readable and writable from both 1.8 and 1.9. We came across one very deep nook, though, where certain strings came back from YAML.parse with binary (known as ASCII-8BIT in ruby-land) encoding. Head scratching ensued. Eventually, we found this odd little nook in Ruby 1.8:
This caused certain UTF-8 heavy strings to be encoded as "BINARY" types in YAML. These types were then assigned incorrect encodings when read back in 1.9. The fix was simple enough, if obscure:
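The actual fix is not reproduced here. The sketch below only illustrates the symptom and one possible repair on a modern Ruby, where the YAML library is Psych rather than syck: a scalar tagged !binary loads as an ASCII-8BIT string, and a hypothetical helper (retag_binary_as_utf8) re-tags it when the bytes are valid UTF-8.

```ruby
require "yaml"

# Hypothetical helper, not the original fix: treat a binary string as UTF-8
# when its bytes are valid UTF-8, otherwise return it unchanged.
def retag_binary_as_utf8(value)
  return value unless value.is_a?(String) && value.encoding == Encoding::BINARY
  candidate = value.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : value
end

# Build a document the way a "UTF-8 heavy" string might have been stored:
# base64 payload behind a !binary tag.
payload = ["bl\u00E5b\u00E6r"].pack("m0")
legacy_yaml = "--- !binary |-\n  #{payload}\n"

loaded = YAML.load(legacy_yaml)       # decoded bytes, tagged ASCII-8BIT
fixed = retag_binary_as_utf8(loaded)  # same bytes, tagged UTF-8
```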
Forgive bad browsers
We were able to reduce encoding error noise quite a bit by falling back to encoding in Latin 1 (ISO-8859-1) when encountering invalid UTF-8 on the front-end. Since the majority of our problems are from users on older clients who are only attempting to browse pages, we have seen very few problems with this simple solution.
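A minimal sketch of that fallback, assuming the incoming value arrives as raw bytes. The method name forgiving_utf8 is made up for this example, and where it hooks into the request cycle is left out.

```ruby
# Hypothetical helper: accept the bytes as UTF-8 when they are valid,
# otherwise reinterpret them as Latin 1 and transcode to UTF-8.
def forgiving_utf8(raw)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Every byte sequence is valid ISO-8859-1, so this transcode cannot fail.
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

from_old_client = forgiving_utf8("caf\xE9".b)          # Latin 1 bytes
from_modern_client = forgiving_utf8("caf\u00E9".b)     # UTF-8 bytes
```

Both calls yield the same UTF-8 string. The trade-off is that invalid input which was never Latin 1 comes out as mojibake instead of an error, which fits the goal here of reducing noise rather than repairing the data.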
Productionizing the thing
Our test suite was passing. Our Soviet-bloc QA team said "da". We still needed to battle-test the thing. There were a few routes we could have taken here. A smaller start-up might have simply thrown a Hail Mary at the upgrade, cutting all the servers over to 1.9 and scrambling to fix the errors. A larger company could have probably dedicated a cluster of servers and mirrored traffic to it, collecting and fixing the errors it found. We found ourselves in a somewhat unfortunate middle ground, so we upgraded a single app server and fed it traffic for 15 minutes, collecting all the errors returned.
Our rollout phase lasted roughly a week. Within a few days we were running half of our unicorns on 1.9. This allowed us to keep on top of the list of problems, and users were able to successfully retry requests that failed due to 1.9 issues.
We were surprised by the raw efficiency of 1.9 - our servers, running 1.8, ran pretty hot through the course of the day, with big spikes (likely due to garbage collection). The upgrade eased off our load troubles and gave us some scale-room on the frontend.
We've been in the process of splitting our main application into smaller components. We're also working on upgrading to Rails 3, and hope this will allow us to accelerate the process. Additionally, we'll be keeping an eye on projects like JRuby and Sidekiq for multi-threaded processing.