Behind the scenes: Yellow alerts at Cloud9
By Tim Robinson • 12 May 2015
Alert Systems + Priorities
Here at Cloud9, performance is our #1 priority. We’re all efficiency-addicted coders ourselves and understand how frustrating it can be when your primary development tool freezes or has issues. Even the smallest amount of lag sends us into fits of rage.
Over the course of 3 weeks this past February, we raised a yellow alert at Cloud9. While a red alert is generally well known in tech circles (the system is down and must be fixed immediately, even if it’s 3am on a Sunday), a 'yellow alert' is a fairly unknown term. A yellow alert is where we all focus completely on fixing one issue before moving on to any new features or improvements to the product. In this case, we wanted to tackle performance and stability issues before they became serious problems.
We’ve grown significantly over the last year and discovered our infrastructure wasn’t scaling as rapidly. Although we weren’t experiencing significant issues (still almost 100% uptime), we were seeing users become disconnected and experience lag far more frequently than we’d like.
What Gets Measured Gets Improved
The first step in our plan was building the ultimate metrics dashboard. We wanted one screen where we could see everything happening in real time and be immediately notified of issues.
Although we already had hundreds of graphs and monitors for server-side events in Datadog, we lacked both client-side data and a good overview dashboard of core performance metrics.
It started off simple enough and has evolved over time, showing us overall operational health at a glance. We’ve color-coded most items: red is bad, yellow is acceptable and green is good (fast).
As of now, our office has 2 (soon to be 4!) big 50” screens that show this dashboard along with a dashboard of our customer support section all day, every day. Even the non-coders can glance at this board during the day and understand how we are performing.
If Games Can Prevail, Then So Can We
One of the most challenging issues with working in the cloud is the latency between you and the server. But this doesn’t mean you’re going to have a bad time. If professional gamers can compete over the internet in tournaments that require millisecond response times, then we too can make Cloud9 a smooth and lag-free experience.
Our first task was figuring out which users have high ping times and why. To do this we started recording the pings of all our users every few minutes and saving both averages and ping spikes.
We added this by modifying the client-side code to ping the backend server every 10 seconds and record the time it takes to get a response. Then every 5 minutes the client sends all its pings to an endpoint we set up on the server. This endpoint collects the pings, calculates their average and max, and sends each data point off to Datadog.
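To make the server-side step concrete, here is a minimal sketch in Python of reducing one client's batch of pings to an average and a max. Our real code is JavaScript; the names `PingSummary` and `summarize_pings` are hypothetical, and sending the result on to Datadog is deliberately left out.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PingSummary:
    average: float  # mean round-trip time over the batch
    maximum: float  # worst round-trip time in the batch (the "spike")
    count: int      # number of pings the client reported


def summarize_pings(pings):
    """Reduce one client's batch of round-trip times to average and max."""
    if not pings:
        return None  # the client reported nothing for this interval
    return PingSummary(average=mean(pings), maximum=max(pings), count=len(pings))


summary = summarize_pings([20, 21, 19, 20])
print(summary)
```

With a ping every 10 seconds and a report every 5 minutes, a healthy client would send roughly 30 values per batch.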
We soon discovered that there were some issues with this method. On occasion, clients would get lag spikes, causing their pings to look something like this:
[20, 21, 19, 2452, 20]
Taking the average of these numbers wasn’t going to give us very good results: the single 2452 spike drags the average up to about 506, while every other ping is around 20. So we created a metric called pingNormalized that calculates the average, disregards all pings over 3 times that average, then calculates the average of what remains. This gives a much more accurate view of the actual latency users are experiencing.
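The two-pass calculation can be sketched in a few lines of Python. This is an illustration of the idea rather than our production code: the function name `ping_normalized` and the `spike_factor` parameter are assumptions, and it expects non-negative ping times.

```python
from statistics import mean


def ping_normalized(pings, spike_factor=3.0):
    """Average the pings, drop anything above spike_factor times that
    average, then average what is left."""
    if not pings:
        return None
    first_pass = mean(pings)
    # For non-negative values the smallest ping never exceeds the mean,
    # so at least one value always survives the filter.
    kept = [p for p in pings if p <= spike_factor * first_pass]
    return mean(kept)


pings = [20, 21, 19, 2452, 20]
print(mean(pings))             # plain average, dominated by the spike
print(ping_normalized(pings))  # spike discarded before averaging
```

For the sample above the first pass gives 506.4, so the cutoff is 1519.2; only the 2452 spike is dropped and the remaining four pings average out to 20.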
We then correlated these latencies with users’ IP addresses to create a map of every IP address and how much latency it was experiencing. Here's what it looks like with an hour of data:
With this map we can filter our logs by response time, time of day and date range, and see all users that match the criteria. In the future it will help us figure out where we need additional servers and datacenters.
Next we made sure that areas where you don’t need to talk to the server are as fast as possible. Using the editor, scrolling through search results and adjusting settings don’t need instant feedback from the server, so they are handled completely client side. Features such as the terminal have text prediction, so even if you have a crappy connection it appears as smooth as using a local terminal.
Finally, when running our stats on user locations we noticed some users in Europe had workspaces in the US and vice versa. We used to use latency-based routing, and on the odd occasion a user would get a lag spike to their closest datacenter, causing their workspace to be created in a region further away. We’re now running a script that scans the database for any user whose workspace is in the wrong region and moves it automatically. This will also ensure that when we add more datacenters in the future, your workspace is always as close to your location as possible.
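As a rough sketch of what such a scan does, the Python below compares each workspace's hosting region with the region closest to its owner and produces a list of moves. The record layout, the region labels and the function names are all hypothetical; the real script reads from our database and performs the migration itself.

```python
from dataclasses import dataclass


@dataclass
class Workspace:
    workspace_id: str
    owner_region: str   # region closest to the owner, e.g. "eu" or "us"
    hosted_region: str  # region where the workspace currently lives


def find_misplaced(workspaces):
    """Return workspaces hosted somewhere other than the owner's closest region."""
    return [w for w in workspaces if w.hosted_region != w.owner_region]


def plan_moves(workspaces):
    """Build (workspace_id, from_region, to_region) tuples for a migration job."""
    return [
        (w.workspace_id, w.hosted_region, w.owner_region)
        for w in find_misplaced(workspaces)
    ]


sample = [
    Workspace("ws-1", owner_region="eu", hosted_region="eu"),
    Workspace("ws-2", owner_region="eu", hosted_region="us"),
    Workspace("ws-3", owner_region="us", hosted_region="eu"),
]
print(plan_moves(sample))
```

Keeping the "find" step separate from the "move" step makes it easy to review the plan before anything is migrated.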
How Adding More Stats Killed Our Stats
Two weeks into the yellow alert we hit a major issue. Our ping times, disconnects and overall metrics were looking better and better, until suddenly, they weren’t… One Tuesday morning, upon returning to the office, we noticed almost all users’ pings were getting worse and disconnects were even more out of control than when we’d started.
This issue was made worse by the huge release we had done on Monday, with many commits, any of which could have broken the system. We eventually traced the issue to a change in the way workspace stats are calculated, which increased our database load and network activity by almost 400% and ended up slowing down almost every other part of the system.
We reverted this change as soon as it was discovered, and from the dashboard we could see ping times and disconnects drop significantly. Our metrics started to look better than ever.
The moral of the story is that we’re extremely thankful we had tracking and a dashboard in place before this issue happened. Otherwise it may have gone unnoticed for weeks or even months, and all our hard work would have been for naught.
Disconnects are Deal Breakers
Being disconnected or having your IDE lock up just as you’re getting into the flow can be as jarring as your cat deciding the start of your workday is the perfect time to jump on your keyboard and play with your mouse.
So of course this had to be fixed. We set to work, first identifying the causes of disconnects and second ensuring that when they happen they have as little impact as possible.
For example, you don’t need an internet connection to edit your files or change settings in Cloud9. So we now allow users to do these actions while disconnected and simply re-sync once they reconnect. We also made the disconnect box less intrusive and now allow users to reconnect manually instead of having to wait for the IDE to discover that it’s online again.
In finding causes of disconnects we discovered several of the libraries we were using were out of date. We also dove deep into engine.io and the other socket libraries to debug every disconnect reason (mostly heartbeat timeouts) and correlate them with users’ latencies, giving us a good overview of every user’s performance and why they were having issues. We then set about either fixing or investigating every disconnect issue as deeply as possible. By the end our disconnects were down to a third of what they used to be, and almost all remaining issues were caused by users’ flaky connections.
A Preview Faster than GitHub?
When rapidly iterating on a design, having a fast page preview is critical to working at maximum efficiency. So when we discovered some users were experiencing loading times of up to 5 seconds when using the Cloud9 preview tool, we knew something had to be fixed.
Two of my colleagues, Fabian and Matthijs, set to work on optimizing the entire preview flow. They had to rip out our old Node file server and replace it with nginx (a server far more optimized for serving static content), then measure and improve every single function call.
But we weren’t satisfied with just a faster preview system; we needed to know it was the best in its class. So we started comparing it to GitHub’s page preview speed and ensured ours was just as fast, if not faster! We now have our loading speed alongside GitHub’s on our office dashboard to ensure preview is always working at top speed.
From Idea to Work in < 10 Seconds
When you have an idea for a hot new project, the last thing you want to do is wait around for it to be created and loaded. So we optimized the entire project creation flow, making tasks run in parallel and cutting all unnecessary ones. Now you can get to work on a new idea in under 10 seconds without having to configure boilerplate, dependencies, and software. Plus, workspace loading is now faster than ever; you may find your project opens even faster than it does on your local machine. That’s the power of Cloud9 - we run the expensive hardware and do the heavy lifting so you're as efficient as you possibly can be.
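To show why running tasks in parallel matters, here is a small Python sketch using `asyncio.gather`. The step names and durations are invented for illustration and each step is just a timed sleep; the point is only that independent steps started together take about as long as the slowest one, not the sum of all of them.

```python
import asyncio
import time


async def step(name, seconds):
    """Stand-in for one independent setup task (hypothetical)."""
    await asyncio.sleep(seconds)
    return name


async def create_workspace_parallel():
    # All three steps start together; gather returns results in argument order.
    return await asyncio.gather(
        step("allocate-container", 0.2),
        step("clone-template", 0.2),
        step("install-dependencies", 0.2),
    )


def timed_create():
    """Run the parallel flow and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(create_workspace_parallel())
    return results, time.perf_counter() - start


results, elapsed = timed_create()
print(results, round(elapsed, 2))
```

Run one after another, these three steps would take about 0.6 seconds; started together they finish in roughly 0.2. This only works for steps that don't depend on each other's output.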
Ongoing Improvement is the Name of the Game
Since we’ve scaled up, other issues have developed at weak points in our infrastructure. One of these was file saves occasionally timing out, which we measured, tracked and improved. The second was event loop blocking on our Node.js servers. As you may know, Node.js runs JavaScript on a single thread, so if anything takes too long to process it locks up the entire server; finding these blocks was of critical importance. We’ve found and fixed almost all of them and will continue monitoring every day to ensure things are always getting faster and better for you. Having developers always working at peak efficiency is one of our biggest goals, and through this yellow alert and ongoing improvements we’re going to continue delivering on that promise.
We are in the business of helping you do your work better, faster, and with fewer distracting problems. We want to help you stay in your flow. You can rest assured that our commitment to giving you the best experience possible means that we are constantly solving issues like this behind the scenes.
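One common way to spot event loop blocking is to schedule a short timer over and over and measure how late it fires. The sketch below shows that idea using Python's asyncio loop as a stand-in for Node.js; it is not our monitoring code, and the function names, intervals and the deliberately bad `blocking_handler` are all made up for the demonstration.

```python
import asyncio
import time


async def measure_loop_lag(interval=0.05, samples=5):
    """Sleep for `interval` repeatedly and record how late each wake-up is."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags


async def blocking_handler():
    """A badly behaved handler that blocks the loop with synchronous work."""
    await asyncio.sleep(0.06)  # let the monitor get going first
    time.sleep(0.3)            # synchronous sleep stands in for CPU-heavy work


async def demo():
    monitor = asyncio.create_task(measure_loop_lag())
    await blocking_handler()
    return await monitor


def worst_lag():
    """Return the largest wake-up delay seen during the demo, in seconds."""
    return max(asyncio.run(demo()))


print(round(worst_lag(), 2))
```

While the handler blocks, the monitor's timer cannot fire, so one sample shows a delay of a few hundred milliseconds while the others stay near zero. Graphing the worst delay per interval makes these blocks easy to see on a dashboard.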