When Facebook built its first company-owned data center in Prineville, Oregon, designing and managing the facility was only part of the challenge. In a blog post Monday, the company explained how it had to stress-test its entire software infrastructure by commandeering a giant cluster of production servers on the other side of the country.
The Oregon data center marked a change of tack for Facebook, which had relied exclusively on two leased facilities in Northern California and Virginia. The Prineville data center was the first to be designed and built from scratch especially for Facebook.
That it could afford to build its own data center at all shows how big Facebook has become. It also shows the pressure that fast-growing social-networking sites are under to keep outages to a minimum. Twitter is also moving into its own data center, citing a need for more control over its infrastructure.
Facebook had never tested its News Feed, search engine and ad network outside of the two-data-center configuration. The company needed to ensure that “our entire software stack would be able to evolve and work smoothly in the new region, without interrupting what our users do every day on Facebook,” Facebook’s Sanjeev Kumar wrote in the company’s engineering blog.
“The solution was to simulate a third region of data centers, even before the new servers in Prineville came online. We called this effort and the simulated third region ‘Project Triforce,’” he wrote.
Facebook took over a cluster of production servers in Virginia and configured them to look like its new “third region.” To do so, it built a software suite called Kobold that allowed it to “build up and tear down clusters quickly, conduct synthetic load and power tests without impacting user traffic, and audit our steps along the way,” Kumar wrote.
Kobold enabled Facebook to provision and image tens of thousands of servers, and bring them online, in less than 30 days.
“Production traffic was served within 60 days. Traditionally, companies turn up production traffic manually with many people over a period of weeks. Now it takes one person less than ten minutes to turn up production traffic,” Kumar wrote.
He didn’t say whether Kobold could be useful to other companies or whether Facebook might release the tools for others to use.