June 7, 2011

How Do They Do It? Google's One-Server Trick

MERGE User Conference

Audience members at the 2011 Perforce User Conference listened intently as Dan Bloch, a senior site reliability engineer at Google, explained how they run the busiest single Perforce server on the planet – for 12,000 users – not to mention one of the largest repositories in any source control system.

This main server instance has been operating for more than 11 years, Bloch said, processing some 10 million submitted changelists. It holds 1 TB of metadata and fields 10-12 million commands and 10,000 submits a day. The server runs on a 16-core Linux box, with the metadata stored on solid-state disk. Although there are 10 smaller servers, the behemoth continues to handle 80% of the load.

A single large depot holds most of Google’s projects – a boon for agile development, said Bloch, as it lets engineers move freely among projects and share code. Google’s own code review tool, Mondrian, serves as a changelist dashboard.

The single largest performance hit for Bloch’s team is database locking – five years ago it could block the server for up to 10 minutes at a time. Today, that wait time has been cut drastically.

“We do monitoring with extreme prejudice, because any user can affect the performance of the server as a whole. We kill some commands, such as long-running read-only commands holding locks, or known bad commands. Some users type in p4 files //... and no good will come of that,” he said. However, “killing processes can corrupt your database. It’s only safe for read-only processes. Contact Perforce support before you do this,” Bloch cautioned.
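To make the idea concrete, here is a minimal sketch of how an administrator might flag long-running read-only commands for review. It is not Google's tooling. It assumes input shaped like the default output of `p4 monitor show -a` (process id, status, user, elapsed hh:mm:ss, then the command and its arguments). The `READ_ONLY` set, the 10-minute threshold, and the function names are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Commands treated as read-only in this sketch (an assumption; tune per site).
READ_ONLY = {"files", "fstat", "filelog", "changes", "dirs", "have", "describe"}


@dataclass
class Proc:
    pid: int
    status: str
    user: str
    seconds: int
    command: str
    args: str


def parse_monitor_line(line: str) -> Optional[Proc]:
    """Parse one line assumed to look like: '74612 R alice 00:12:30 files //...'."""
    parts = line.split(None, 4)
    if len(parts) < 5:
        return None
    pid, status, user, elapsed, rest = parts
    try:
        hours, minutes, secs = (int(x) for x in elapsed.split(":"))
        pid_num = int(pid)
    except ValueError:
        return None  # line is not in the assumed format; skip it
    command, _, args = rest.partition(" ")
    total = hours * 3600 + minutes * 60 + secs
    return Proc(pid_num, status, user, total, command, args.strip())


def kill_candidates(monitor_output: str, max_seconds: int = 600) -> List[Proc]:
    """Return running read-only commands that exceed max_seconds."""
    found = []
    for line in monitor_output.splitlines():
        proc = parse_monitor_line(line)
        if proc is None:
            continue
        if (proc.status == "R"
                and proc.command in READ_ONLY
                and proc.seconds > max_seconds):
            found.append(proc)
    return found


if __name__ == "__main__":
    sample = """\
 74612 R alice 00:12:30 files //...
 78143 R bob   00:00:01 filelog //depot/a.c
 80001 R carol 00:20:00 submit
"""
    for p in kill_candidates(sample):
        print(p.pid, p.user, p.command, p.args, p.seconds)
```

The sketch only reports candidates and never terminates anything. Given Bloch's warning, acting on the list is left as a deliberate, human-reviewed step, and write commands such as the `submit` above are never flagged.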

After going through an impressive array of performance tips and tricks sure to be useful to fellow Perforce administrators (all detailed in his whitepaper), Bloch gave a “shout out for Perforce service.” Each update of Perforce software, on its own, improves performance. “Just installing the new revision makes things faster.”

A good relationship is a two-way street, as the Google-Perforce partnership attests: “We couldn’t have done this five years ago. We didn’t know how to run a server of this size and neither did Perforce. We grew into it and Perforce worked with us. Your site is not going to look like this. Figure out your server usage patterns. Knowledge is power.”

Ultimately, he said, “Someone out there is trying to figure out a way to break your server. Do cleanups, know your server, and reduce the size of everything.”