Say what you want about Facebook as a social network, but it makes for a fascinating case study in systems design and programming. And rather than keep its latest bit of tinkering close to the vest, the company chose to submit it back upstream.
The problem this time around wasn't the speed of the PHP engine used to build Facebook's pages. (The company's solution to that issue -- the HipHop PHP virtual machine -- was intriguing enough that even the creator of PHP gave it the thumbs-up.) Rather, Facebook's team has been dealing with an even more basic problem: how to manage the colossal codebase used to drive the site, estimated at some 60 million lines of code.
The solution, as outlined in a blog post by Durham Goode and Siddharth Agarwal, was to choose the Mercurial source control system and make selective improvements to it. Last year, the company investigated making changes to Git -- its choice of source control technology at the time -- but claimed that "after much deliberation, we concluded that Git's internals would be difficult to work with for an ambitious scaling project." Mercurial was, to them, "deeply extensible," and the changes the company has made involved building several extensions that have since been offered as open source items.
Among the changes:
Allowing Mercurial to work only with changed files. Facebook's own Watchman file-monitoring service (open-sourced earlier this year) was used to drive this, with Mercurial experiencing a fivefold increase in speed when dealing with file-status changes.
Rewriting some parts of Mercurial in native code for speed. Mercurial is written in Python, but some of its low-level code is already written in C for performance. Facebook's engineers added C code to handle certain tight inner loops, a common case for choosing C over Python.
Changing the way clone and pull operations work. Instead of downloading all files on a clone or pull operation, Facebook's revisions, via the remotefilelog extension to Mercurial, only download metadata for the files in question, then download the files themselves when they're actually needed. This not only reduces disk I/O, but network I/O as well.
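To make the first and third changes concrete, here is a minimal Python sketch of the two ideas: checking status only for paths a file watcher reports as changed, and cloning metadata first while fetching file contents on demand. It is not the code or API of Watchman, Mercurial, or remotefilelog; ToyServer, LazyClone, and status are hypothetical names invented for illustration, and the "server" is just an in-memory dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Hypothetical stand-in for a central repository server."""
    files: dict[str, bytes]
    content_requests: int = 0  # counts how often file contents are downloaded

    def list_metadata(self) -> dict[str, int]:
        # Metadata only: path -> size in bytes, no file contents.
        return {path: len(data) for path, data in self.files.items()}

    def fetch(self, path: str) -> bytes:
        self.content_requests += 1
        return self.files[path]


@dataclass
class LazyClone:
    """Hypothetical clone that holds metadata up front and fetches contents lazily."""
    server: ToyServer
    metadata: dict[str, int] = field(default_factory=dict)
    cache: dict[str, bytes] = field(default_factory=dict)

    def clone(self) -> None:
        # A clone transfers metadata only; no file contents move yet.
        self.metadata = self.server.list_metadata()

    def read(self, path: str) -> bytes:
        if path not in self.metadata:
            raise FileNotFoundError(path)
        if path not in self.cache:
            # First access: download the contents and keep a local copy.
            self.cache[path] = self.server.fetch(path)
        return self.cache[path]


def status(changed_paths, tracked):
    """Report only the tracked files a watcher flagged, without scanning the whole tree."""
    return sorted(p for p in changed_paths if p in tracked)
```

In this sketch, calling clone() costs one metadata request no matter how many files the repository holds, and each file's contents cross the network at most once, on first read. The status helper assumes some external watcher supplies the set of changed paths, so its cost scales with the number of changes rather than the size of the codebase.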
That said, Facebook's issues with Git have scarcely devalued Git's utility for the rest of the developer community. Back at the beginning of last year, Microsoft doubled down on Git and added support for it to Visual Studio and Team Foundation Server. GitHub itself has slotted in one new feature after another, with visualizing project statistics being the most recent.
Facebook also has released many of its other innovations as open source products: RocksDB (available on GitHub); the aforementioned HipHop PHP VM; the Presto SQL query engine (since adopted by outfits like Airbnb and Dropbox); the Corona scheduler for Hadoop; Flashcache, a caching system designed specifically to work with flash disks to lengthen their lifetimes; and so on. It also makes use of open source, from the PHP language itself to the configuration manager Chef (to which it's also made additions). But the company's still rather secretive about the size of its server farm.