January 20, 2012

The RAMCloud Project is Intriguing

I've been fascinated by the RAMCloud project being run by a team led by Professor John Ousterhout at Stanford University. The project has been around for several years (here's the timeline), but it has been gathering a lot of attention and interest over the last six months.

The basic idea of the project is to investigate the implications of a single "what if" question:

What if, instead of storing all your data on disk, and using memory as a cache for the most frequently-used data, you instead stored all your data in memory, and used disk storage only as a backup medium for your memory?

The idea is explained in more detail in a long but extremely clear and readable paper published by the team a few years ago: The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM. The paper defines the core proposal as follows:

A RAMCloud stores all of its information in the main memories of commodity servers, using hundreds or thousands of such servers to create a large-scale storage system. Because all data is in DRAM at all times, a RAMCloud can provide 100-1000x lower latency than disk-based systems and 100-1000x greater throughput. Although the individual memories are volatile, a RAMCloud can use replication and backup techniques to provide data durability and availability equivalent to disk-based systems.
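To make the proposal concrete, here is a deliberately tiny sketch of the idea in Python (my own illustration, not RAMCloud's actual API or code): every read is served straight from a DRAM-resident table, while every write is appended to backup logs before it is acknowledged. The MiniRamStore class and the local backup files are stand-ins for what would really be the disks of several remote backup servers.

    import json

    class MiniRamStore:
        """Toy illustration of the RAMCloud idea: all data lives in
        memory, and disk is used only as a durable backup log,
        never for serving reads."""

        def __init__(self, backup_paths):
            self.data = {}  # primary copy: entirely in DRAM
            # Local log files standing in for remote backup servers;
            # real RAMCloud replicates log entries to other machines.
            self.backups = [open(p, "a") for p in backup_paths]

        def write(self, key, value):
            # Durability first: append the update to every backup
            # log before acknowledging the write.
            record = json.dumps({"key": key, "value": value})
            for b in self.backups:
                b.write(record + "\n")
                b.flush()
            self.data[key] = value

        def read(self, key):
            # Reads never touch disk -- this is where the 100-1000x
            # latency win over disk-based systems comes from.
            return self.data[key]

    store = MiniRamStore(["backup1.log", "backup2.log"])
    store.write("user:42", "alice")
    print(store.read("user:42"))  # served entirely from memory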

Of course, there are a lot of research questions to be solved before that idea can be made to work. The original paper spends most of its time outlining, in broad strokes, the research directions that the team is pursuing:

  • Low latency RPC
  • Durability and availability
  • Data model
  • Distribution and scaling
  • Concurrency, transactions, and consistency
  • Multi-tenancy
  • Server-client functional distribution
  • Self-management

As with any important long-running operating systems research project, there are a lot of interesting ideas here, and it will take some time before the project is fully fleshed out.

One of the early research efforts has been to demonstrate that a core problem (what happens when one of the machines goes down) can be solved, as described in a recent project paper: Fast Crash Recovery in RAMCloud. The authors set out their availability target:

RAMCloud’s solution to the availability problem is fast crash recovery: the system reconstructs the entire contents of a lost server’s memory (64 GB or more) from disk and resumes full service in 1-2 seconds. We believe this is fast enough to be considered “continuous availability” for most applications.

and the remainder of the paper describes the algorithms they are using to achieve that goal.
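The hard part is bandwidth: no single machine can read 64 GB from its disks in a second or two. The paper's answer is to scatter each server's backup data across hundreds of disks throughout the cluster and read them all in parallel during recovery. A quick back-of-envelope calculation (using my own assumed hardware numbers, not figures from the paper) shows the scale of parallelism required:

    # Rough arithmetic for why recovery must be massively parallel.
    memory_to_recover_gb = 64    # lost server's DRAM contents
    target_seconds = 1.5         # "continuous availability" budget
    disk_bandwidth_mb_s = 100    # assumed sustained read rate per disk

    required_bandwidth_gb_s = memory_to_recover_gb / target_seconds
    disks_needed = required_bandwidth_gb_s * 1024 / disk_bandwidth_mb_s
    print(f"Aggregate read bandwidth needed: {required_bandwidth_gb_s:.1f} GB/s")
    print(f"Disks reading in parallel: ~{disks_needed:.0f}")

Hundreds of disks reading at once is only feasible because the lost server's backup data was already spread across the whole cluster before the crash.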

Good researchers are always looking into the future; the most interesting research, to me, is that which looks just far enough into the future to be enlightening, while not so far as to be "pie in the sky". Today's high-end systems are aggressively moving from disk-based storage to SSDs, so it's not at all unreasonable to consider an entirely memory-based approach, if the problems can be worked out.

I'll be following the work of this project with interest, and thought you might find it worth studying, too!