“What many people worry about is building these file systems to be reliable, both when they’re operating normally but also in the case of crashes, power failure, software bugs, hardware errors, what have you. Making sure that the file system can recover from a crash at any point is tricky because there are so many different places that you could crash. You literally have to consider every instruction or every disk operation and think, ‘Well, what if I crash now? What now? What now?’ And so empirically, people have found lots of bugs in file systems that have to do with crash recovery, and they keep finding them, even in very well tested file systems, because it’s just so hard to do.”
“In the course of writing the file system, they repeatedly went back and retooled the system specifications, and vice versa,” MIT reported. They rewrote the file system “probably 10 times,” Zeldovich said. “We’ve written file systems many times over, so we know exactly what it’s going to look like. Whereas with all these logics and proofs, there are so many ways to write them down, and each one of them has subtle implications down the line that we didn’t really understand.”
Frans Kaashoek, the Charles Piper Professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), estimated that the research team spent “90% of their time on the definitions of the system components and the relationships between them and on the proof.” He added, “No one had done it. It’s not like you could look up a paper that says, ‘This is the way to do it.’ But now you can read our paper and presumably do it a lot faster.”
Ulfar Erlingsson, lead manager for security research at Google, said, “It’s not like people haven’t proven things in the past. But usually the methods and technologies, the formalisms that were developed for creating the proofs, were so esoteric and so specific to the problem that there was basically hardly any chance that there would be repeat work that built up on it. But I can say for certain that Adam’s stuff with Coq, and separation logic, this is stuff that’s going to get built on and applied in many different domains. That’s what’s so exciting.”
Related research to read until your head explodes
Until the team’s research is published, getting a handle on this deep subject is tough. One place to start is “Specifying Crash Safety for Storage Systems” (pdf), which includes numerous links to GitHub. In a nutshell: “Specifying crash-safe storage systems is challenging.”